# Evaluate with Custom Metrics

> Learn how to run evaluations with your own custom, self-hosted metrics.

Galtea lets you define your own metrics and use them in evaluations. This is particularly useful for:

* **Deterministic Metrics**: When you have custom, rule-based logic to score outputs (e.g., checking for specific keywords, validating JSON structure).
* **Integrating External Models**: When you use your own models for evaluation and want to store their scores in Galtea.
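
As an illustration of a deterministic metric, the sketch below scores JSON structure validity using only the Python standard library. The function name `json_structure_score` and the required keys are hypothetical and not part of the Galtea SDK; the function simply returns a float that could later be submitted as a score.

```python
import json


def json_structure_score(text: str, required_keys: set) -> float:
    """Return the fraction of required keys present in a JSON object.

    Returns 0.0 if the text is not valid JSON, is not a JSON object,
    or if no required keys are given.
    """
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict) or not required_keys:
        return 0.0
    present = sum(1 for key in required_keys if key in parsed)
    return present / len(required_keys)


# Two of the three required keys are present in this output.
score = json_structure_score(
    '{"answer": "42", "sources": []}',
    {"answer", "sources", "confidence"},
)
```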

## Recommended Approach: MetricInput Dictionary

In SDK v3.0, the recommended way to specify metrics is the `MetricInput` dictionary format. For self-hosted metrics, there are two equally valid ways to provide the score:

### Option 1: Pre-Compute the Score

If you calculate the score yourself before creating the evaluation, pass it directly as a float in the `score` field:

```python theme={"system"}
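# Assumes `galtea` is an initialized Galtea SDK client and that
# `version_id` and `test_case_id` are already defined.

# Your product's response
actual_output = "This response contains the correct keyword."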


# Define your custom scoring logic
def contains_keyword(text: str, keyword: str) -> float:
    """Returns 1.0 if the keyword is in text (case-insensitive), 0.0 otherwise."""
    return 1.0 if keyword.lower() in text.lower() else 0.0


# Compute the score
custom_score = contains_keyword(actual_output, "correct")

# Create the metric if it doesn't exist yet
CUSTOM_METRIC_NAME = "contains-correct"
galtea.metrics.create(
    name=CUSTOM_METRIC_NAME,
    source="self_hosted",
    description='Checks for the presence of the keyword "correct" in the output',
)

# Run evaluation with your pre-computed score
session = galtea.sessions.create(version_id=version_id, test_case_id=test_case_id)
galtea.inference_results.create_and_evaluate(
    session_id=session.id,
    output=actual_output,
    metrics=[
        {"name": "Role Adherence"},  # Standard Galtea metric
        {
            "name": CUSTOM_METRIC_NAME,
            "score": custom_score,
        },  # Self-hosted with pre-computed score
    ],
)
```
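
When several rule-based checks apply to the same output, the pre-computed scores can be assembled into the `metrics` list in one step. The following is a minimal sketch: `build_metric_inputs` and the `non-empty-output` metric name are hypothetical, not SDK features, and each named metric is assumed to already exist on the platform as a self-hosted metric.

```python
from typing import Callable


def build_metric_inputs(output: str, scorers: dict) -> list:
    """Run each scorer on the output and return MetricInput-style dicts."""
    metric_inputs = []
    for metric_name, scorer in scorers.items():
        score = float(scorer(output))
        # Assumption: scores are expected to fall in the 0.0-1.0 range.
        if not 0.0 <= score <= 1.0:
            raise ValueError(f"Score for {metric_name!r} is out of range: {score}")
        metric_inputs.append({"name": metric_name, "score": score})
    return metric_inputs


# Hypothetical scorers keyed by metric name.
scorers: dict[str, Callable[[str], float]] = {
    "contains-correct": lambda text: 1.0 if "correct" in text.lower() else 0.0,
    "non-empty-output": lambda text: 1.0 if text.strip() else 0.0,
}
metrics = build_metric_inputs("This response contains the correct keyword.", scorers)
```

The resulting list has the same shape as the self-hosted entry shown above, so it could be passed as the `metrics` argument alongside any standard metrics.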

### Option 2: Use CustomScoreEvaluationMetric Class

To encapsulate your scoring logic in a class that the SDK executes at evaluation time, subclass `CustomScoreEvaluationMetric` and pass an instance as the `score` field of the `MetricInput` dictionary:

```python theme={"system"}
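# Assumes `CustomScoreEvaluationMetric` has been imported from the Galtea SDK
# and that `galtea`, `version_id`, and `test_case_id` are already defined.

# Define your custom metric class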
class ContainsKeyword(CustomScoreEvaluationMetric):
    def __init__(self, keyword: str):
        self.keyword = keyword.lower()
        # Initialize with the metric name or ID
        super().__init__(name=f"contains-{self.keyword}")

    def measure(self, *args, actual_output: str | None = None, **kwargs) -> float:
        """
        Returns 1.0 if the keyword is in actual_output (case-insensitive), 0.0 otherwise.
        All other args/kwargs are accepted but ignored.
        """
        if actual_output is None:
            return 0.0
        return 1.0 if self.keyword in actual_output.lower() else 0.0


# Instantiate your metric
accuracy_metric = ContainsKeyword(keyword="relevant")

# Create the metric in the platform if it doesn't exist yet
galtea.metrics.create(
    name=accuracy_metric.name,
    source="self_hosted",
    description='Checks for the presence of the keyword "relevant" in the output',
)

# Your product's response
actual_output_2 = "This response is relevant and helpful."

# Run evaluation with your custom metric class
# Important: Do NOT provide 'id' or 'name' in the dict when using CustomScoreEvaluationMetric
session_2 = galtea.sessions.create(version_id=version_id, test_case_id=test_case_id)
galtea.inference_results.create_and_evaluate(
    session_id=session_2.id,
    output=actual_output_2,
    metrics=[
        {"name": "Role Adherence"},  # Standard Galtea metric
        {"score": accuracy_metric},  # Self-hosted with dynamic scoring
    ],
)
```

<Note>
  When you pass a `CustomScoreEvaluationMetric` instance as the `score` field of a `MetricInput` dictionary, do NOT provide `id` or `name` in the dictionary itself. Specify the metric identifier when initializing the instance instead (e.g., `CustomScoreEvaluationMetric(name="my-metric")`).
</Note>
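
The rule in the note can be checked locally before submitting. This is a hypothetical pre-flight helper, not part of the SDK: it treats any `score` value that has a callable `measure` attribute as a metric object, which is an assumption made so the sketch runs without Galtea installed.

```python
def check_metric_input(metric_input: dict) -> None:
    """Raise ValueError if a dict combines a metric object with 'id' or 'name'."""
    score = metric_input.get("score")
    # Duck-typing assumption: metric objects expose a callable measure() method.
    is_metric_object = callable(getattr(score, "measure", None))
    if is_metric_object:
        conflicting = [key for key in ("id", "name") if key in metric_input]
        if conflicting:
            raise ValueError(
                f"Remove {conflicting} from the dict; "
                "the metric object already carries its identifier."
            )


class FakeMetric:
    """Stand-in for a CustomScoreEvaluationMetric subclass, used only in this sketch."""

    name = "contains-relevant"

    def measure(self, *args, **kwargs) -> float:
        return 1.0


check_metric_input({"name": "contains-correct", "score": 1.0})  # pre-computed float
check_metric_input({"score": FakeMetric()})  # identifier lives on the object
```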

## Choosing Between Options

Both approaches are equally valid, and neither is deprecated. Choose based on your preference:

* **Use Option 1 (Pre-Computed Score)** if:
  * You prefer a simpler, more declarative style
  * Your scoring logic is straightforward and doesn't require encapsulation
  * You want to separate score calculation from the evaluation submission

* **Use Option 2 (CustomScoreEvaluationMetric Class)** if:
  * You prefer object-oriented design
  * Your scoring logic is complex and benefits from encapsulation
  * You want the SDK to handle score calculation automatically
  * You need to reuse the same metric logic across multiple evaluations

## Multi-Turn Custom Metrics

When using `CustomScoreEvaluationMetric`, your `measure()` method always receives an `inference_results` parameter containing a list of `InferenceResult` objects. For session evaluations (`session_id=...`), the list includes all turns; for single inference result evaluations (`inference_result_id=...`), it contains one item. This enables conversation-level scoring such as consistency checks, cross-turn analysis, or aggregated metrics.

```python theme={"system"}
from galtea import InferenceResult


class AllOutputsContainKeyword(CustomScoreEvaluationMetric):
    """Scores the fraction of assistant responses in the conversation that contain a keyword (1.0 if all do)."""

    def __init__(self, keyword: str):
        self.keyword = keyword.lower()
        super().__init__(name=f"all-outputs-contain-{self.keyword}")

    def measure(
        self,
        *args,
        actual_output: str | None = None,
        inference_results: list[InferenceResult] | None = None,
        **kwargs,
    ) -> float:
        """
        Returns the fraction of turns whose output contains the keyword.
        Falls back to actual_output if inference_results is missing or empty.
        """
        if inference_results:
            outputs = [ir.actual_output for ir in inference_results if ir.actual_output]
            if not outputs:
                return 0.0
            matches = sum(1 for output in outputs if self.keyword in output.lower())
            return matches / len(outputs)
        # Fallback for when inference_results is not provided
        if not actual_output:
            return 0.0
        return 1.0 if self.keyword in actual_output.lower() else 0.0


# Use with session-based evaluation for multi-turn scoring
# galtea.evaluations.create(
#     session_id="your_session_id",
#     metrics=[{"score": AllOutputsContainKeyword(keyword="helpful")}],
# )
```

<Info>
  The `inference_results` list is always ordered chronologically: one item for single-turn evaluations (via `inference_result_id`), all turns for session evaluations. See [Evaluating Conversations](/sdk/tutorials/evaluating-conversations) for the full session-based workflow.
</Info>
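
Conversation-level logic can be exercised locally before wiring it into a metric class. The sketch below implements a simple cross-turn check: the fraction of turns whose output does not repeat an earlier turn, ignoring case and surrounding whitespace. `FakeInferenceResult` and `non_repetition_score` are hypothetical; the stand-in only mirrors the `actual_output` attribute and does not reproduce the real `InferenceResult`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeInferenceResult:
    """Hypothetical stand-in exposing only the attribute this sketch reads."""

    actual_output: Optional[str] = None


def non_repetition_score(inference_results: list) -> float:
    """Return the fraction of non-empty outputs that do not repeat an earlier turn."""
    seen = set()
    total = 0
    unique = 0
    for result in inference_results:  # assumed to be in chronological order
        output = result.actual_output
        if not output:
            continue  # skip turns without an output
        total += 1
        normalized = output.strip().lower()
        if normalized not in seen:
            unique += 1
            seen.add(normalized)
    return unique / total if total else 0.0


turns = [
    FakeInferenceResult("Hello, how can I help?"),
    FakeInferenceResult("Your order has shipped."),
    FakeInferenceResult("your order has shipped."),  # repeat of the previous turn
    FakeInferenceResult(None),
]
score = non_repetition_score(turns)
```

The same function could be called from a `measure()` method that receives `inference_results`, keeping the scoring logic testable without platform access.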

## Next Steps

<CardGroup cols={2}>
  <Card title="Create Your Own Judge Prompt" icon="wand-magic-sparkles" href="/sdk/tutorials/how-to-create-your-llm-as-a-judge-prompt">
    Write custom LLM-as-a-judge prompts for AI Evaluation metrics.
  </Card>

  <Card title="Evaluating Conversations" icon="comments" href="/sdk/tutorials/evaluating-conversations">
    Apply custom metrics to multi-turn conversation evaluations.
  </Card>
</CardGroup>
