Galtea allows you to define and use your own custom metrics for evaluation tasks. This is particularly useful for:
  • Deterministic Metrics: When you have custom, rule-based logic to score outputs (e.g., checking for specific keywords, validating JSON structure).
  • Integrating External Models: When you use your own models for evaluation and want to store their scores in Galtea.
The way to do this is by creating a class that inherits from CustomScoreEvaluationMetric.

Single-Turn Evaluation with Custom Metrics

For individual test cases or production logs, you can define your metric and pass an instance of it directly to the create_single_turn method.
import os
from typing import Optional
from galtea import Galtea, CustomScoreEvaluationMetric

galtea = Galtea(api_key=os.getenv("GALTEA_API_KEY"))

# --- Configuration ---
VERSION_ID = "your_version_id"
TEST_CASE_ID = "your_test_case_id"

# 1. Define your custom metric class
class ContainsKeyword(CustomScoreEvaluationMetric):
    def __init__(self, keyword: str):
        self.keyword = keyword.lower()
        super().__init__(
            name=f"contains-{self.keyword}"
        )
        
    def measure(self, *args, actual_output: str | None = None, **kwargs) -> float:
        """
        Returns 1.0 if the keyword is in actual_output (case-insensitive), 0.0 otherwise.
        All other args/kwargs are accepted but ignored.
        """
        if actual_output is None:
            return 0.0
        return 1.0 if self.keyword in actual_output.lower() else 0.0

# 2. Instantiate your metric
accuracy_metric = ContainsKeyword(keyword="correct")

# If the metric does not exist yet in the platform, you can create it like this:
galtea.metrics.create(name=accuracy_metric.name, description='checks for the presence of the keyword "correct" in the output')

# Your product's response
actual_output = "This response contains the correct keyword."

# 3. Run evaluation task with your custom metric
# You can mix it with standard metrics as well.
galtea.evaluation_tasks.create_single_turn(
    version_id=VERSION_ID,
    test_case_id=TEST_CASE_ID,
    actual_output=actual_output,
    metrics=[
        "Role Adherence",     # A standard metric
        accuracy_metric      # Your custom metric object
    ],
)

print("Evaluation task with custom metric submitted.")