
Returns

Returns a list of Evaluation objects, one for each metric provided.

Usage

This method is versatile and can be used for two main scenarios:
  1. Test-Based Evaluation: When you provide a test_case_id, Galtea evaluates your product’s performance against a predefined challenge.
  2. Production Monitoring: When you set is_production=True and provide an input, Galtea logs and evaluates real user interactions.
The recommended way to specify metrics in SDK v3.0 is the MetricInput dictionary format. For self-hosted metrics, you have two equally valid options.

Option 1: Pre-computed scores
from galtea import Galtea

galtea = Galtea(api_key="YOUR_API_KEY")

# Pre-compute your custom score
def check_accuracy(output: str) -> float:
    return 1.0 if "paris" in output.lower() else 0.0

actual_output = "The capital of France is Paris."
custom_score = check_accuracy(actual_output)

# Create the metric in the platform if it doesn't exist yet
galtea.metrics.create(
    name="my-accuracy",
    source="self_hosted",
    description="Checks if 'paris' appears in the output"
)

evaluations = galtea.evaluations.create_single_turn(
    version_id="YOUR_VERSION_ID",
    test_case_id="YOUR_TEST_CASE_ID",
    actual_output=actual_output,
    metrics=[
        {"name": "Role Adherence"},                    # Galtea-hosted metric
        {"name": "my-accuracy", "score": custom_score} # Self-hosted with pre-computed score
    ]
)
Option 2: CustomScoreEvaluationMetric for dynamic scoring
from galtea import Galtea, CustomScoreEvaluationMetric

galtea = Galtea(api_key="YOUR_API_KEY")

# Define scoring logic as a class
class MyAccuracy(CustomScoreEvaluationMetric):
    def __init__(self):
        super().__init__(name="my-accuracy")
        
    def measure(self, *args, actual_output: str | None = None, **kwargs) -> float:
        if actual_output is None:
            return 0.0
        return 1.0 if "paris" in actual_output.lower() else 0.0

# Create the metric in the platform if it doesn't exist yet
galtea.metrics.create(
    name="my-accuracy",
    source="self_hosted",
    description="Checks if 'paris' appears in the output"
)

evaluations = galtea.evaluations.create_single_turn(
    version_id="YOUR_VERSION_ID",
    test_case_id="YOUR_TEST_CASE_ID",
    actual_output="The capital of France is Paris.",
    metrics=[
        {"name": "Role Adherence"},         # Galtea-hosted metric
        {"score": MyAccuracy()}             # Self-hosted with dynamic scoring
        # Note: No 'name' or 'id' in dict - it comes from MyAccuracy(name="...")
    ]
)
Both options are equally valid for self-hosted metrics. Choose based on your preference: pre-compute for simplicity, or use CustomScoreEvaluationMetric for encapsulation and reusability.
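For example, the class-based approach makes the scoring logic reusable across calls. A minimal sketch reusing the MyAccuracy class and galtea client from Option 2 above; the test case IDs and the get_product_output helper are placeholders for illustration, not part of the SDK:
accuracy = MyAccuracy()  # one instance, reused for every call

for test_case_id in ["TEST_CASE_ID_1", "TEST_CASE_ID_2"]:  # placeholder IDs
    # get_product_output is a hypothetical helper that runs your own product
    actual_output = get_product_output(test_case_id)

    galtea.evaluations.create_single_turn(
        version_id="YOUR_VERSION_ID",
        test_case_id=test_case_id,
        actual_output=actual_output,
        metrics=[{"score": accuracy}],  # name comes from the instance, not the dict
    )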
The following format is maintained for backward compatibility only. New code should use the MetricInput dictionary format shown above.
from galtea import Galtea, CustomScoreEvaluationMetric

galtea = Galtea(api_key="YOUR_API_KEY")

# Define your custom metric
class MyAccuracy(CustomScoreEvaluationMetric):
    def __init__(self):
        super().__init__(name="my-accuracy")

    def measure(self, *args, actual_output: str | None = None, **kwargs) -> float:
        if actual_output is None:
            return 0.0
        return 1.0 if "paris" in actual_output.lower() else 0.0

custom_score_accuracy = MyAccuracy()

# Legacy format - passing directly without MetricInput dict wrapper
evaluations = galtea.evaluations.create_single_turn(
    version_id="YOUR_VERSION_ID",
    test_case_id="YOUR_TEST_CASE_ID",
    actual_output="The capital of France is Paris.",
    metrics=[
        "Role Adherence",       # Legacy: string format
        custom_score_accuracy   # Legacy: CustomScoreEvaluationMetric directly
    ]
)

Example: Production Monitoring

from galtea import Galtea

galtea = Galtea(api_key="YOUR_API_KEY")

# Using the recommended MetricInput dictionary format
evaluations = galtea.evaluations.create_single_turn(
    version_id="YOUR_VERSION_ID",
    is_production=True,
    input="A real user's query from production",
    actual_output="The model's live response.",
    metrics=[
        {"name": "Contextual Relevancy"}, 
        {"name": "Non-Toxic"}
    ],
    retrieval_context="Context retrieved by RAG for this query."
)

# Alternative: String format (legacy, still supported)
evaluations = galtea.evaluations.create_single_turn(
    version_id="YOUR_VERSION_ID",
    is_production=True,
    input="A real user's query from production",
    actual_output="The model's live response.",
    metrics=["Contextual Relevancy", "Non-Toxic"],  # Legacy string format
    retrieval_context="Context retrieved by RAG for this query."
)

Parameters

version_id
string
required
The ID of the version you want to evaluate.
metrics
List[Union[str, CustomScoreEvaluationMetric, Dict]]
required
A list of metrics to use for evaluation. Recommended: MetricInput dictionary format:
metrics=[
    {"name": "Role Adherence"},              # Galtea-hosted metric by name
    {"id": "metric_xyz"},                    # Galtea-hosted metric by ID
    {"name": "custom", "score": 0.95},       # Self-hosted with pre-computed score
    {"score": CustomScoreEvaluationMetric(name="custom")}  # Self-hosted with dynamic scoring
]
Also supported (legacy):
  • By name (string): metrics=["Role Adherence"]
  • By custom class (top-level): metrics=[MyCustomMetric()]
The MetricInput dictionary supports the following keys:
  • id (string, optional): The ID of an existing metric
  • name (string, optional): The name of the metric
  • score (float | CustomScoreEvaluationMetric, optional): For self-hosted metrics only
    • If float: Pre-computed score (0.0 to 1.0). Requires id or name in the dict.
    • If CustomScoreEvaluationMetric: Score will be calculated dynamically. The CustomScoreEvaluationMetric instance must be initialized with name or id. Do NOT provide id or name in the dict when using this option.
For self-hosted metrics, both score options are equally valid: pre-compute as a float, or use CustomScoreEvaluationMetric for dynamic calculation. Galtea-hosted metrics automatically compute scores and should not include a score field.
actual_output
string
required
The actual output produced by the product.
test_case_id
string
The ID of the test case to be evaluated. Required for non-production evaluations.
input
string
The input text/prompt. Required for production evaluations where no test_case_id is provided.
is_production
boolean
Set to True to indicate the evaluation is from a production environment. Defaults to False.
retrieval_context
string
The context retrieved by your RAG system that was used to generate the actual_output.
latency
float
Time in milliseconds from the request to the LLM until the response was received.
usage_info
dict[str, int]
Information about token usage during the model call. Possible keys include:
  • input_tokens: Number of input tokens sent to the model.
  • output_tokens: Number of output tokens generated by the model.
  • cache_read_input_tokens: Number of input tokens read from the cache.
cost_info
dict[str, float]
Information about the cost per token during the model call (latency, usage_info, and cost_info are shown together in the sketch after this parameter list). Possible keys include:
  • cost_per_input_token: Cost per input token sent to the model.
  • cost_per_output_token: Cost per output token generated by the model.
  • cost_per_cache_read_input_token: Cost per input token read from the cache.
conversation_simulator_version
string
The version of Galtea’s conversation simulator used to generate the user message (input). This should only be provided if the input was generated using the conversation_simulator_service.
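
The optional latency, usage_info, and cost_info parameters can be attached to either a test-based or a production evaluation. A minimal sketch building on the production monitoring example above; the numeric values are illustrative placeholders, not defaults:
from galtea import Galtea

galtea = Galtea(api_key="YOUR_API_KEY")

evaluations = galtea.evaluations.create_single_turn(
    version_id="YOUR_VERSION_ID",
    is_production=True,
    input="A real user's query from production",
    actual_output="The model's live response.",
    metrics=[{"name": "Non-Toxic"}],
    retrieval_context="Context retrieved by RAG for this query.",
    latency=412.5,  # milliseconds from request to response (example value)
    usage_info={
        "input_tokens": 850,
        "output_tokens": 120,
        "cache_read_input_tokens": 0,
    },
    cost_info={
        "cost_per_input_token": 0.000001,
        "cost_per_output_token": 0.000002,
        "cost_per_cache_read_input_token": 0.0000005,
    },
)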