
Returns

Returns a list of Evaluation objects, one for each metric provided.

Usage

This method is versatile and can be used for two main scenarios:
  1. Test-Based Evaluation: When you provide a test_case_id, Galtea evaluates your product’s performance against a predefined challenge.
  2. Production Monitoring: When you set is_production=True and provide an input, Galtea logs and evaluates real user interactions.
The recommended way to specify metrics in SDK v3.0 is the MetricInput dictionary format. For self-hosted metrics, you have two equally valid options.

Option 1: Pre-computed scores
from galtea import Galtea

galtea = Galtea(api_key="YOUR_API_KEY")

# Pre-compute your custom score
def check_accuracy(output: str) -> float:
    return 1.0 if "paris" in output.lower() else 0.0

actual_output = "The capital of France is Paris."
custom_score = check_accuracy(actual_output)

# Create the metric in the platform if it doesn't exist yet
galtea.metrics.create(
    name="my-accuracy",
    source="self_hosted",
    description="Checks if 'paris' appears in the output"
)

evaluations = galtea.evaluations.create_single_turn(
    version_id="YOUR_VERSION_ID",
    test_case_id="YOUR_TEST_CASE_ID",
    actual_output=actual_output,
    metrics=[
        {"name": "Role Adherence"},                    # Galtea-hosted metric
        {"name": "my-accuracy", "score": custom_score} # Self-hosted with pre-computed score
    ]
)
Option 2: CustomScoreEvaluationMetric for dynamic scoring
from galtea import Galtea, CustomScoreEvaluationMetric

galtea = Galtea(api_key="YOUR_API_KEY")

# Define scoring logic as a class
class MyAccuracy(CustomScoreEvaluationMetric):
    def __init__(self):
        super().__init__(name="my-accuracy")
        
    def measure(self, *args, actual_output: str | None = None, **kwargs) -> float:
        if actual_output is None:
            return 0.0
        return 1.0 if "paris" in actual_output.lower() else 0.0

# Create the metric in the platform if it doesn't exist yet
galtea.metrics.create(
    name="my-accuracy",
    source="self_hosted",
    description="Checks if 'paris' appears in the output"
)

evaluations = galtea.evaluations.create_single_turn(
    version_id="YOUR_VERSION_ID",
    test_case_id="YOUR_TEST_CASE_ID",
    actual_output="The capital of France is Paris.",
    metrics=[
        {"name": "Role Adherence"},         # Galtea-hosted metric
        {"score": MyAccuracy()}             # Self-hosted with dynamic scoring
        # Note: No 'name' or 'id' in dict - it comes from MyAccuracy(name="...")
    ]
)
Both options are equally valid for self-hosted metrics. Choose based on your preference: pre-compute for simplicity, or use CustomScoreEvaluationMetric for encapsulation and reusability.
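For example, the class-based approach makes the scoring logic reusable across calls. A minimal sketch reusing the MyAccuracy class and galtea client from Option 2 above; the test case IDs and the get_product_output helper are placeholders for illustration, not part of the SDK:
accuracy = MyAccuracy()  # one instance, reused for every call

for test_case_id in ["TEST_CASE_ID_1", "TEST_CASE_ID_2"]:  # placeholder IDs
    # get_product_output is a hypothetical helper that runs your own product
    actual_output = get_product_output(test_case_id)

    galtea.evaluations.create_single_turn(
        version_id="YOUR_VERSION_ID",
        test_case_id=test_case_id,
        actual_output=actual_output,
        metrics=[{"score": accuracy}],  # name comes from the instance, not the dict
    )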
The following format is maintained for backward compatibility only. New code should use the MetricInput dictionary format shown above.
from galtea import Galtea, CustomScoreEvaluationMetric

galtea = Galtea(api_key="YOUR_API_KEY")

# Define your custom metric
class MyAccuracy(CustomScoreEvaluationMetric):
    def __init__(self):
        super().__init__(name="my-accuracy")

    def measure(self, *args, actual_output: str | None = None, **kwargs) -> float:
        if actual_output is None:
            return 0.0
        return 1.0 if "paris" in actual_output.lower() else 0.0

custom_score_accuracy = MyAccuracy()

# Legacy format - passing directly without MetricInput dict wrapper
evaluations = galtea.evaluations.create_single_turn(
    version_id="YOUR_VERSION_ID",
    test_case_id="YOUR_TEST_CASE_ID",
    actual_output="The capital of France is Paris.",
    metrics=[
        "Role Adherence",       # Legacy: string format
        custom_score_accuracy   # Legacy: CustomScoreEvaluationMetric directly
    ]
)

Example: Production Monitoring

from galtea import Galtea

galtea = Galtea(api_key="YOUR_API_KEY")

# Using the recommended MetricInput dictionary format
evaluations = galtea.evaluations.create_single_turn(
    version_id="YOUR_VERSION_ID",
    is_production=True,
    input="A real user's query from production",
    actual_output="The model's live response.",
    metrics=[
        {"name": "Contextual Relevancy"}, 
        {"name": "Non-Toxic"}
    ],
    retrieval_context="Context retrieved by RAG for this query."
)

# Alternative: String format (legacy, still supported)
evaluations = galtea.evaluations.create_single_turn(
    version_id="YOUR_VERSION_ID",
    is_production=True,
    input="A real user's query from production",
    actual_output="The model's live response.",
    metrics=["Contextual Relevancy", "Non-Toxic"],  # Legacy string format
    retrieval_context="Context retrieved by RAG for this query."
)

Parameters

version_id
string
required
The ID of the version you want to evaluate.
metrics
List[Union[str, CustomScoreEvaluationMetric, Dict]]
required
A list of metrics to use for evaluation. Recommended: MetricInput dictionary format:
metrics=[
    {"name": "Role Adherence"},              # Galtea-hosted metric by name
    {"id": "metric_xyz"},                    # Galtea-hosted metric by ID
    {"name": "custom", "score": 0.95},       # Self-hosted with pre-computed score
    {"score": CustomScoreEvaluationMetric(name="custom")}  # Self-hosted with dynamic scoring
]
Also supported (legacy):
  • By name (string): metrics=["Role Adherence"]
  • By custom class (top-level): metrics=[MyCustomMetric()]
The MetricInput dictionary supports the following keys:
  • id (string, optional): The ID of an existing metric
  • name (string, optional): The name of the metric
  • score (float | CustomScoreEvaluationMetric, optional): For self-hosted metrics only
    • If float: Pre-computed score (0.0 to 1.0). Requires id or name in the dict.
    • If CustomScoreEvaluationMetric: Score will be calculated dynamically. The CustomScoreEvaluationMetric instance must be initialized with name or id. Do NOT provide id or name in the dict when using this option.
For self-hosted metrics, both score options are equally valid: pre-compute as a float, or use CustomScoreEvaluationMetric for dynamic calculation. Galtea-hosted metrics automatically compute scores and should not include a score field.
actual_output
string
required
The actual output produced by the product.
test_case_id
string
The ID of the test case to be evaluated. Required for non-production evaluations.
input
string
The input text/prompt. Required for production evaluations where no test_case_id is provided.
is_production
boolean
Set to True to indicate the evaluation is from a production environment. Defaults to False.
retrieval_context
string
The context retrieved by your RAG system that was used to generate the actual_output.
latency
float
Time in milliseconds from the request to the LLM until the response was received.
usage_info
dict[str, int]
Information about token usage during the model call. Possible keys include:
  • input_tokens: Number of input tokens sent to the model.
  • output_tokens: Number of output tokens generated by the model.
  • cache_read_input_tokens: Number of input tokens read from the cache.
cost_info
dict[str, float]
Information about the cost per token during the model call (latency, usage_info, and cost_info are shown together in the sketch after this parameter list). Possible keys include:
  • cost_per_input_token: Cost per input token sent to the model.
  • cost_per_output_token: Cost per output token generated by the model.
  • cost_per_cache_read_input_token: Cost per input token read from the cache.
conversation_simulator_version
string
The version of Galtea’s conversation simulator used to generate the user message (input). This should only be provided if the input was generated using the conversation_simulator_service.
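
The optional latency, usage_info, and cost_info parameters can be attached to either a test-based or a production evaluation. A minimal sketch building on the production monitoring example above; the numeric values are illustrative placeholders, not defaults:
from galtea import Galtea

galtea = Galtea(api_key="YOUR_API_KEY")

evaluations = galtea.evaluations.create_single_turn(
    version_id="YOUR_VERSION_ID",
    is_production=True,
    input="A real user's query from production",
    actual_output="The model's live response.",
    metrics=[{"name": "Non-Toxic"}],
    retrieval_context="Context retrieved by RAG for this query.",
    latency=412.5,  # milliseconds from request to response (example value)
    usage_info={
        "input_tokens": 850,
        "output_tokens": 120,
        "cache_read_input_tokens": 0,
    },
    cost_info={
        "cost_per_input_token": 0.000001,
        "cost_per_output_token": 0.000002,
        "cost_per_cache_read_input_token": 0.0000005,
    },
)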