Returns

Returns a list of EvaluationTask objects, one for each metric provided.

Usage

This method supports two main scenarios:
  1. Test-Based Evaluation: When you provide a test_case_id, Galtea evaluates your product’s performance against a predefined challenge.
  2. Production Monitoring: When you set is_production=True and provide an input, Galtea logs and evaluates real user interactions.

Example: Test-Based Evaluation with Standard and Custom Metrics

from galtea import Galtea, CustomScoreEvaluationMetric

# Initialize the client (used by the galtea.* calls below)
galtea = Galtea(api_key="YOUR_API_KEY")

# Define your custom metric
class MyAccuracy(CustomScoreEvaluationMetric):
    def __init__(self):
        super().__init__(name="my-accuracy")
    def measure(self, *args, actual_output: str | None = None, **kwargs) -> float:
        """
        Returns 1.0 if 'paris' is in actual_output (case-insensitive), else 0.0.
        All other args/kwargs are accepted but ignored.
        """
        if actual_output is None:
            return 0.0
        return 1.0 if "paris" in actual_output.lower() else 0.0

custom_score_accuracy = MyAccuracy()
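
# Quick local sanity check of the custom metric (illustrative):
assert custom_score_accuracy.measure(actual_output="The capital of France is Paris.") == 1.0
assert custom_score_accuracy.measure(actual_output="The capital of Spain is Madrid.") == 0.0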

# If the metric does not yet exist on the platform, create it first:
galtea.metrics.create(
    name=custom_score_accuracy.name,
    description='Checks for the presence of the keyword "paris" in the output.',
)

# Create evaluation tasks using a mix of standard and custom metrics
evaluation_tasks = galtea.evaluation_tasks.create_single_turn(
    version_id="YOUR_VERSION_ID",
    test_case_id="YOUR_TEST_CASE_ID",
    actual_output="The capital of France is Paris.",
    metrics=[
        "coherence",            # Standard metric
        custom_score_accuracy   # Your custom metric object
    ]
)
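
Since one EvaluationTask is returned per metric, the call above yields two tasks. The attributes available on each task depend on your SDK version; the id attribute used below is an assumption, so check the EvaluationTask reference:

for task in evaluation_tasks:
    print(task.id)  # assumed attribute; consult the EvaluationTask reference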

Example: Production Monitoring

evaluation_tasks = galtea.evaluation_tasks.create_single_turn(
    version_id="YOUR_VERSION_ID",
    is_production=True,
    input="A real user's query from production",
    actual_output="The model's live response.",
    metrics=["relevance", "non-toxic"],
    retrieval_context="Context retrieved by RAG for this query."
)

Parameters

version_id
string
required
The ID of the version you want to evaluate.
metrics
List[string | CustomScoreEvaluationMetric]
required
A list of metrics to use for evaluation. You can provide:
  • Standard metrics as strings (e.g., “coherence”).
  • Custom, locally-scored metrics as objects inheriting from CustomScoreEvaluationMetric.
actual_output
string
required
The actual output produced by the product.
test_case_id
string
The ID of the test case to be evaluated. Required for non-production evaluations.
input
string
The input text/prompt. Required for production evaluations where no test_case_id is provided.
is_production
boolean
Set to True to indicate the task is from a production environment. Defaults to False.
retrieval_context
string
The context retrieved by your RAG system that was used to generate the actual_output.
latency
float
Time in milliseconds from the request to the LLM until the response was received.
usage_info
dict[str, float]
Token usage information (e.g., {"input_tokens": 10, "output_tokens": 5}).
cost_info
dict[str, float]
Cost information for the LLM call.
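
Example: Attaching Latency, Usage, and Cost Data

The observability parameters above are not shown in the earlier examples, so here is a minimal sketch of passing them together with a test-based evaluation. The usage_info keys follow the example above; the cost_info key is hypothetical, since the exact fields depend on what your setup tracks:

evaluation_tasks = galtea.evaluation_tasks.create_single_turn(
    version_id="YOUR_VERSION_ID",
    test_case_id="YOUR_TEST_CASE_ID",
    actual_output="The capital of France is Paris.",
    metrics=["coherence"],
    latency=235.0,  # milliseconds, as documented above
    usage_info={"input_tokens": 10, "output_tokens": 5},
    cost_info={"total_cost": 0.00042},  # hypothetical key; use the fields your setup tracks
)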