Returns

Returns a list of EvaluationTask objects, one for each metric provided.

Usage

This method can be used for two main scenarios:

  1. Test-Based Evaluation: When you provide a test_case_id, Galtea evaluates your product’s performance against a predefined challenge.
  2. Production Monitoring: When you set is_production=True and provide an input, Galtea logs and evaluates real user interactions.

Example: Test-Based Evaluation

evaluation_tasks = galtea.evaluation_tasks.create_single_turn(
    version_id="YOUR_VERSION_ID",
    test_case_id="YOUR_TEST_CASE_ID",
    metrics=["accuracy_v1", "coherence-v1"],
    actual_output="This is the model's response to the test case."
)
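
Because two metrics were requested, the call above returns a list of two EvaluationTask objects, one per metric. A minimal sketch of handling the result (the objects' attributes are not shown here):

print(len(evaluation_tasks))  # 2: one EvaluationTask per metric requested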

Example: Production Monitoring

evaluation_tasks = galtea.evaluation_tasks.create_single_turn(
    version_id="YOUR_VERSION_ID",
    is_production=True,
    input="A real user's query from production",
    actual_output="The model's live response.",
    metrics=["relevance", "non-toxic"],
    retrieval_context="Context retrieved by RAG for this query."
)

Parameters

version_id
string
required

The ID of the version you want to evaluate.

metrics
list[string]
required

A list of metric type names to use for the evaluation. A separate task will be created for each metric.

actual_output
string
required

The actual output produced by the product.

test_case_id
string

The ID of the test case to evaluate against. Required for non-production evaluations.

input
string

The input text/prompt. Required for production evaluations where no test_case_id is provided.

is_production
boolean

Set to True to indicate the task is from a production environment. Defaults to False.

scores
list[float | None]

A list of pre-computed scores corresponding to the metrics list. Use None for metrics that Galtea should evaluate.
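
For example, a pre-computed score can be supplied for one metric while Galtea evaluates the rest. A hedged sketch reusing the placeholder IDs from the examples above; the scores list is aligned positionally with metrics:

evaluation_tasks = galtea.evaluation_tasks.create_single_turn(
    version_id="YOUR_VERSION_ID",
    test_case_id="YOUR_TEST_CASE_ID",
    metrics=["accuracy-v1", "coherence-v1"],
    scores=[0.85, None],  # 0.85 is supplied for accuracy-v1; Galtea evaluates coherence-v1
    actual_output="This is the model's response to the test case."
)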

retrieval_context
string

The context retrieved by your RAG system that was used to generate the actual_output.

latency
float

Time in milliseconds from when the request was sent to the LLM until the response was received.

usage_info
dict[str, float]

Token usage information (e.g., {"input_tokens": 10, "output_tokens": 5}).

cost_info
dict[str, float]

Cost information for the LLM call.
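
To attach operational metadata to a production evaluation, the three parameters above can be passed together. A hedged sketch; the cost_info keys shown are illustrative assumptions rather than a documented schema:

evaluation_tasks = galtea.evaluation_tasks.create_single_turn(
    version_id="YOUR_VERSION_ID",
    is_production=True,
    input="A real user's query from production",
    actual_output="The model's live response.",
    metrics=["relevance"],
    latency=235.0,  # milliseconds from request to response
    usage_info={"input_tokens": 10, "output_tokens": 5},
    cost_info={"input_cost": 0.00001, "output_cost": 0.00002}  # illustrative keys only
)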