What is an Evaluation Task?

An evaluation task in Galtea represents the assessment of an evaluation from a session against the evaluation criteria of a single metric type. Because an evaluation can be assessed against several metrics, multiple evaluation tasks can exist for each evaluation.
Evaluation tasks don’t perform inference on the LLM product themselves. Rather, they evaluate outputs that have already been generated. You should perform inference on your product first, then trigger the evaluation task.
Evaluation tasks can only be created programmatically using the Galtea SDK, but they can be viewed and managed on the Galtea dashboard.
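
For example, a minimal creation flow might look like the sketch below. It assumes a client constructed as `Galtea(api_key=...)` and an `evaluation_tasks.create` method; these names, like the parameter names taken from the property list further down, should be checked against the SDK reference.

```python
from galtea import Galtea  # assumed import path; check the SDK reference

# Hypothetical client setup and service/method names.
galtea = Galtea(api_key="YOUR_API_KEY")


def run_my_product(query: str) -> str:
    """Placeholder for your own product's inference call."""
    return "The iPhone 16 costs $950."


# 1. Perform inference on your product first.
actual_output = run_my_product("How much does the iPhone 16 cost?")

# 2. Then trigger the evaluation task against the generated output.
task = galtea.evaluation_tasks.create(
    version_id="YOUR_VERSION_ID",
    session_id="YOUR_SESSION_ID",
    metrics=["factual-accuracy"],  # illustrative metric name
    actual_output=actual_output,
)
```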

Task Lifecycle

Evaluation tasks follow a specific lifecycle (a polling sketch follows the steps):
  1. Creation: Trigger an evaluation task. It will appear on the Evaluation’s details page with the status “pending”.
  2. Processing: Galtea’s evaluation system processes the task using the evaluation criteria of the selected metric type.
  3. Completion: Once processed, the status changes to “success” (or “failed” if an error occurred) and the results become available.
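
In code, this lifecycle can be followed by polling the task until it leaves the pending state. The `evaluation_tasks.get` call and the lowercase status strings below are assumptions; confirm the exact names and values in the SDK reference.

```python
import time

# `galtea` and `task` come from the creation sketch above.
while True:
    task = galtea.evaluation_tasks.get(task_id=task.id)  # hypothetical retrieval call
    if task.status != "pending":  # left the pending state: "success" or "failed"
        break
    time.sleep(2)  # wait a moment before polling again
```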

SDK Integration

The Galtea SDK allows you to create, view, and manage evaluation tasks programmatically. This is particularly useful for organizations that want to automate their evaluation process or integrate it into their CI/CD pipeline.
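
As a sketch of such an integration, a CI job could fail the build when an evaluation falls below a quality bar. The test below is illustrative only: the environment variables, threshold, and SDK calls are assumptions, not a documented recipe.

```python
# test_quality_gate.py -- illustrative CI gate, not an official Galtea recipe.
import os

from galtea import Galtea  # assumed import path; check the SDK reference

SCORE_THRESHOLD = 0.7  # assumed quality bar, tune to your product


def test_evaluation_meets_threshold():
    galtea = Galtea(api_key=os.environ["GALTEA_API_KEY"])
    # Hypothetical call: fetch a task created earlier in the pipeline.
    task = galtea.evaluation_tasks.get(task_id=os.environ["EVALUATION_TASK_ID"])
    assert task.status == "success", task.error
    assert task.score >= SCORE_THRESHOLD, task.reason
```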

Evaluation Task Properties

To trigger an evaluation task in Galtea, you provide the following information (a usage sketch follows the list):
  • Version ID (string, required): The ID of the version you want to evaluate.
  • Session ID (string, required): The ID of the session containing the inference results to be evaluated.
  • Metrics (list[string | CustomScoreEvaluationMetric], required): A list specifying the metrics to use for the evaluation. Each element should be a string or a CustomScoreEvaluationMetric object.
  • Actual Output (text, required): The actual output produced by the product’s version. Example: “The iPhone 16 costs $950.”
  • Test Case ID (string): The ID of the test case to be evaluated. Required for non-production, single-turn evaluations.
  • Input (text): The user query that your model needs to answer. Required for production, single-turn evaluations where no test_case_id is provided.
  • Retrieval Context (text): The context retrieved by your RAG system and used to generate the actual output. Including retrieval context enables more comprehensive evaluation of RAG systems.
  • Latency (float): Time elapsed, in milliseconds, from the moment the request was sent to the LLM to the moment the response was received.
  • Usage Info (object): Token count information for the LLM call. Use snake_case keys: input_tokens, output_tokens, cache_read_input_tokens.
  • Cost Info (object): The costs associated with the LLM call. Keys may include cost_per_input_token, cost_per_output_token, etc. If cost information is properly configured in the Model selected by the Version, the system will calculate the cost automatically; provided values override the system’s calculation.
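
Putting these together, a full creation call might look like the sketch below. The snake_case parameter names mirror the list above, but the method signature itself is an assumption to verify against the SDK reference.

```python
from galtea import Galtea  # assumed import path; check the SDK reference

galtea = Galtea(api_key="YOUR_API_KEY")  # hypothetical client setup

# Hypothetical full create call, including the optional RAG, latency, and cost fields.
task = galtea.evaluation_tasks.create(
    version_id="YOUR_VERSION_ID",
    session_id="YOUR_SESSION_ID",
    metrics=["answer-relevancy"],  # illustrative metric name
    actual_output="The iPhone 16 costs $950.",
    retrieval_context="Product catalog: iPhone 16: $950, iPhone 16 Pro: $1,199.",
    latency=235.0,  # milliseconds from request sent to response received
    usage_info={
        "input_tokens": 812,
        "output_tokens": 64,
        "cache_read_input_tokens": 0,
    },
    cost_info={
        "cost_per_input_token": 0.0000025,
        "cost_per_output_token": 0.00001,
    },
)
```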

Evaluation Task Results Properties

Once an evaluation task is created, you can access the following information (a result-handling sketch follows the list):
  • Status (enum): The current status of the evaluation task. Possible values:
      • Pending: The task has been created but not yet processed.
      • Success: The task was processed successfully.
      • Failed: The task encountered an error during processing.
  • Score (number): The score assigned to the output by the metric type’s evaluation criteria. Example: 0.85
  • Reason (text): The explanation of the score assigned to the output by the metric type’s evaluation criteria.
  • Error (text): The error message if the task failed during processing.
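
As an illustration, a finished task’s results might be read as follows; the attribute names mirror the list above and, like the status strings, are assumptions to confirm against the SDK reference.

```python
# `task` is a finished evaluation task, e.g. retrieved as in the polling sketch above.
if task.status == "success":
    print(f"Score: {task.score}")    # e.g. 0.85
    print(f"Reason: {task.reason}")  # explanation from the metric's evaluation criteria
elif task.status == "failed":
    print(f"Error: {task.error}")    # error message raised during processing
```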