Evaluation Task
A task that evaluates a group of inference results using a metric type
What is an Evaluation Task?
An evaluation task in Galtea represents the assessment of an evaluation from a session using the evaluation criteria of a metric type. Multiple evaluation tasks can therefore exist for each evaluation.
Evaluation tasks can only be created programmatically using the Galtea SDK, but they can be viewed and managed on the Galtea dashboard.
Task Lifecycle
Evaluation tasks follow a specific lifecycle:
1. Creation: The task is created programmatically via the SDK and starts in the Pending status.
2. Processing: The task is evaluated against the selected metric type’s evaluation criteria.
3. Completion: The task finishes in the Success status with a score and explanation, or in the Failed status with an error message.
Related Concepts
Test
A set of test cases for evaluating product performance
Test Case
Each challenge in a test for evaluating product performance
Evaluation
A group of evaluable Inference Results from a particular session
Metric Type
Ways to evaluate and score product performance
SDK Integration
The Galtea SDK allows you to create, view, and manage evaluation tasks programmatically. This is particularly useful for organizations that want to automate their evaluation process or integrate it into their CI/CD pipeline.
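For instance, viewing the tasks attached to a session from code might look like the minimal sketch below. The import path, client constructor, and the `evaluation_tasks.list` call are assumptions for illustration only; the Evaluation Task Service SDK page linked below documents the actual interface.

```python
from galtea import Galtea  # assumed import path for the Galtea SDK

# Hypothetical client setup; the real constructor may take different arguments.
galtea = Galtea(api_key="YOUR_GALTEA_API_KEY")

# Assumed service and method names: fetch the evaluation tasks created for a session.
tasks = galtea.evaluation_tasks.list(session_id="session-456")
for task in tasks:
    print(task.id, task.status)
```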
Evaluation Task Service SDK
Manage evaluation tasks using the Python SDK
GitHub Actions
Learn how to set up GitHub Actions to automatically evaluate new versions
Evaluation Task Properties
To trigger an evaluation task in Galtea, you provide the following information (illustrated in the sketch after this list):
- The ID of the version you want to evaluate.
- The ID of the session containing the inference results to be evaluated.
- A list of metric type names to use for the evaluation. A separate task will be created for each metric.
- The actual output produced by the product’s version. Example: “The iPhone 16 costs $950.”
- The test case to be evaluated. Required for non-production, single-turn evaluations.
- The user query that your model needs to answer. Required for production, single-turn evaluations where no test_case_id is provided.
- The context retrieved by your RAG system that was used to generate the actual output. Including retrieval context enables more comprehensive evaluation of RAG systems.
- The time elapsed (in ms) from the moment the request was sent to the LLM to the moment the response was received.
- Token count information of the LLM call. Use snake_case keys: input_tokens, output_tokens, cache_read_input_tokens.
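Put together, triggering evaluation tasks with these properties could look roughly like the following sketch. The `create` method and its parameter names simply mirror the list above and are assumptions, not the confirmed SDK signature; check the Evaluation Task Service SDK documentation for the exact call.

```python
from galtea import Galtea  # assumed import path for the Galtea SDK

galtea = Galtea(api_key="YOUR_GALTEA_API_KEY")  # hypothetical client setup

# Assumed method and parameter names, mirroring the properties listed above.
tasks = galtea.evaluation_tasks.create(
    version_id="version-123",           # ID of the version to evaluate
    session_id="session-456",           # session containing the inference results
    metrics=["accuracy", "coherence"],  # one evaluation task is created per metric type
    actual_output="The iPhone 16 costs $950.",
    test_case_id="test-case-789",       # for production, single-turn evaluations pass input=... instead
    retrieval_context="Catalog: iPhone 16 - $950.",
    latency=1200,                       # ms from request sent to response received
    usage_info={                        # token counts, using snake_case keys
        "input_tokens": 250,
        "output_tokens": 45,
        "cache_read_input_tokens": 0,
    },
)
```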
Evaluation Task Results Properties
Once an evaluation task is created, you can access the following information (see the sketch after this list):
- The current status of the evaluation task. Possible values:
  - Pending: The task has been created but not yet processed.
  - Success: The task was processed successfully.
  - Failed: The task encountered an error during processing.
- The score assigned to the output by the metric type’s evaluation criteria. Example: 0.85
- The explanation of the score assigned to the output by the metric type’s evaluation criteria.
- The error message if the task failed during processing.
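As a rough sketch of reading these results back, the code below assumes attribute names (`status`, `score`, `reason`, `error`) chosen to mirror the properties above; the real SDK objects may expose them differently.

```python
# `tasks` is the list returned by the creation sketch above.
# Assumed attribute names; consult the SDK documentation for the real fields.
for task in tasks:
    if task.status == "Success":
        print(f"Score: {task.score}")    # e.g. 0.85
        print(f"Reason: {task.reason}")  # explanation of the assigned score
    elif task.status == "Failed":
        print(f"Error: {task.error}")    # error message from processing
    else:
        print("Task is still pending.")
```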