Evaluation Task
The assessment of a specific test case using a metric type
What is an Evaluation Task?
An evaluation task in Galtea represents the assessment of a single test case from a test using the evaluation criteria of a metric type. Multiple evaluation tasks make up an evaluation.
Evaluation tasks can only be created programmatically using the Galtea SDK, but they can be viewed and managed on the Galtea dashboard.
Task Lifecycle
Evaluation tasks follow a specific lifecycle:
- Inference
- Creation
- Processing
- Completion
Related Concepts
Test
A set of test cases for evaluating product performance
Test Case
Each challenge in a test for evaluating product performance
Evaluation
A link between a product version and a test that groups evaluation tasks
Metric Type
Ways to evaluate and score product performance
SDK Integration
The Galtea SDK allows you to create, view, and manage evaluation tasks programmatically. This is particularly useful for organizations that want to automate their evaluation process or integrate it into their CI/CD pipeline.
Evaluation Task Service SDK
Manage evaluation tasks using the Python SDK
GitHub Actions
Learn how to set up GitHub Actions to automatically evaluate new versions
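For orientation, here is a minimal sketch of creating an evaluation task with the Python SDK. The client entry point (Galtea) and the parameter names (evaluation_id, test_case_id, metric_type, actual_output, retrieval_context, latency) mirror the properties described below but are assumptions; check the Evaluation Task Service SDK reference for the exact signature.

```python
from galtea import Galtea  # assumed import path for the SDK client

# Authenticate against the Galtea platform.
galtea = Galtea(api_key="YOUR_API_KEY")

# Create a single evaluation task. Parameter names are illustrative and
# mirror the Evaluation Task Properties section below.
task = galtea.evaluation_tasks.create(
    evaluation_id="your-evaluation-id",
    test_case_id="your-test-case-id",
    metric_type="factual-accuracy",            # name of the metric type to apply
    actual_output="The iPhone 16 costs $950.",
    retrieval_context="Pricing page: iPhone 16, $950.",
    latency=420,                               # milliseconds, request to response
)
```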
Evaluation Tasks from Production
For evaluation tasks created from production data, you can use the SDK method create_from_production. It allows you to supply the conversation history and other data directly, without needing a test_case_id.
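A sketch of such a call is shown below. Only the method name create_from_production comes from this page; the parameter names are assumptions that mirror the properties described in the next section.

```python
from galtea import Galtea  # assumed import path for the SDK client

galtea = Galtea(api_key="YOUR_API_KEY")

# Evaluate production traffic without a test case. Only create_from_production
# itself is documented here; the parameter names are illustrative.
task = galtea.evaluation_tasks.create_from_production(
    evaluation_id="your-evaluation-id",
    metric_type="factual-accuracy",
    actual_output="The iPhone 16 costs $950.",
    conversation_turns=[
        {"input": "Hi, do you sell phones?",
         "actual_output": "Yes, we carry the latest iPhone and Android models."},
    ],
    latency=380,  # milliseconds
)
```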
Evaluation Task Properties
To trigger an evaluation task in Galtea, you can provide the following information:
The evaluation to which this task belongs.
The test case to be evaluated.
The metric type used for the evaluation task. This defines the evaluation criteria and scoring method.
The actual output produced by the product’s version. Example: “The iPhone 16 costs $950.”
The context retrieved by your RAG system that was used to generate the actual output.
Including retrieval context enables more comprehensive evaluation of RAG systems, allowing for assessment of both retrieval and generation capabilities.
Time elapsed (in ms) from the moment the request was sent to the LLM to the moment the response was received.
Token count information of the LLM call. Use snake_case keys: input_tokens, output_tokens, cache_read_input_tokens (see the sketch after this property list).
Token counts are calculated based on the LLM provider’s tokenization method and are usually returned in the response of the endpoint.
The costs associated with the LLM call. May include:
- cost_per_input_token: Cost for each input token read by the LLM.
- cost_per_output_token: Cost for each output token generated by the LLM.
- cost_per_cache_read_input_token: Cost for each input token read from cache.
Costs are calculated based on the pricing model of the LLM provider.
The user query that your model needs to answer. Deprecated; it is now part of the test case.
Expected output for the evaluation task. Deprecated; it is now part of the test case.
Additional data provided to your LLM application. Deprecated; it is now part of the test case.
For conversational AI, this field can store a list of previous turns in the conversation. Each turn is a dictionary with “input” and “actual_output” keys.
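To make the token count, cost, and conversation history properties above more concrete, the sketch below assembles them as Python dictionaries and derives a total call cost. The dictionary keys come from the descriptions above; the variable names (usage_info, cost_info, conversation_turns), the prices, and the assumption that cache-read tokens are billed separately from regular input tokens are illustrative, not part of the documented API.

```python
# Token counts reported by the LLM provider, using the snake_case keys above.
usage_info = {
    "input_tokens": 1250,
    "output_tokens": 180,
    "cache_read_input_tokens": 400,
}

# Illustrative per-token prices; use your provider's actual pricing model.
cost_info = {
    "cost_per_input_token": 0.000003,
    "cost_per_output_token": 0.000015,
    "cost_per_cache_read_input_token": 0.00000075,
}

# Total cost of the call, assuming cache-read tokens are priced separately
# from regular input tokens.
total_cost = (
    usage_info["input_tokens"] * cost_info["cost_per_input_token"]
    + usage_info["output_tokens"] * cost_info["cost_per_output_token"]
    + usage_info["cache_read_input_tokens"]
    * cost_info["cost_per_cache_read_input_token"]
)

# Previous turns of a conversation, each with "input" and "actual_output" keys.
conversation_turns = [
    {"input": "Hi, do you sell phones?",
     "actual_output": "Yes, we carry the latest iPhone and Android models."},
    {"input": "How much does the iPhone 16 cost?",
     "actual_output": "The iPhone 16 costs $950."},
]
```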
Evaluation Task Results Properties
Once an evaluation task is created, you can access the following information:
The current status of the evaluation task. Possible values:
- Pending: The task has been created but not yet processed.
- Success: The task was processed successfully.
- Failed: The task encountered an error during processing.
The score assigned to the output by the metric type’s evaluation criteria. Example: 0.85
The explanation of the score assigned to the output by the metric type’s evaluation criteria.
The error message if the task failed during processing.
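As an illustration of how these result fields might be consumed, the sketch below assumes the SDK exposes a way to fetch a task and that the attribute names (status, score, reason, error) mirror the properties above; none of these names are guaranteed by this page.

```python
from galtea import Galtea  # assumed import path for the SDK client

galtea = Galtea(api_key="YOUR_API_KEY")

# Hypothetical read-back of a finished evaluation task; the getter and the
# attribute names mirror the result properties above and may differ in the SDK.
task = galtea.evaluation_tasks.get(task_id="your-task-id")

if task.status == "Success":
    print(f"Score: {task.score}")    # e.g. 0.85
    print(f"Reason: {task.reason}")  # explanation behind the score
elif task.status == "Failed":
    print(f"Error: {task.error}")    # error message from processing
else:
    print("Task is still pending.")
```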