Evaluation Task
The assessment of a specific test case using a metric type
What is an Evaluation Task?
An evaluation task in Galtea represents the assessment of a single test case from a test using a specific metric type's evaluation criteria. Multiple evaluation tasks make up an evaluation.
Task Lifecycle
Evaluation tasks follow a specific lifecycle:
1. Inference
2. Creation
3. Processing
4. Completion
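As a rough illustration, the stages above can be modeled as an ordered enum. The stage comments are an interpretation of the lifecycle described on this page, and the `next_stage` helper is purely illustrative, not part of the Galtea SDK:

```python
from enum import Enum
from typing import Optional

class TaskStage(Enum):
    """Lifecycle stages of an evaluation task, in order."""
    INFERENCE = 1   # the product version produces its actual output
    CREATION = 2    # the task is created within an evaluation
    PROCESSING = 3  # the metric type's criteria score the output
    COMPLETION = 4  # score, reason, or error become available

def next_stage(stage: TaskStage) -> Optional[TaskStage]:
    """Return the stage that follows, or None once the task is complete."""
    members = list(TaskStage)
    idx = members.index(stage)
    return members[idx + 1] if idx + 1 < len(members) else None
```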
Related Concepts
Test
A set of test cases for evaluating product performance
Test Case
Each challenge in a test for evaluating product performance
Evaluation
A link between a product version and a test that groups evaluation tasks
Metric Type
Ways to evaluate and score product performance
SDK Integration
Evaluation Task Service SDK
SDK methods for managing evaluation tasks
Task Creation Properties
To trigger an evaluation task in Galtea, you can provide the following information:
- Evaluation ID: The unique identifier of the evaluation that the task is part of.
- Test Case ID: The unique identifier of the test case to be evaluated.
- Actual Output: The actual output produced by the product's version. Example: “The iPhone 16 costs $950.”
- Retrieval Context: The context documents retrieved by a RAG (Retrieval-Augmented Generation) system that were used to generate the actual output. This field is only relevant for products that use retrieval-augmented generation.
- Metric Type ID: The unique identifier of the metric type to be used in the evaluation task.
- Latency: Time elapsed (in milliseconds) from the moment the request was sent to the LLM to the moment the response was received. Tracking latency helps monitor performance characteristics of your LLM product over time.
- Usage Info: Token count information of the LLM call. May include:
  - Input Tokens: Number of input tokens used in the LLM call.
  - Output Tokens: Number of output tokens generated by the LLM call.
  - Cache Read Input Tokens: Number of tokens read by the model from cache.
  Token counts are calculated based on the LLM provider’s tokenization method and are usually returned in the response of the endpoint.
- Cost Info: The costs associated with the LLM call. May include:
  - Cost Per Input Token: Cost for each input token read by the LLM.
  - Cost Per Output Token: Cost for each output token generated by the LLM.
  - Cost Per Cache Read Input Token: Cost for each input token read from cache.
  Costs are calculated based on the pricing model of the LLM provider.
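To make the pricing model concrete, the sketch below derives a total call cost from the token counts and per-token costs described above. The function name and the example rates are illustrative only, not a Galtea SDK API:

```python
def llm_call_cost(input_tokens: int,
                  output_tokens: int,
                  cache_read_input_tokens: int,
                  cost_per_input_token: float,
                  cost_per_output_token: float,
                  cost_per_cache_read_input_token: float) -> float:
    """Total LLM call cost: each token category billed at its own rate."""
    return (input_tokens * cost_per_input_token
            + output_tokens * cost_per_output_token
            + cache_read_input_tokens * cost_per_cache_read_input_token)
```

For example, a call with 1,000 input tokens, 500 output tokens, and 200 cache-read tokens at hypothetical rates of $0.000002, $0.000006, and $0.000001 per token would cost $0.0052.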
Task Results Properties
Once an evaluation task is completed, you can access the following information:
- Status: The current status of the evaluation task. Possible values:
  - Pending: The task has been created but not yet processed.
  - Success: The task was processed successfully.
  - Failed: The task encountered an error during processing.
- Score: The score assigned to the output by the metric type’s evaluation criteria. Example: 0.85
- Reason: The explanation of the score assigned to the output by the metric type’s evaluation criteria.
- Error: The error message if the task failed during processing.
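A minimal sketch of consuming these result fields follows. The field names mirror the list above, but the `TaskResult` container and `summarize` helper are hypothetical and not part of the Galtea SDK:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskResult:
    status: str                    # "Pending", "Success", or "Failed"
    score: Optional[float] = None  # e.g. 0.85, set on success
    reason: Optional[str] = None   # explanation of the score
    error: Optional[str] = None    # set when the task failed

def summarize(result: TaskResult) -> str:
    """Render a one-line summary depending on the task status."""
    if result.status == "Success":
        return f"score={result.score} ({result.reason})"
    if result.status == "Failed":
        return f"failed: {result.error}"
    return "pending"
```

Branching on the status field first, as shown, avoids reading a score that only exists for successful tasks or an error that only exists for failed ones.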