What is an Evaluation Task?

An evaluation task in Galtea represents the assessment of a single test case from a test using the evaluation criteria of a metric. Multiple evaluation tasks make up an evaluation.

Evaluation tasks don’t perform inference on the LLM product themselves. Rather, they evaluate outputs that have already been generated. You should perform inference on your product first, then trigger the evaluation task with both the test case’s data and the resulting output.

Evaluation tasks can only be created programmatically using the Galtea SDK, but they can be viewed and managed on the Galtea dashboard.
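
For illustration, here is a minimal sketch of that flow with the Python SDK. The client initialization follows the SDK’s documented pattern, but the parameter names of the create call below (evaluation_id, test_case_id, metric_type, actual_output) are assumptions derived from the properties listed later on this page; check the SDK reference for the exact signature.

```python
from galtea import Galtea

# Initialize the SDK client with your Galtea API key.
galtea = Galtea(api_key="YOUR_API_KEY")

def my_product_generate(user_query: str) -> str:
    """Placeholder for your own product's inference call."""
    return "The iPhone 16 costs $950."

# 1. Perform inference on your product first, using the test case's input.
actual_output = my_product_generate("How much does the iPhone 16 cost?")

# 2. Then trigger the evaluation task with the test case's data and the
#    resulting output. Parameter names are assumptions based on the
#    properties documented on this page.
task = galtea.evaluation_tasks.create(
    evaluation_id="YOUR_EVALUATION_ID",
    test_case_id="YOUR_TEST_CASE_ID",
    metric_type="answer_relevancy",  # assumed metric type identifier
    actual_output=actual_output,
)

print(task.status)  # a newly created task typically starts as "pending"
```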

Task Lifecycle

Evaluation tasks follow a specific lifecycle:

1. Inference: Perform inference on the LLM product to generate an output using the test case’s data.
2. Creation: Trigger an evaluation task. It will appear in the Evaluation’s details page with the status “pending”.
3. Processing: Galtea’s evaluation system processes the task using the evaluation criteria of the selected metric type.
4. Completion: Once processed, the status changes to “completed” and the results are available.

SDK Integration

The Galtea SDK allows you to create, view, and manage evaluation tasks programmatically. This is particularly useful for organizations that want to automate their evaluation process or integrate it into their CI/CD pipeline.
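
As one possible pattern, a CI job could fetch the tasks of an evaluation and fail the build when scores fall below a threshold. The listing accessor and the score/status attributes in this sketch are assumptions modeled on the result properties described below, not a confirmed SDK signature.

```python
import sys

from galtea import Galtea

galtea = Galtea(api_key="YOUR_API_KEY")

# Hypothetical accessor: fetch all tasks that belong to one evaluation.
# Adapt the method name and filters to the SDK version you are using.
tasks = galtea.evaluation_tasks.list(evaluation_id="YOUR_EVALUATION_ID")

THRESHOLD = 0.7  # minimum acceptable score for this pipeline

failing = [t for t in tasks if t.status == "success" and t.score < THRESHOLD]
if failing:
    print(f"{len(failing)} evaluation task(s) scored below {THRESHOLD}")
    sys.exit(1)  # non-zero exit fails the CI job
print("All evaluation tasks passed the score threshold")
```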

Evaluation Tasks from Production

For evaluation tasks created from production data, you can use the SDK method create_from_production. It allows you to supply the conversation history and other data directly, without needing a test_case_id.
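
A hedged sketch of what such a call might look like is shown below. The conversation_turns entries follow the “input”/“actual_output” shape described in the properties section; the remaining parameter names are assumptions for illustration, so consult the SDK reference for the exact signature.

```python
from galtea import Galtea

galtea = Galtea(api_key="YOUR_API_KEY")

# Evaluate a real production interaction without a pre-existing test case.
# All parameter names except conversation_turns are assumptions.
task = galtea.evaluation_tasks.create_from_production(
    version_id="YOUR_VERSION_ID",    # assumed: the product version that produced the output
    metric_type="answer_relevancy",  # assumed metric type identifier
    input="How much does the iPhone 16 cost?",
    actual_output="The iPhone 16 costs $950.",
    conversation_turns=[
        {"input": "Hi, I need pricing info.", "actual_output": "Sure, which product are you interested in?"},
    ],
)
```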

Evaluation Task Properties

To trigger an evaluation task in Galtea, you can provide the following information:

Evaluation (Evaluation, required)

The evaluation to which this task belongs.

Test Case (Test Case, required)

The test case to be evaluated.

Metric Type (Metric Type, required)

The metric type used for the evaluation task. This defines the evaluation criteria and scoring method.

Actual Output (Text, required)

The actual output produced by the product’s version. Example: “The iPhone 16 costs $950.”

Retrieval Context (Text)

The context retrieved by your RAG system that was used to generate the actual output.

Including retrieval context enables more comprehensive evaluation of RAG systems, allowing for assessment of both retrieval and generation capabilities.

Latency (float)

Time elapsed (in ms) from the moment the request was sent to the LLM to the moment the response was received.

Usage Info (Object)

Token count information of the LLM call. Use snake_case keys: input_tokens, output_tokens, cache_read_input_tokens.

Token counts are calculated based on the LLM provider’s tokenization method and are usually returned in the response of the endpoint.

Cost Info (Object)

The costs associated with the LLM call. May include:

  • cost_per_input_token: Cost for each input token read by the LLM.
  • cost_per_output_token: Cost for each output token generated by the LLM.
  • cost_per_cache_read_input_token: Cost for each input token read from cache.

Costs are calculated based on the pricing model of the LLM provider.

If cost information is properly configured in the Model selected by the Version, the system will automatically calculate the cost of the evaluation task and you don’t need to provide this information. However, if provided, it will override the system’s calculation.

Input (Text, deprecated)

The user query that your model needs to answer. Deprecated: it is now part of the test case.

Expected Output (Text, deprecated)

Expected output for the evaluation task. Deprecated: it is now part of the test case.

Context (Text, deprecated)

Additional data provided to your LLM application. Deprecated: it is now part of the test case.

Conversation Turns (List[Dict[Text, Text]], deprecated)

For conversational AI, this field can store a list of previous turns in the conversation. Each turn is a dictionary with “input” and “actual_output” keys.
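
Tying the optional properties together, here is a hedged sketch of a creation call that supplies retrieval context, latency, usage and cost information. The usage_info keys use the snake_case names listed above; the other parameter names are assumptions aligned with the property names on this page rather than a confirmed SDK signature.

```python
from galtea import Galtea

galtea = Galtea(api_key="YOUR_API_KEY")

task = galtea.evaluation_tasks.create(
    evaluation_id="YOUR_EVALUATION_ID",
    test_case_id="YOUR_TEST_CASE_ID",
    metric_type="factual_accuracy",  # assumed metric type identifier
    actual_output="The iPhone 16 costs $950.",
    retrieval_context="Pricing page: the iPhone 16 starts at $950.",
    latency=1240.5,  # milliseconds from request sent to response received
    usage_info={
        "input_tokens": 312,
        "output_tokens": 48,
        "cache_read_input_tokens": 0,
    },
    # Only needed if costs are not configured on the Model selected by the
    # Version; if provided, it overrides the system's own calculation.
    cost_info={
        "cost_per_input_token": 0.000003,
        "cost_per_output_token": 0.000015,
        "cost_per_cache_read_input_token": 0.00000075,
    },
)
```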

Evaluation Task Results Properties

Once an evaluation task is created, you can access the following information:

Status (Enum)

The current status of the evaluation task. Possible values:

  • Pending: The task has been created but not yet processed.
  • Success: The task was processed successfully.
  • Failed: The task encountered an error during processing.

Score (Number)

The score assigned to the output by the metric type’s evaluation criteria. Example: 0.85

Reason (Text)

The explanation of the score assigned to the output by the metric type’s evaluation criteria.

Error (Text)

The error message if the task failed during processing.
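
As a small sketch of how these results might be read back through the SDK (the get accessor and the attribute names mirror the properties above but are assumptions; verify them against the SDK reference):

```python
from galtea import Galtea

galtea = Galtea(api_key="YOUR_API_KEY")

# Hypothetical accessor for retrieving a single evaluation task by its id.
task = galtea.evaluation_tasks.get(task_id="YOUR_TASK_ID")

status = task.status.lower()  # dashboard values: Pending, Success, Failed
if status == "success":
    print(f"Score: {task.score}")    # e.g. 0.85
    print(f"Reason: {task.reason}")  # explanation from the metric type's criteria
elif status == "failed":
    print(f"Error: {task.error}")
else:
    print("Task is still pending")
```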