What is an Evaluation Task?

An evaluation task in Galtea represents the assessment of a single test case from a test against the evaluation criteria of a specific metric type. Multiple evaluation tasks make up an evaluation.

Evaluation tasks don’t perform inference on the LLM product themselves. Rather, they evaluate outputs that have already been generated. You should perform inference on your product first, then trigger the evaluation task with both the input (provided by the test case) and the resulting output.
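
As an illustration of that order of operations, the sketch below runs inference with your own product code first and only then triggers the evaluation task. The import path, client constructor, and `evaluation_tasks.create` method name are assumptions made for this example and may not match your SDK version; see the Evaluation Task Service SDK reference below for the actual methods.

```python
# Illustrative sketch only: the client and method names below are assumptions,
# not the verified Galtea SDK surface.
from galtea import Galtea  # assumed import path

galtea = Galtea(api_key="YOUR_API_KEY")  # assumed constructor

# 1. Run inference on your own LLM product first (your code, not Galtea's).
test_case_input = "How much does the iPhone 16 cost?"
actual_output = my_product.generate(test_case_input)  # hypothetical product call

# 2. Then trigger the evaluation task with the input and the resulting output.
task = galtea.evaluation_tasks.create(   # assumed service and method name
    evaluation_id="eval_123",            # hypothetical ID
    test_case_id="tc_456",               # hypothetical ID
    metric_type="factual-accuracy",      # hypothetical metric type name
    actual_output=actual_output,
)
```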

Task Lifecycle

Evaluation tasks follow a specific lifecycle:

1. Inference: Perform inference on the LLM product to generate an output using the test case’s input and context.
2. Creation: Trigger an evaluation task. It will appear in the Evaluation’s details page with the status “pending”.
3. Processing: Galtea’s evaluation system processes the task using the evaluation criteria of the selected metric type.
4. Completion: Once processed, the status changes to “completed” and the results are available.
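
Because a newly created task starts as “pending” and only later reaches a finished state, you may want to poll for the result. The helper below is a minimal sketch under the same assumptions as above; `evaluation_tasks.get` and the `status` attribute are illustrative names, not confirmed SDK API.

```python
import time

def wait_for_completion(galtea, task_id, poll_seconds=2.0):
    """Poll an evaluation task until it leaves the "pending" state.

    Sketch only: the `evaluation_tasks.get` method and `status` attribute
    are assumed names and may differ in your SDK version.
    """
    while True:
        task = galtea.evaluation_tasks.get(task_id)  # assumed getter
        if task.status.lower() != "pending":
            return task  # processing finished (successfully or not)
        time.sleep(poll_seconds)
```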

SDK Integration

See the Evaluation Task Service SDK reference for the SDK methods used to manage evaluation tasks.

Task Creation Properties

To trigger an evaluation task in Galtea, you can provide the following information:

Evaluation ID (text, required)
The unique identifier of the evaluation that the task is part of.

Test Case ID (text, required)
The unique identifier of the test case to be evaluated.

Actual Output (String, required)
The actual output produced by the product’s version. Example: “The iPhone 16 costs $950.”

Retrieval Context (List[String])
The context documents retrieved by a RAG (Retrieval-Augmented Generation) system that were used to generate the actual output. This field is only relevant for products that use retrieval-augmented generation.

Metric Type ID (text, required)
The unique identifier of the metric type to be used in the evaluation task. The SDK also allows you to use the metric type’s name instead of the ID.

Latency (Number)
Time elapsed (in milliseconds) from the moment the request was sent to the LLM to the moment the response was received. Tracking latency helps monitor performance characteristics of your LLM product over time.

Usage Info (Object)
Token count information of the LLM call. May include:

  • Input Tokens: Number of input tokens used in the LLM call.
  • Output Tokens: Number of output tokens generated by the LLM call.
  • Cache Read Input Tokens: Number of tokens read by the model from cache.

Token counts are calculated based on the LLM provider’s tokenization method and are usually returned in the response of the endpoint.

Cost Info (Object)
The costs associated with the LLM call. May include:

  • Cost Per Input Token: Cost for each input token read by the LLM.
  • Cost Per Output Token: Cost for each output token generated by the LLM.
  • Cost Per Cache Read Input Token: Cost for each input token read from cache.

Costs are calculated based on the pricing model of the LLM provider.

If cost information is properly configured in the Model selected by the Version, the system will automatically calculate the cost of the evaluation task and you don’t need to provide this information.
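
Putting these properties together, a fully populated creation call could look like the sketch below. The keyword and dictionary key names simply mirror the properties listed above; the exact parameter names and object shapes expected by the SDK are assumptions here.

```python
# Sketch only: parameter and key names mirror the properties above,
# but the exact SDK signature is an assumption.
task = galtea.evaluation_tasks.create(
    evaluation_id="eval_123",                      # required
    test_case_id="tc_456",                         # required
    metric_type="factual-accuracy",                # ID, or a name if your SDK supports it
    actual_output="The iPhone 16 costs $950.",     # required
    retrieval_context=[                            # only relevant for RAG products
        "Pricing page: the iPhone 16 starts at $950.",
    ],
    latency=840,                                   # milliseconds from request to response
    usage_info={                                   # token counts from the LLM response
        "input_tokens": 1200,
        "output_tokens": 85,
        "cache_read_input_tokens": 0,
    },
    cost_info={                                    # omit if the Model already has pricing configured
        "cost_per_input_token": 0.000003,
        "cost_per_output_token": 0.000015,
        "cost_per_cache_read_input_token": 0.0000015,
    },
)
```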

Task Results Properties

Once an evaluation task is completed, you can access the following information:

Status (Enum)
The current status of the evaluation task. Possible values:

  • Pending: The task has been created but not yet processed.
  • Success: The task was processed successfully.
  • Failed: The task encountered an error during processing.

Score (Number)
The score assigned to the output by the metric type’s evaluation criteria. Example: 0.85

Reason (Text)
The explanation of the score assigned to the output by the metric type’s evaluation criteria.

Error (Text)
The error message if the task failed during processing.
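
As an example, a hedged sketch of reading these result properties might look like the following; the attribute names mirror the list above, but the exact field names and status casing are assumptions about the SDK.

```python
# Sketch only: attribute names mirror the result properties above;
# exact field names and status casing are assumptions.
task = galtea.evaluation_tasks.get("task_789")  # assumed getter, hypothetical ID

status = task.status.lower()
if status == "success":
    print(f"Score: {task.score}")        # e.g. 0.85
    print(f"Reason: {task.reason}")      # explanation from the metric type's criteria
elif status == "failed":
    print(f"Error: {task.error}")        # error message from processing
else:
    print("Task is still pending.")
```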