
What is an Evaluation?

An evaluation in Galtea represents the assessment of inference results from a session using the evaluation criteria of a metric. Evaluations are linked directly to sessions, so a full session containing multiple inference results (for example, a multi-turn dialogue) can be evaluated as a whole.
Evaluations do not call your AI product — they score outputs that have already been generated. Run your product first to produce outputs, then trigger the evaluation to score them.
You can create evaluations from the Galtea dashboard (using Endpoint Connections) or programmatically using the Galtea SDK.
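As a rough illustration of that two-step flow with the Python SDK, the sketch below assumes the client is constructed with an API key and exposes evaluations.create_single_turn (described in detail further down); the API key, version and session IDs, metric name, and the my_product stub are all placeholders to adjust for your setup.

```python
from galtea import Galtea

galtea = Galtea(api_key="YOUR_GALTEA_API_KEY")  # placeholder credentials

def my_product(user_query: str) -> str:
    """Stand-in for your own AI product's inference call."""
    return "The iPhone 16 costs $950."

# 1) Run your product first so there is an output to score.
user_query = "How much does the iPhone 16 cost?"
answer = my_product(user_query)

# 2) Then trigger the evaluation of that already-generated output.
galtea.evaluations.create_single_turn(
    version_id="YOUR_VERSION_ID",
    session_id="YOUR_SESSION_ID",
    metrics=["factual-accuracy"],  # placeholder metric name
    actual_output=answer,
    input=user_query,
    is_production=True,
)
```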

Evaluation Lifecycle

Evaluations follow a specific lifecycle (a polling sketch follows the steps):

1. Creation: Trigger an evaluation. It will appear on the session’s details page with the status “pending”.
2. Processing: Galtea’s evaluation system processes the evaluation using the evaluation criteria of the selected metric.
3. Completion: Once processed, the status changes to “completed” and the results are available.
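A minimal sketch of waiting out that lifecycle is below. The evaluations.get call is a hypothetical lookup helper (substitute whichever retrieval call your SDK version provides), and the exact status strings may differ from the labels shown above.

```python
import time

def wait_for_evaluation(galtea, evaluation_id, poll_seconds=5, timeout=300):
    """Poll until the evaluation leaves its pending states or the timeout hits."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        evaluation = galtea.evaluations.get(evaluation_id)  # hypothetical helper name
        if evaluation.status not in ("pending", "pending_human"):  # assumed status strings
            return evaluation  # completed, failed, or skipped
        time.sleep(poll_seconds)
    raise TimeoutError(f"Evaluation {evaluation_id} is still pending after {timeout}s")
```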

SDK Integration

The Galtea SDK allows you to create, view, and manage evaluations programmatically. This is particularly useful for organizations that want to automate their evaluation workflows or integrate them into their CI/CD pipeline.

Evaluation Properties

The properties required for an evaluation depend on which method you use:

For create_single_turn() (Single-Turn Evaluations)

Version ID (string, required)
The ID of the version you want to evaluate.

Session ID (string, required)
The ID of the session containing the inference results to be evaluated.

Metrics IDs (list[Metrics], required)
A list of the metrics used for the evaluation.

Actual Output (Text, required)
The actual output produced by the product’s version. Example: “The iPhone 16 costs $950.”

Test Case ID (Test Case)
The test case to be evaluated. Required for non-production evaluations.

Input (Text)
The user query that your model needs to answer. Required for production evaluations where no test_case_id is provided.

Is Production (boolean)
Set to True to indicate the evaluation comes from a production environment. Defaults to False.

Retrieval Context (Text)
The context retrieved by your RAG system that was used to generate the actual output. Including retrieval context enables more comprehensive evaluation of RAG systems.

Latency (float)
Time elapsed (in ms) from the moment the request was sent to the LLM to the moment the response was received.

Usage Info (Object)
Token count information for the LLM call. Use snake_case keys: input_tokens, output_tokens, cache_read_input_tokens.

Cost Info (Object)
The costs associated with the LLM call. Keys may include cost_per_input_token, cost_per_output_token, etc. If cost information is properly configured in the Model selected by the Version, the system will automatically calculate the cost; values provided here override the system’s calculation.
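Putting those properties together, a production-style call might look like the following sketch, reusing the galtea client from the earlier example. The parameter names are the snake_case forms of the fields above and every value is a placeholder, so check them against your SDK version.

```python
galtea.evaluations.create_single_turn(
    version_id="YOUR_VERSION_ID",
    session_id="YOUR_SESSION_ID",
    metrics=["factual-accuracy", "answer-relevancy"],  # placeholder metric names
    actual_output="The iPhone 16 costs $950.",
    input="How much does the iPhone 16 cost?",  # used because no test_case_id is given
    is_production=True,
    retrieval_context="Pricing page: iPhone 16, 128 GB, $950.",
    latency=523.4,  # milliseconds from request sent to response received
    usage_info={
        "input_tokens": 842,
        "output_tokens": 31,
        "cache_read_input_tokens": 0,
    },
    cost_info={  # optional: overrides the cost derived from the Model configuration
        "cost_per_input_token": 0.0000025,
        "cost_per_output_token": 0.00001,
    },
)
```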

Evaluation Results Properties

Once an evaluation is created, you can access the following information:
Status (Enum)
The current status of the evaluation. Possible values:
  • Pending: The evaluation has been created but not yet processed.
  • Pending Human: The evaluation is waiting for a human annotator to review and score it. This status is used for metrics with source: "human_evaluation".
  • Success: The evaluation was processed successfully.
  • Failed: The evaluation encountered an error during processing. Check the canRetry field to determine if this evaluation can be retried.
  • Skipped: The evaluation was skipped because the metric validation failed (e.g., missing required parameters) or due to insufficient credits. The error message contains details about what was missing. Check the canRetry field to determine if this evaluation can be retried.

Score (Number)
The score assigned to the output by the metric’s evaluation criteria. Example: 0.85

Reason (Text)
The explanation of the score assigned to the output by the metric’s evaluation criteria.

Error (Text)
The error message if the evaluation failed during processing.

Retries Attempted (Number)
The number of retry attempts made for this evaluation. Starts at 0 and increments each time the evaluation is retried. Used to track retry history and enforce retry limits.

Can Retry (Boolean, default: false)
Indicates whether the evaluation can be retried. When true, the evaluation failed or was skipped due to a temporary condition (e.g., evaluation processing error, insufficient credits) and can be retried later. When false, the evaluation cannot be retried. Only evaluations with canRetry=true are eligible for the retry operation.

Human Evaluator ID (Text)
The ID of the human evaluator assigned to this evaluation, if applicable.

Human Evaluator Started At (Text)
The timestamp when the human evaluator started reviewing this evaluation, if applicable.
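As a short sketch of reading these results, the snippet below reuses the hypothetical wait_for_evaluation helper and galtea client from the earlier sketches; the attribute names mirror the properties above in snake_case, and the status string is an assumption to verify against your SDK version.

```python
evaluation = wait_for_evaluation(galtea, "EVALUATION_ID")

if evaluation.status == "success":  # assumed status string
    print(f"score={evaluation.score:.2f} reason={evaluation.reason}")
elif evaluation.can_retry:
    # Failed or skipped for a temporary reason (processing error, insufficient credits).
    print(f"retryable, {evaluation.retries_attempted} attempt(s) so far: {evaluation.error}")
else:
    print(f"cannot be retried: {evaluation.error}")
```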