Returns
Returns a list of `Evaluation` objects, one for each metric provided.

Usage
This method is versatile and can be used for two main scenarios:

- Test-Based Evaluation: When you provide a `test_case_id`, Galtea evaluates your product’s performance against a predefined challenge.
- Production Monitoring: When you set `is_production=True` and provide an `input`, Galtea logs and evaluates your product’s performance in a live environment.
Development Testing
For non self-hosted metrics:
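The sketch below assumes the client is constructed as `Galtea(api_key=...)` and that this method is exposed as `galtea.evaluations.create`; the version, test case, and metric identifiers are placeholders to replace with your own.

```python
from galtea import Galtea

# Assumed client construction and method path -- adjust to your SDK version.
galtea = Galtea(api_key="YOUR_API_KEY")

# Galtea-hosted (non self-hosted) metrics: omit `score`;
# Galtea computes the score automatically.
evaluations = galtea.evaluations.create(
    version_id="YOUR_VERSION_ID",        # the product version under test
    test_case_id="YOUR_TEST_CASE_ID",    # the predefined challenge
    actual_output="The answer your product produced for this test case.",
    metrics=[{"name": "accuracy-v1"}],   # placeholder metric name
)

# One Evaluation object is returned per metric provided.
for evaluation in evaluations:
    print(evaluation)
```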
Production Monitoring

To monitor your product in a production environment, you can create evaluations that are not linked to a specific test case. To do so, set the `is_production` flag to `True`.
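A hedged sketch of a production-monitoring call, under the same client and method-name assumptions as above; the metric name is a placeholder and `latency` is optional.

```python
from galtea import Galtea

galtea = Galtea(api_key="YOUR_API_KEY")  # assumed constructor

# No test_case_id in production: pass the live input and set is_production=True.
evaluations = galtea.evaluations.create(
    version_id="YOUR_VERSION_ID",
    is_production=True,
    input="What is your refund policy?",  # the live user prompt
    actual_output="You can return any item within 30 days of purchase.",
    metrics=[{"name": "accuracy-v1"}],    # placeholder metric name
    latency=230,                          # milliseconds, optional
)
```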
Advanced Usage
You can also create evaluations using self-hosted metrics with dynamically calculated scores by using the `CustomScoreEvaluationMetric` class, which allows for more complex evaluation scenarios.
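A sketch of the dynamic-score pattern. The subclass-and-`measure` extension point shown here is an assumption about how `CustomScoreEvaluationMetric` calculates scores, and the import path and metric name are placeholders; consult the class reference for the actual hook.

```python
from galtea import Galtea, CustomScoreEvaluationMetric  # import path assumed

# Hypothetical subclass: `measure` is an assumed extension point for
# computing the score dynamically from the evaluated interaction.
class AnswerLengthMetric(CustomScoreEvaluationMetric):
    def measure(self, input: str, actual_output: str) -> float:
        # Toy logic: reward concise answers with a score in [0.0, 1.0].
        return min(1.0, 100 / max(len(actual_output), 1))

galtea = Galtea(api_key="YOUR_API_KEY")

# The metric instance is initialized with its name; do NOT also pass
# `id` or `name` in the metric dict when supplying a dynamic score.
evaluations = galtea.evaluations.create(
    version_id="YOUR_VERSION_ID",
    test_case_id="YOUR_TEST_CASE_ID",
    actual_output="The answer your product produced.",
    metrics=[{"score": AnswerLengthMetric(name="answer-length")}],
)
```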
Parameters
- `version_id`: The ID of the version you want to evaluate.
- `metrics`: A list of metrics to use for the evaluation (see the combined example after this parameter list). The `MetricInput` dictionary supports the following keys:
  - `id` (string, optional): The ID of an existing metric.
  - `name` (string, optional): The name of the metric.
  - `score` (float | CustomScoreEvaluationMetric, optional): For self-hosted metrics only.
    - If `float`: a pre-computed score (0.0 to 1.0). Requires `id` or `name` in the dict.
    - If `CustomScoreEvaluationMetric`: the score will be calculated dynamically. The `CustomScoreEvaluationMetric` instance must be initialized with `name` or `id`. Do NOT provide `id` or `name` in the dict when using this option.

  For self-hosted metrics, both score options are equally valid: pre-compute as a float, or use `CustomScoreEvaluationMetric` for dynamic calculation. Galtea-hosted metrics automatically compute scores and should not include a `score` field.
- `actual_output`: The actual output produced by the product.
- `test_case_id`: The ID of the test case to be evaluated. Required for non-production evaluations.
- `input`: The input text/prompt. Required for production evaluations where no `test_case_id` is provided.
- `is_production`: Set to `True` to indicate the evaluation is from a production environment. Defaults to `False`.
- `retrieval_context`: The context retrieved by your RAG system that was used to generate the `actual_output`.
- `latency`: Time in milliseconds from the request to the LLM until the response was received.
- `usage_info`: Information about token usage during the model call. Possible keys include:
  - `input_tokens`: Number of input tokens sent to the model.
  - `output_tokens`: Number of output tokens generated by the model.
  - `cache_read_input_tokens`: Number of input tokens read from the cache.
- `cost_info`: Information about the cost per token during the model call. Possible keys include:
  - `cost_per_input_token`: Cost per input token sent to the model.
  - `cost_per_output_token`: Cost per output token generated by the model.
  - `cost_per_cache_read_input_token`: Cost per input token read from the cache.
- `conversation_simulator_version`: The version of Galtea’s conversation simulator used to generate the user message (`input`). This should only be provided if the input was generated using the `conversation_simulator_service`.
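To tie the parameters together, here is a hedged end-to-end sketch showing all three metric forms alongside `usage_info` and `cost_info`. As above, the client constructor and method path are assumptions, and all IDs, names, and numbers are placeholders to adapt to your setup.

```python
from galtea import Galtea, CustomScoreEvaluationMetric  # import path assumed

galtea = Galtea(api_key="YOUR_API_KEY")

evaluations = galtea.evaluations.create(
    version_id="YOUR_VERSION_ID",
    test_case_id="YOUR_TEST_CASE_ID",
    actual_output="Answer generated by your product.",
    retrieval_context="Chunks your RAG system retrieved to ground the answer.",
    latency=180,  # milliseconds from request to LLM response
    metrics=[
        # Galtea-hosted metric: no score field; computed automatically.
        {"name": "accuracy-v1"},
        # Self-hosted metric with a pre-computed score (0.0 to 1.0);
        # requires id or name in the dict.
        {"name": "my-self-hosted-metric", "score": 0.85},
        # Self-hosted metric scored dynamically: the instance carries its
        # own name, so the dict has no id/name key (see Advanced Usage).
        {"score": CustomScoreEvaluationMetric(name="my-dynamic-metric")},
    ],
    usage_info={
        "input_tokens": 512,
        "output_tokens": 128,
        "cache_read_input_tokens": 256,
    },
    cost_info={
        "cost_per_input_token": 2e-6,
        "cost_per_output_token": 6e-6,
        "cost_per_cache_read_input_token": 1e-6,
    },
)

# Three metrics in, three Evaluation objects out.
```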