Returns
Returns a list of Evaluation objects, one for each metric provided.

Usage
This method is versatile and can be used for two main scenarios:

- Test-Based Evaluation: When you provide a test_case_id, Galtea evaluates your product’s performance against a predefined challenge.
- Production Monitoring: When you set is_production=True and provide an input, Galtea logs and evaluates real user interactions.
Example: Using the MetricInput Dictionary Format (Recommended)
The recommended way to specify metrics in SDK v3.0 is using the MetricInput dictionary format. For self-hosted metrics, you have two equally valid options:
Option 1: Pre-computed scores
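Below is a minimal sketch of this option. The client setup and the create method name are assumptions for illustration; only the metrics format follows the description on this page.

```python
from galtea import Galtea  # import path assumed

galtea = Galtea(api_key="YOUR_API_KEY")

# Run your own scoring logic first; only the final float (0.0 to 1.0) is sent.
precomputed_score = 0.87

evaluations = galtea.evaluation_tasks.create(  # hypothetical method name
    version_id="YOUR_VERSION_ID",
    test_case_id="YOUR_TEST_CASE_ID",
    actual_output="The product's answer to the test case.",
    metrics=[
        # Self-hosted metric: identified by name, score supplied up front.
        {"name": "my-self-hosted-metric", "score": precomputed_score},
        # Galtea-hosted metric: no score key; Galtea computes it.
        {"name": "Role Adherence"},
    ],
)
```

Option 2: Dynamic calculation with CustomScoreEvaluationMetric

A sketch of the second option. The measure method name and signature below are assumptions; check the SDK reference for the exact interface.

```python
from galtea import CustomScoreEvaluationMetric  # import path assumed

class ExactMatch(CustomScoreEvaluationMetric):
    """Hypothetical metric: 1.0 when the output equals the expected output."""

    def measure(self, input, actual_output, expected_output=None, **kwargs) -> float:  # signature assumed
        return 1.0 if actual_output == expected_output else 0.0

evaluations = galtea.evaluation_tasks.create(  # hypothetical method name
    version_id="YOUR_VERSION_ID",
    test_case_id="YOUR_TEST_CASE_ID",
    actual_output="The product's answer to the test case.",
    metrics=[
        # The instance is initialized with its name; do NOT repeat id or name in the dict.
        {"score": ExactMatch(name="exact-match")},
    ],
)
```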
Both options are equally valid for self-hosted metrics. Choose based on your preference: pre-compute for simplicity, or use CustomScoreEvaluationMetric for encapsulation and reusability.
Example: Legacy Format (Not Recommended)
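A sketch of the legacy format, reusing the assumed client and the ExactMatch class from the sketches above:

```python
evaluations = galtea.evaluation_tasks.create(  # hypothetical method name
    version_id="YOUR_VERSION_ID",
    test_case_id="YOUR_TEST_CASE_ID",
    actual_output="The product's answer to the test case.",
    metrics=[
        "Role Adherence",                # legacy: by name (string)
        ExactMatch(name="exact-match"),  # legacy: custom class at the top level
    ],
)
```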
Example: Production Monitoring
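A sketch of a production call. The parameter names follow the Parameters section below; the client and method name are the same assumptions as above, and all values are placeholders.

```python
evaluations = galtea.evaluation_tasks.create(  # hypothetical method name
    version_id="YOUR_VERSION_ID",
    is_production=True,  # real user traffic, so no test_case_id
    input="What is your refund policy?",
    actual_output="Refunds are available within 30 days of purchase.",
    retrieval_context="Refund policy: customers may return items within 30 days...",
    latency=840.0,  # milliseconds
    usage_info={"input_tokens": 212, "output_tokens": 58},
    metrics=[{"name": "Role Adherence"}],
)
```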
Parameters
version_id
The ID of the version you want to evaluate.
metrics
A list of metrics to use for evaluation.

Recommended: the MetricInput dictionary format.

Also supported (legacy):
- By name (string): metrics=["Role Adherence"]
- By custom class (top-level): metrics=[MyCustomMetric()]
The MetricInput dictionary supports the following keys:
- id (string, optional): The ID of an existing metric.
- name (string, optional): The name of the metric.
- score (float | CustomScoreEvaluationMetric, optional): For self-hosted metrics only.
  - If float: Pre-computed score (0.0 to 1.0). Requires id or name in the dict.
  - If CustomScoreEvaluationMetric: The score will be calculated dynamically. The instance must be initialized with name or id. Do NOT provide id or name in the dict when using this option.
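To make the score rules concrete, here are illustrative dictionary shapes (the metric names and IDs are hypothetical, and ExactMatch is the class sketched above):

```python
metrics = [
    {"name": "my-metric", "score": 0.9},      # self-hosted, pre-computed float; name given
    {"id": "metric-id", "score": 0.9},        # self-hosted, pre-computed float; id given
    {"score": ExactMatch(name="my-metric")},  # self-hosted, dynamic; the instance carries the name
    {"name": "Role Adherence"},               # Galtea-hosted; no score key
]
```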
For self-hosted metrics, both score options are equally valid: pre-compute as a float, or use CustomScoreEvaluationMetric for dynamic calculation. Galtea-hosted metrics automatically compute scores and should not include a score field.

actual_output
The actual output produced by the product.
test_case_id
The ID of the test case to be evaluated. Required for non-production evaluations.
input
The input text/prompt. Required for production evaluations where no test_case_id is provided.

is_production
Set to True to indicate the evaluation is from a production environment. Defaults to False.

retrieval_context
The context retrieved by your RAG system that was used to generate the actual_output.

latency
Time in milliseconds from the request to the LLM until the response was received.
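For illustration, one way to capture this value around your own model call (call_llm is a hypothetical placeholder for your product’s request):

```python
import time

prompt = "What is your refund policy?"
start = time.perf_counter()
response = call_llm(prompt)  # hypothetical: your product's LLM request
latency = (time.perf_counter() - start) * 1000.0  # seconds to milliseconds
```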
usage_info
Information about token usage during the model call. Possible keys include:
- input_tokens: Number of input tokens sent to the model.
- output_tokens: Number of output tokens generated by the model.
- cache_read_input_tokens: Number of input tokens read from the cache.
cost_info
Information about the cost per token during the model call. Possible keys include:
- cost_per_input_token: Cost per input token sent to the model.
- cost_per_output_token: Cost per output token generated by the model.
- cost_per_cache_read_input_token: Cost per input token read from the cache.
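Illustrative values for these two dictionaries (the numbers are placeholders, not real prices):

```python
usage_info = {
    "input_tokens": 212,
    "output_tokens": 58,
    "cache_read_input_tokens": 0,
}
cost_info = {
    "cost_per_input_token": 3e-06,
    "cost_per_output_token": 1.5e-05,
    "cost_per_cache_read_input_token": 1.5e-06,
}
```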
conversation_simulator_version
The version of Galtea’s conversation simulator used to generate the user message (input). This should only be provided if the input was generated using the conversation_simulator_service.