Returns
Returns a list of Evaluation objects, one for each metric provided.
Usage
This method is versatile and can be used for two main scenarios:
- Test-Based Evaluation: When you provide a test_case_id, Galtea evaluates your product’s performance against a predefined challenge.
- Production Monitoring: When you set is_production=True and provide an input, Galtea logs and evaluates real user interactions.
Example: Test-Based Evaluation with Standard and Custom Metrics
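The sketch below shows what a test-based evaluation call could look like in Python. The client constructor, the evaluations.create entry point, the version_id, metrics, and actual_output parameter names, and the measure scoring hook on the custom metric are assumptions inferred from the parameter descriptions on this page; adjust them to match the actual SDK signature.

```python
from galtea import Galtea, CustomScoreEvaluationMetric  # import path assumed

# Hypothetical custom metric scored locally; the scoring hook name
# ("measure") and its signature are assumptions.
class ExactMatchMetric(CustomScoreEvaluationMetric):
    def measure(self, input: str, actual_output: str, expected_output: str) -> float:
        # Illustrative scorer: 1.0 on an exact match, 0.0 otherwise.
        return 1.0 if actual_output.strip() == expected_output.strip() else 0.0

galtea = Galtea(api_key="YOUR_API_KEY")  # placeholder credentials

# One Evaluation object is returned per metric in the list.
evaluations = galtea.evaluations.create(  # entry point name assumed
    version_id="version_123",             # the version you want to evaluate
    test_case_id="test_case_456",         # required for non-production evaluations
    actual_output="Paris is the capital of France.",
    metrics=[
        "Role Adherence",                 # standard metric referenced as a string
        ExactMatchMetric(),               # custom, locally-scored metric object
    ],
)

for evaluation in evaluations:
    print(evaluation)
```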
Example: Production Monitoring
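A sketch of a production-monitoring call, reusing the assumed client and entry point from the previous example. The retrieval_context and latency parameter names are inferred from the descriptions under Parameters and may differ in the actual SDK.

```python
from galtea import Galtea  # import path assumed

galtea = Galtea(api_key="YOUR_API_KEY")  # placeholder credentials

# Log and evaluate a real user interaction from production.
evaluations = galtea.evaluations.create(  # entry point name assumed
    version_id="version_123",
    is_production=True,                   # mark this as a production interaction
    input="What is your refund policy?",  # required because no test_case_id is given
    actual_output="You can request a refund within 30 days of purchase.",
    retrieval_context="Refund policy: purchases can be refunded within 30 days.",  # name assumed
    latency=1240,                         # milliseconds from request to response; name assumed
    metrics=["Role Adherence"],
    # Token usage and per-token cost can also be attached; see the
    # token-usage and cost snippets under Parameters below.
)
```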
Parameters
The ID of the version you want to evaluate.
A list of metrics to use for evaluation. You can provide:
- Standard metrics as strings (e.g., “Role Adherence”).
- Custom, locally-scored metrics as objects inheriting from CustomScoreEvaluationMetric.
The actual output produced by the product.
The ID of the test case to be evaluated. Required for non-production evaluations.
The input text/prompt. Required for production evaluations where no test_case_id is provided.
Set to True to indicate the evaluation is from a production environment. Defaults to False.
The context retrieved by your RAG system that was used to generate the actual_output.
Time in milliseconds from the request to the LLM until the response was received.
Information about token usage during the model call.
Possible keys include:
- input_tokens: Number of input tokens sent to the model.
- output_tokens: Number of output tokens generated by the model.
- cache_read_input_tokens: Number of input tokens read from the cache.
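As a sketch, the token-usage payload can be built with the keys above; the usage_info parameter name is an assumption based on this description, and the counts are illustrative.

```python
# Illustrative token-usage payload; pass it on the evaluation call,
# e.g. galtea.evaluations.create(..., usage_info=usage_info)  (name assumed).
usage_info = {
    "input_tokens": 812,           # tokens sent to the model
    "output_tokens": 164,          # tokens generated by the model
    "cache_read_input_tokens": 0,  # tokens served from the prompt cache
}
```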
Information about the cost per token during the model call.
Possible keys include:
- cost_per_input_token: Cost per input token sent to the model.
- cost_per_output_token: Cost per output token generated by the model.
- cost_per_cache_read_input_token: Cost per input token read from the cache.
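Similarly, a sketch of the per-token cost payload using the keys above; the cost_info parameter name and the prices shown are assumptions.

```python
# Illustrative per-token cost payload; pass it alongside the usage payload,
# e.g. galtea.evaluations.create(..., cost_info=cost_info)  (name assumed).
cost_info = {
    "cost_per_input_token": 0.0000025,              # e.g. $2.50 per 1M input tokens
    "cost_per_output_token": 0.00001,               # e.g. $10.00 per 1M output tokens
    "cost_per_cache_read_input_token": 0.00000125,  # e.g. $1.25 per 1M cached input tokens
}
```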
The version of Galtea’s conversation simulator used to generate the user message (input). This should only be provided if the input was generated using the conversation_simulator_service.