Create Evaluation Task
Create an evaluation task for a given evaluation.
Returns
Returns a list of EvaluationTask objects for the given evaluation and test case. Returns None if an error occurs.
Example
See an example of running evaluation tasks in our Run Evaluations example.
Parameters
The ID of the evaluation you want to create a task for.
The metrics to use for the evaluation.
The system will create a task for each metric provided.
The ID of the test case you want to use for the evaluation task. While optional, using test_case_id is highly recommended as it allows for better tracking and reusability of test data. If not provided, you may need to supply input, expected_output, and context directly (see warning below).
The actual output produced by the product.
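For orientation, a minimal sketch of a call is shown below. The client construction and the evaluation_id parameter name are assumptions inferred from the descriptions on this page, not confirmed signatures; metrics, test_case_id, and actual_output are documented here.

```python
from galtea import Galtea  # assumed import path for the Galtea SDK

galtea = Galtea(api_key="YOUR_API_KEY")  # hypothetical client setup

# One evaluation task is created per metric in `metrics`.
tasks = galtea.evaluation_tasks.create(
    evaluation_id="your-evaluation-id",  # assumed parameter name
    metrics=["accuracy", "relevance"],
    test_case_id="your-test-case-id",
    actual_output="Paris is the capital of France.",
)
```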
Precomputed scores for the evaluation tasks, corresponding to the provided metrics. This list must be the same length as the metrics list. Each item should be a number between 0 and 1, or None if the platform should evaluate that specific metric. Useful for deterministic metrics. Example: [0.8, None, 0.95].
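If, for example, the precomputed-scores parameter is named scores (an assumption; only its behavior is documented above), deterministic metrics can be scored locally while the platform evaluates the rest:

```python
# Hypothetical sketch: score two deterministic metrics locally,
# leave the third as None so the platform evaluates it.
tasks = galtea.evaluation_tasks.create(
    evaluation_id="your-evaluation-id",  # assumed parameter name
    metrics=["exact_match", "bleu", "coherence"],
    test_case_id="your-test-case-id",
    actual_output="Paris is the capital of France.",
    scores=[1.0, 0.87, None],            # assumed name; same length as `metrics`
)
```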
The context retrieved by your RAG system that was used to generate the actual_output.
A list of previous conversation turns, each a dictionary with "input" and "actual_output" keys. Used for evaluating conversational AI.
Example: [{"input": "Hello", "actual_output": "Hi there!"}, {"input": "How are you?", "actual_output": "I'm doing well, thanks!"}]
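A sketch of a conversational task follows; conversation_turns is an assumed parameter name based on the description above, and the metric name is hypothetical:

```python
# Hypothetical sketch for a conversational evaluation task.
turns = [
    {"input": "Hello", "actual_output": "Hi there!"},
    {"input": "How are you?", "actual_output": "I'm doing well, thanks!"},
]

tasks = galtea.evaluation_tasks.create(
    evaluation_id="your-evaluation-id",       # assumed parameter name
    metrics=["conversation_quality"],         # hypothetical metric name
    test_case_id="your-test-case-id",
    actual_output="I'm doing well, thanks!",  # output of the latest turn
    conversation_turns=turns,                 # assumed parameter name
)
```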
Time elapsed (in ms) from the moment the request was sent to the LLM to the moment the response was received.
Token usage information for the LLM call. Keys must be snake_case. Possible keys: input_tokens, output_tokens, cache_read_input_tokens.
Cost information for the LLM call. Keys must be snake_case. Possible keys: cost_per_input_token, cost_per_output_token, cost_per_cache_read_input_token.
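Combining the operational fields, a sketch with assumed parameter names for the latency, token-usage, and cost arguments might look like this; only the dictionary keys are taken from the lists above:

```python
# Hypothetical sketch: attach latency, token usage, and cost data.
tasks = galtea.evaluation_tasks.create(
    evaluation_id="your-evaluation-id",  # assumed parameter name
    metrics=["accuracy"],
    test_case_id="your-test-case-id",
    actual_output="Paris is the capital of France.",
    latency=412,                         # assumed name; ms from request sent to response received
    usage_info={                         # assumed name; keys documented above
        "input_tokens": 250,
        "output_tokens": 48,
        "cache_read_input_tokens": 0,
    },
    cost_info={                          # assumed name; keys documented above
        "cost_per_input_token": 0.000002,
        "cost_per_output_token": 0.000006,
    },
)
```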
The parameters input, expected_output, and context for evaluation_tasks.create() are deprecated. It is highly recommended to use test_case_id instead. You can create a test case using galtea.test_cases.create() and then pass its ID. This allows for better tracking and reusability of test data. If test_case_id is not provided, you may need to provide these deprecated parameters.
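A sketch of the recommended flow: create a test case with galtea.test_cases.create() (referenced above) and pass its ID. The test_cases.create() parameter names and the returned object's id attribute are assumptions:

```python
# Recommended flow (hypothetical parameter names for test_cases.create()).
test_case = galtea.test_cases.create(
    test_id="your-test-id",              # assumed parameter name
    input="What is the capital of France?",
    expected_output="Paris",
    context="France is a country in Western Europe; its capital is Paris.",
)

tasks = galtea.evaluation_tasks.create(
    evaluation_id="your-evaluation-id",  # assumed parameter name
    metrics=["accuracy"],
    test_case_id=test_case.id,           # assumes the returned object exposes `id`
    actual_output="Paris is the capital of France.",
)
```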