The Galtea SDK also supports creating evaluation tasks directly from your production environment using the create_from_production method. This is useful for ongoing monitoring and analysis of real user interactions, including past ones.

Returns

Returns a list of EvaluationTask objects for the given version and user input.

Example

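# Assumes an initialized client, e.g. galtea = Galtea(api_key="YOUR_API_KEY")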
evaluation_tasks = galtea.evaluation_tasks.create_from_production(
    version_id="YOUR_VERSION_ID",
    metrics=["accuracy_v1", "coherence_v1"],
    input=user_query,
    actual_output=model_answer,
    retrieval_context=retrieved_context,
    context=conversation_context,
    conversation_turns=[{"input": past_user_query, "actual_output": past_model_answer}],
    latency=latency,
    usage_info={
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cache_read_input_tokens": cache_read_input_tokens,
    },
    cost_info={
        "cost_per_input_token": cost_per_input_token,
        "cost_per_output_token": cost_per_output_token,
        "cost_per_cache_read_input_token": cost_per_cache_read_input_token,
    },
)

See our Monitor Production Responses to User Queries example for a complete walkthrough of evaluating your product's responses in production.

Parameters

version_id
string
required

The ID of the version of the product you want to evaluate.

metrics
list[string]
required

The metrics to use for the evaluation.

The system will create a task for each metric provided.
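Because one task is created per metric, the length of the returned list matches the number of metrics passed. A minimal sketch, reusing the client and variables from the example above (metric names are illustrative):

metrics = ["accuracy_v1", "coherence_v1"]
evaluation_tasks = galtea.evaluation_tasks.create_from_production(
    version_id="YOUR_VERSION_ID",
    metrics=metrics,
    input=user_query,
    actual_output=model_answer,
)
assert len(evaluation_tasks) == len(metrics)  # one EvaluationTask per metric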

input
string
required

The real user query that your product handled in production.

actual_output
string
required

The actual output produced by the product.

context
string

Additional data or broader conversational context that was available and relevant when the actual_output was generated in production.

retrieval_context
string

The context retrieved by your RAG system that was used to generate the actual output.

conversation_turns
list[dict[string, string]]

A list of previous conversation turns, each a dictionary with "input" and "actual_output" keys. This is used when evaluating conversational AI. Example: [{"input": "Hello", "actual_output": "Hi there!"}, {"input": "How are you?", "actual_output": "I'm doing well, thanks!"}]

latency
float

Time elapsed (in ms) from the moment the request was sent to the LLM to the moment the response was received.
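
If you do not already record latency, you can measure it around the LLM call itself. A minimal sketch, where call_llm is a placeholder for your own client code:

import time

start = time.perf_counter()
model_answer = call_llm(user_query)  # placeholder for your LLM request
latency = (time.perf_counter() - start) * 1000  # seconds -> milliseconds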

usage_info
dict[str, float]

Token usage information for the LLM call. Keys must be snake_case. Possible keys: input_tokens, output_tokens, cache_read_input_tokens.
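
Most providers return token counts with each response. A hedged sketch of building this dictionary, assuming an OpenAI-style response object (attribute names vary by provider):

usage_info = {
    "input_tokens": response.usage.prompt_tokens,
    "output_tokens": response.usage.completion_tokens,
    # add "cache_read_input_tokens" here if your provider reports cached tokens
}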

cost_info
dict[str, float]

Cost information for the LLM call. Keys must be snake_case. Possible keys: cost_per_input_token, cost_per_output_token, cost_per_cache_read_input_token.
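
Note that these are per-token prices, not totals for the call. A sketch with illustrative rates (replace with your model's actual pricing):

cost_info = {
    "cost_per_input_token": 0.0000025,              # illustrative USD per input token
    "cost_per_output_token": 0.00001,               # illustrative USD per output token
    "cost_per_cache_read_input_token": 0.00000125,  # illustrative USD per cached input token
}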