This method evaluates an entire conversation stored in a Session by creating evaluation tasks for each of its Inference Results.

Returns

Returns a list of EvaluationTask objects, one for each combination of metric and inference result in the session.

Example

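# The example assumes an already initialized client, e.g.:
# from galtea import Galtea
# galtea = Galtea(api_key="YOUR_API_KEY")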
# First, create a session and log inference results
session = galtea.sessions.create(version_id="YOUR_VERSION_ID")
galtea.inference_results.create_batch(
    session_id=session.id,
    conversation_turns=[
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello!"},
        {"role": "user", "content": "How are you?"},
        {"role": "assistant", "content": "I am fine, thank you."}
    ]
)

# Now, evaluate the entire session
evaluation_tasks = galtea.evaluation_tasks.create(
    session_id=session.id,
    metrics=["coherence", "conversation-relevancy"]
)
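Since one task is created per metric for each inference result, the snippet above would typically yield four tasks (2 metrics × 2 user/assistant pairs), assuming each pair is logged as one inference result. A minimal sketch of inspecting the returned tasks (the attribute name below is an assumption for illustration):

# Each EvaluationTask pairs one metric with one inference result.
# The task.id attribute is assumed here for illustration.
print(len(evaluation_tasks))  # e.g., 4 = 2 metrics x 2 inference results
for task in evaluation_tasks:
    print(task.id)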

Parameters

session_id
string
required

The ID of the session containing the inference results to be evaluated.

metrics
list[string]
required

A list of metric type names to use for the evaluation. Tasks will be created for each metric against each inference result.

scores
list[float | None]

A list of pre-computed scores positionally matching the metrics list. Use None for any metric that Galtea should evaluate itself. This is useful for providing scores from your own custom, deterministic metrics.
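
For example, to supply a pre-computed score for a custom deterministic metric while letting Galtea evaluate the other (the metric name "my-custom-metric" is a placeholder):

# scores aligns positionally with metrics:
# "coherence" is evaluated by Galtea (None); the custom metric
# receives the pre-computed score 0.87.
evaluation_tasks = galtea.evaluation_tasks.create(
    session_id=session.id,
    metrics=["coherence", "my-custom-metric"],
    scores=[None, 0.87]
)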