# Evaluating Conversations
Learn how to evaluate multi-turn conversations using Galtea’s session-based workflow.
To accurately evaluate interactions within a dialogue, you can use Galtea’s session-based workflow. This approach allows you to log an entire conversation and then run evaluations on all of its turns at once.
Certain metrics are designed specifically for conversational analysis and require the full conversation context:
- Role Adherence: Measures how well the AI stays within its defined role.
- Knowledge Retention: Assesses the model’s ability to remember and use information from previous turns.
- Conversation Completeness: Evaluates whether the conversation has reached a natural and informative conclusion.
- Conversation Relevancy: Assesses whether each turn in the conversation is relevant to the ongoing topic.
## The Session-Based Workflow
### Create a Session
A Session acts as a container for all the turns in a single conversation. You create one at the beginning of an interaction.
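A minimal sketch, assuming the Galtea Python SDK exposes a `sessions.create()` method with `version_id` and `test_case_id` parameters; the method and parameter names are illustrative assumptions, so check them against the SDK reference:

```python
from galtea import Galtea  # assumes the Galtea Python SDK is installed

galtea = Galtea(api_key="YOUR_API_KEY")

# Create a container for all the turns of one conversation.
# version_id and test_case_id are placeholder values.
session = galtea.sessions.create(
    version_id="YOUR_VERSION_ID",
    test_case_id="YOUR_TEST_CASE_ID",
)
```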
### Log Inference Results
Each user input and model output pair is an Inference Result. You can log these turns individually or in a single batch call after the conversation ends. Using a batch call is more efficient.
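As a sketch, assuming a batch method such as `inference_results.create_batch()` that accepts the session ID and a list of input/output pairs (the method name and payload shape are assumptions to verify against the SDK reference):

```python
# Turns collected while the conversation ran, logged in one batch call.
conversation_turns = [
    {"input": "Hi, I'd like to change my flight.",
     "output": "Sure, could you share your booking reference?"},
    {"input": "It's ABC123.",
     "output": "Thanks! Which new date would you like?"},
]

galtea.inference_results.create_batch(  # assumed batch method name
    session_id=session.id,
    conversation_turns=conversation_turns,
)
```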
### Evaluate the Session
Once the session is logged, you can create evaluation tasks for the entire conversation using the `evaluation_tasks.create()` method. This will generate tasks for each turn against the specified metrics.
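A sketch of that call. The `evaluation_tasks.create()` method comes from this guide, while the `session_id` and `metrics` parameter names are assumptions to verify against the SDK reference:

```python
# One evaluation task is generated per turn, per metric.
galtea.evaluation_tasks.create(
    session_id=session.id,  # assumed parameter name
    metrics=[
        "Role Adherence",
        "Knowledge Retention",
        "Conversation Completeness",
        "Conversation Relevancy",
    ],
)
```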
## Example
This example demonstrates logging and evaluating a multi-turn conversation from a test case.
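Below is a hedged end-to-end sketch. Apart from `evaluation_tasks.create()` and `test_case_id`, which this guide names, the client class, method names (`sessions.create`, `inference_results.create_batch`), parameter names, and the `your_agent` helper are illustrative assumptions; check them against the SDK reference before use.

```python
from galtea import Galtea  # assumes the Galtea Python SDK is installed

galtea = Galtea(api_key="YOUR_API_KEY")

def your_agent(user_input: str, history: list) -> str:
    # Hypothetical stand-in for your own model or agent call.
    return f"(model reply to: {user_input})"

# 1. Create a session tied to a test case (IDs are placeholders).
session = galtea.sessions.create(
    version_id="YOUR_VERSION_ID",
    test_case_id="YOUR_TEST_CASE_ID",
)

# 2. Run the conversation and collect each user input / model output pair.
user_inputs = [
    "Hi, I'd like to change my flight.",
    "My booking reference is ABC123.",
    "Move it to June 9th, please.",
]
history = []
conversation_turns = []
for user_input in user_inputs:
    output = your_agent(user_input, history)
    history.append({"input": user_input, "output": output})
    conversation_turns.append({"input": user_input, "output": output})

# Log every turn in a single batch call (assumed method name).
galtea.inference_results.create_batch(
    session_id=session.id,
    conversation_turns=conversation_turns,
)

# 3. Evaluate all turns against the conversational metrics.
galtea.evaluation_tasks.create(
    session_id=session.id,
    metrics=[
        "Role Adherence",
        "Knowledge Retention",
        "Conversation Completeness",
        "Conversation Relevancy",
    ],
)
```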
This workflow can also be used for production monitoring by creating a session with `is_production=True` and omitting the `test_case_id`. See the Monitor Production Responses guide for an example.
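For instance, a production session might be created like this (a sketch under the same assumptions as above; `sessions.create` and `version_id` are illustrative):

```python
# Production monitoring: flag the session as production traffic
# and omit test_case_id entirely.
prod_session = galtea.sessions.create(
    version_id="YOUR_VERSION_ID",
    is_production=True,
)
```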