Evaluating Conversations

To accurately evaluate interactions within a dialogue, you can use Galtea’s session-based workflow. This approach allows you to log an entire conversation and then run evaluations on all of its turns at once. Certain metrics are specifically designed for conversational analysis and require the full context:

Role Adherence: Measures how well the AI stays within its defined role.
Knowledge Retention: Assesses the model’s ability to remember and use information from previous turns.
Conversation Completeness: Evaluates whether the conversation has reached a natural and informative conclusion.
Conversation Relevancy: Assesses whether each turn in the conversation is relevant to the ongoing topic.

The Session-Based Workflow

Create a Session

A Session acts as a container for all the turns in a single conversation. You create one at the beginning of an interaction.

Log Inference Results

Each user input and model output pair is an Inference Result. You can log these turns individually or in a single batch call after the conversation ends. Using a batch call is more efficient.

Evaluate the Session

Once the session is logged, you can create evaluation tasks for the entire conversation using the evaluation_tasks.create() method. This will generate tasks for each turn against the specified metrics.

Example

This example demonstrates logging and evaluating a multi-turn conversation from a test case.

from galtea import Galtea
import os

galtea = Galtea(api_key=os.getenv("GALTEA_API_KEY"))

YOUR_VERSION_ID = "your_version_id"
YOUR_TEST_CASE_ID = "your_test_case_id"
CONVERSATIONAL_METRICS = [
    "role-adherence",
    "knowledge-retention",
]

# 1. Create a Session linked to a test case
session = galtea.sessions.create(
    version_id=YOUR_VERSION_ID,
    test_case_id=YOUR_TEST_CASE_ID,
)
print(f"Created Session: {session.id}")

# 2. Log the conversation turns.
# In a real scenario, you would dynamically collect these from your product's interaction.
conversation_turns = [
    {'role': 'user', 'content': 'Hello, what can you do?'},
    {'role': 'assistant', 'content': 'I can help you with all your queries related to our services.', 'retrieval_context': None},
    {'role': 'user', 'content': "What's your return policy?"},
    {'role': 'assistant', 'content': "Our return policy allows returns within 30 days.", 'retrieval_context': "Returns are accepted within 30 days of purchase."},
    {'role': 'user', 'content': "What if I lost the receipt?"},
    {'role': 'assistant', 'content': "A proof of purchase is required for all returns.", 'retrieval_context': "Proof of purchase is required for all returns."},
]

# Use create_batch for efficiency
galtea.inference_results.create_batch(
    session_id=session.id,
    conversation_turns=conversation_turns
)

# 3. Evaluate the entire session at once
evaluation_tasks = galtea.evaluation_tasks.create(
    session_id=session.id,
    metrics=CONVERSATIONAL_METRICS
)
print(f"Submitted {len(evaluation_tasks)} evaluation tasks for session {session.id}")

This workflow can also be used for production monitoring by creating a session with is_production=True and omitting the test_case_id. See the Monitor Production Responses guide for an example.

Learn More

Session

A full conversation between a user and an AI system.

Inference Result

A single turn in a conversation between a user and the AI.

Evaluation

A group of evaluable Inference Results from a particular session

Evaluation Task

The assessment of an evaluation using a specific metric type’s criteria

Getting Started

Tutorials

Integrations

Evaluating Conversations

The Session-Based Workflow

Example

Learn More

Session

Inference Result

Evaluation

Evaluation Task

Getting Started

Tutorials

Integrations

​The Session-Based Workflow

​Example

​Learn More

Session

Inference Result

Evaluation

Evaluation Task

The Session-Based Workflow

Example

Learn More