- Role Adherence: Measures how well the AI stays within its defined role.
- Knowledge Retention: Assesses the model’s ability to remember and use information from previous turns.
- Conversation Completeness: Evaluates whether the conversation has reached a natural and informative conclusion.
- Conversation Relevancy: Assesses whether each turn in the conversation is relevant to the ongoing topic.
The Session-Based Workflow
Create a Session
A Session acts as a container for all the turns in a single conversation. You create one at the beginning of an interaction.
Log Inference Results
Each user input and model output pair is an Inference Result. You can log these turns individually or in a single batch call after the conversation ends. Using a batch call is more efficient.
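The relationship between a session and its inference results can be sketched with plain in-memory stand-ins. The `Session` and `InferenceResult` classes and the `log_turn` and `log_batch` methods below are hypothetical illustrations of the data model, not the Galtea SDK's API; consult the SDK reference for the real calls.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class InferenceResult:
    """One turn: a user input paired with the model output (hypothetical stand-in)."""
    index: int
    actual_input: str
    actual_output: str


@dataclass
class Session:
    """Container for every turn of one conversation (hypothetical stand-in)."""
    session_id: str
    turns: list[InferenceResult] = field(default_factory=list)

    def log_turn(self, user_input: str, model_output: str) -> InferenceResult:
        # Log a single turn as soon as it happens.
        result = InferenceResult(len(self.turns), user_input, model_output)
        self.turns.append(result)
        return result

    def log_batch(self, pairs: list[tuple[str, str]]) -> list[InferenceResult]:
        # Log all turns in one call after the conversation ends.
        return [self.log_turn(user_input, model_output) for user_input, model_output in pairs]


session = Session(session_id="demo-session")
session.log_turn("Hi, can you help me reset my password?", "Sure. What is your username?")
session.log_batch([
    ("It's jdoe.", "Thanks. I sent a reset link to the email on file."),
    ("Got it, thank you!", "You're welcome."),
])
```

In this sketch the batch method simply loops over the pairs. Against a remote service, a batch call would typically replace several requests with one.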
Define your agent function
First, define an agent function that connects Galtea to your product. Three signatures are available:
- Simple
- Chat History
- Structured
The Simple signature is the quickest way to get started. Your function receives just the latest user message as a string.
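A minimal sketch of the simple signature, in sync and async form. The function names and the canned replies are placeholders for a call into your own product, and `call_agent` is a hypothetical helper showing how a caller could accept either form; it is not part of the Galtea SDK.

```python
import asyncio
import inspect


def my_agent(user_message: str) -> str:
    # Replace this body with a call into your own product.
    return f"You said: {user_message}"


async def my_async_agent(user_message: str) -> str:
    # Async variant, e.g. for products reached over the network.
    await asyncio.sleep(0)  # placeholder for real async I/O
    return f"You said: {user_message}"


def call_agent(agent, user_message: str) -> str:
    # Hypothetical helper: run either a sync or an async agent function.
    if inspect.iscoroutinefunction(agent):
        return asyncio.run(agent(user_message))
    return agent(user_message)


print(call_agent(my_agent, "hello"))
print(call_agent(my_async_agent, "hello"))
```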
All three signatures work with evaluations.run(), inference_results.generate(), and simulator.simulate(). Both sync and async functions are supported. The SDK auto-detects which signature you’re using from the type hint on the first parameter.
For the full list of fields available on AgentInput (including structured input access via message metadata), see the AgentInput reference.
Determine your use case
- Test-based evaluation
- Past conversations (offline ingestion)
- Monitoring (production)
Test-based evaluation: use this when you have test cases. It requires test_case_id and is often combined with the Conversation Simulator to generate turns. See the full workflow in Simulating Conversations.
Custom Metrics with Full Conversation Access
When you use CustomScoreEvaluationMetric, your measure() method always receives an inference_results parameter containing InferenceResult objects. For session evaluations this includes all turns; for single inference result evaluations it contains one item. This enables conversation-level scoring (e.g., consistency checks, cross-turn analysis).
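As an illustration of cross-turn analysis, the sketch below scores a conversation by the share of turns whose output is not a verbatim repeat of an earlier one. It is self-contained: `InferenceResult` is a stand-in dataclass, `RepeatedAnswerMetric` is a hypothetical plain class rather than a real CustomScoreEvaluationMetric subclass, and the exact `measure()` signature and the 0 to 1 score range are assumptions.

```python
from dataclasses import dataclass


@dataclass
class InferenceResult:
    """Stand-in exposing a few of the fields the real objects provide."""
    index: int
    actual_input: str
    actual_output: str


class RepeatedAnswerMetric:
    """Hypothetical metric: share of turns whose output is not a verbatim repeat."""

    def measure(self, *args, inference_results=None, **kwargs) -> float:
        # Order turns by their position in the conversation.
        turns = sorted(inference_results or [], key=lambda r: r.index)
        if not turns:
            return 0.0
        seen = set()
        fresh = 0
        for turn in turns:
            normalized = turn.actual_output.strip().lower()
            if normalized not in seen:
                fresh += 1
            seen.add(normalized)
        return fresh / len(turns)


conversation = [
    InferenceResult(0, "Where is my order?", "It ships tomorrow."),
    InferenceResult(1, "Can I change the address?", "It ships tomorrow."),
    InferenceResult(2, "Can I change the address?", "Yes, until it ships."),
    InferenceResult(3, "Thanks", "You're welcome."),
]
score = RepeatedAnswerMetric().measure(inference_results=conversation)
print(score)  # 0.75
```

Because the method only reads the list it is given, the same logic covers a single inference result evaluation, where the list holds one item and the score is 1.0.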
Each InferenceResult object provides actual_input, actual_output, retrieval_context, latency, index, and other fields. See the custom metrics tutorial for more on custom metrics.
Learn More
Session
A full conversation between a user and an AI system.
Inference Result
A single turn in a conversation between a user and the AI.
Evaluation
The assessment of a session or a single inference result using a specific metric’s criteria.
Conversation Simulator
Test your conversational AI with simulated multi-turn conversations