Documentation Index
Fetch the complete documentation index at: https://docs.galtea.ai/llms.txt
Use this file to discover all available pages before exploring further.
How Everything Connects
The dotted lines represent relationships that don’t exist for production-based evaluations, where there is no Test or Test Case and the Session is generated from real user interactions.
The diagram focuses on core concepts. Some related concepts are omitted for clarity — see the Additional Notes section.
At a Glance
Galtea’s concepts split into four lifecycle scopes (see the sketch after this list):
- What you’re testing: Your Product, defined by a set of Specifications and iterated on through new Versions.
- What you test with: Tests that group Test Cases; your Product’s performance against them is measured with Metrics.
- What your Product generates: Sessions, each a sequence of Inference Results along with their Traces.
- How you measure performance: Evaluations that apply a Metric to a Session or to a specific Inference Result.
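The following Python sketch makes these scopes concrete. It is purely illustrative and is not the Galtea SDK: every class and field name below is an assumption chosen to mirror the concept names above.

```python
# Illustrative only: these dataclasses are NOT the Galtea SDK. They sketch how the
# core concepts relate, using assumed names, to make the four scopes concrete.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Product:
    name: str
    specifications: list[str] = field(default_factory=list)  # testable behavioral expectations
    versions: list["Version"] = field(default_factory=list)  # iterations of the product


@dataclass
class Version:
    product: Product
    label: str  # e.g. "v2-rag-rerank"


@dataclass
class TestCase:
    test_name: str  # the Test this challenge belongs to
    prompt: str     # the challenge presented to the product


@dataclass
class InferenceResult:
    input_text: str
    output_text: str
    traces: list[str] = field(default_factory=list)  # internal operations behind this turn


@dataclass
class Session:
    version: Version
    turns: list[InferenceResult] = field(default_factory=list)
    test_case: Optional[TestCase] = None  # None for production sessions (real user traffic)


@dataclass
class Evaluation:
    metric: str                # e.g. "faithfulness"
    session: Session           # every Evaluation belongs to a Session
    inference_result: Optional[InferenceResult] = None  # set only for per-turn Evaluations
    score: Optional[float] = None
```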
Additional Notes
A few relationships are easy to misread, and some are omitted from the diagram for clarity. These notes fill in the gaps:
- Specifications are the glue. A Specification links to one or more Metrics so it can be evaluated, and Tests can be generated from a Specification to challenge it.
- Test Case ↔ Session. A Test Case belongs to a Test. When a Session is generated to simulate a Test Case, it references that Test Case — so the same Test Case can show up across many Sessions (for example, re-running a Test across multiple Versions). Sessions can also be generated without a Test Case, for example when recording real user interactions to evaluate the Product in production.
- Endpoint Connections shape Test generation. When you create a Test with an Endpoint Connection selected, the generated Test Cases are produced to match that endpoint’s expected input format.
- Per-turn evaluations. An Evaluation can target a single Inference Result instead of the whole Session. This is useful when you want to score, for example, how well the agent’s output reflected the retrieval context the RAG system provided on that turn. Per-turn Evaluations still belong to the Session that contains the turn; they additionally point at the specific Inference Result being scored (see the sketch after this list).
- Two types of model. The standard Model is what your Product runs on — linked to a Version and used to track costs, token usage, and similar. The evaluator model is the LLM Galtea uses to perform the judgment for non-deterministic Metrics; it’s selected by name on the Metric, not on the Version. See Models vs. Evaluator Models on the Model page for the full distinction.
- User Groups for human evaluation. A User Group routes Human Evaluation Metrics to a specific set of annotators, so only the right reviewers see those Evaluations on the dashboard.
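To illustrate the Test Case and per-turn points above, the snippet below reuses the illustrative (non-SDK) classes from the earlier sketch: a production-style Session with no Test Case carries both a session-level Evaluation and a per-turn Evaluation that points at a single Inference Result. The metric names are assumptions used only for the example.

```python
# Continuing the illustrative (non-SDK) classes from the sketch above.
turn_1 = InferenceResult(
    input_text="Where is my order?",
    output_text="Your order shipped yesterday.",
    traces=["retrieval: order lookup", "generation"],
)

# A production session: generated from real user traffic, so no Test Case is attached.
session = Session(
    version=Version(product=Product(name="support-bot"), label="v2"),
    turns=[turn_1],
)

# Session-level: the metric is applied to the whole conversation.
coherence = Evaluation(metric="conversation-coherence", session=session)

# Per-turn: the Evaluation still belongs to the Session, but also points at one turn.
faithfulness = Evaluation(metric="faithfulness", session=session, inference_result=turn_1)
```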
All Concepts
Product
A functionality or service being evaluated
Specification
A testable behavioral expectation for a product
Version
A specific iteration of a product
Endpoint Connection
A reusable connection to an external endpoint for AI product evaluation
Test
A set of test cases for evaluating product performance
Test Case
A single challenge within a test for evaluating product performance
Session
A full conversation between a user and an AI system
Inference Result
A single turn in a conversation between a user and the AI
Trace
A record of the internal operations your AI agent performs while producing an inference result
Evaluation
The assessment of a session or inference result using a specific metric’s criteria
Metric
Ways to evaluate and score product performance
Model
A way to track the costs and token usage of the models your product runs on