Documentation Index
Fetch the complete documentation index at: https://docs.galtea.ai/llms.txt
Use this file to discover all available pages before exploring further.
How Everything Connects
The dotted lines represent relationships that don’t exist for production-based evaluations, where there is no Test or Test Case and the Session is generated from real user interactions.
The diagram focuses on core concepts. Some related concepts are omitted for clarity — see the Additional Notes section.
At a Glance
Galtea’s concepts split into four lifecycle scopes (see the sketch after this list):
- What you’re testing: Your Product, defined by a set of Specifications and iterated on through new Versions.
- What you test with: Tests that group Test Cases; your Product’s performance against them is measured with Metrics.
- What your Product generates: Sessions, each a sequence of Inference Results along with their Traces.
- How you measure performance: Evaluations that apply a Metric to a Session or to a specific Inference Result.
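The following Python sketch makes these scopes concrete. It is purely illustrative and is not the Galtea SDK: every class and field name below is an assumption chosen to mirror the concept names above.

```python
# Illustrative only: these dataclasses are NOT the Galtea SDK. They sketch how the
# core concepts relate, using assumed names, to make the four scopes concrete.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Product:
    name: str
    specifications: list[str] = field(default_factory=list)  # testable behavioral expectations
    versions: list["Version"] = field(default_factory=list)  # iterations of the product


@dataclass
class Version:
    product: Product
    label: str  # e.g. "v2-rag-rerank"


@dataclass
class TestCase:
    test_name: str  # the Test this challenge belongs to
    prompt: str     # the challenge presented to the product


@dataclass
class InferenceResult:
    input_text: str
    output_text: str
    traces: list[str] = field(default_factory=list)  # internal operations behind this turn


@dataclass
class Session:
    version: Version
    turns: list[InferenceResult] = field(default_factory=list)
    test_case: Optional[TestCase] = None  # None for production sessions (real user traffic)


@dataclass
class Evaluation:
    metric: str                # e.g. "faithfulness"
    session: Session           # every Evaluation belongs to a Session
    inference_result: Optional[InferenceResult] = None  # set only for per-turn Evaluations
    score: Optional[float] = None
```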
Additional Notes
A few relationships are easy to misread, and some are omitted from the diagram for clarity. These notes fill in the gaps:
- Specifications are the glue. A Specification links to one or more Metrics so it can be evaluated, and Tests can be generated from a Specification to challenge it.
- Test Case ↔ Session. A Test Case belongs to a Test. When a Session is generated to simulate a Test Case, it references that Test Case — so the same Test Case can show up across many Sessions (for example, re-running a Test across multiple Versions). Sessions can also be generated without a Test Case, for example when recording real user interactions to evaluate the Product in production.
- Endpoint Connections shape Test generation. When you create a Test with an Endpoint Connection selected, the generated Test Cases are produced to match that endpoint’s expected input format.
- Per-turn evaluations. An Evaluation can target a single Inference Result instead of the whole Session. This is useful when you want to score, for example, how well the agent’s output reflected the retrieval context the RAG system provided on that turn. Per-turn Evaluations still belong to the Session that contains the turn; they additionally point at the specific Inference Result being scored (see the sketch after this list).
- Two types of model. The standard Model is what your Product runs on — linked to a Version and used to track costs, token usage, and similar. The evaluator model is the LLM Galtea uses to perform the judgment for non-deterministic Metrics; it’s selected by name on the Metric, not on the Version. See Models vs. Evaluator Models on the Model page for the full distinction.
- User Groups for human evaluation. A User Group routes Human Evaluation Metrics to a specific set of annotators, so only the right reviewers see those Evaluations on the dashboard.
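To illustrate the Test Case and per-turn points above, the snippet below reuses the illustrative (non-SDK) classes from the earlier sketch: a production-style Session with no Test Case carries both a session-level Evaluation and a per-turn Evaluation that points at a single Inference Result. The metric names are assumptions used only for the example.

```python
# Continuing the illustrative (non-SDK) classes from the sketch above.
turn_1 = InferenceResult(
    input_text="Where is my order?",
    output_text="Your order shipped yesterday.",
    traces=["retrieval: order lookup", "generation"],
)

# A production session: generated from real user traffic, so no Test Case is attached.
session = Session(
    version=Version(product=Product(name="support-bot"), label="v2"),
    turns=[turn_1],
)

# Session-level: the metric is applied to the whole conversation.
coherence = Evaluation(metric="conversation-coherence", session=session)

# Per-turn: the Evaluation still belongs to the Session, but also points at one turn.
faithfulness = Evaluation(metric="faithfulness", session=session, inference_result=turn_1)
```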
All Concepts
Product
A functionality or service being evaluated
Specification
A testable behavioral expectation for a product
Version
A specific iteration of a product
Endpoint Connection
A reusable connection to an external endpoint for AI product evaluation
Test
A set of test cases for evaluating product performance
Test Case
A single challenge within a test for evaluating product performance
Session
A full conversation between a user and an AI system
Inference Result
A single turn in a conversation between a user and the AI
Trace
A record of the internal operations your AI agent performs while producing an inference result
Evaluation
The assessment of a session or inference result using a specific metric’s criteria
Metric
Ways to evaluate and score product performance
Model
A way to track the costs and token usage of the models your product runs on