Documentation Index

Fetch the complete documentation index at: https://docs.galtea.ai/llms.txt

Use this file to discover all available pages before exploring further.

How Everything Connects

The dotted lines represent relationships that don’t exist for production-based evaluations, where there is no Test or Test Case and the Session is generated from real user interactions.
The diagram focuses on core concepts. Some related concepts are omitted for clarity — see the Additional Notes section.

At a Glance

Galtea’s concepts fall into four lifecycle scopes, shown in the diagram above.

Additional Notes

A few relationships are easy to misread, and some are omitted from the diagram for clarity. These notes fill in the gaps:
  • Specifications are the glue. A Specification links to one or more Metrics so it can be evaluated, and Tests can be generated from a Specification to challenge it.
  • Test Case ↔ Session. A Test Case belongs to a Test. When a Session is generated to simulate a Test Case, it references that Test Case — so the same Test Case can show up across many Sessions (for example, re-running a Test across multiple Versions). Sessions can also be generated without a Test Case, for example when recording real user interactions to evaluate the Product in production.
  • Endpoint Connections shape Test generation. When you create a Test with an Endpoint Connection selected, the generated Test Cases are produced to match that endpoint’s expected input format.
  • Per-turn evaluations. An Evaluation can target a single Inference Result instead of the whole Session — useful when you want to score, for example, how the agent’s output reflected the retrieval context the RAG system provided on that turn. Per-turn Evaluations still belong to the Session that contains the turn; they additionally point at the specific Inference Result being scored.
  • Two types of model. The standard Model is what your Product runs on — linked to a Version and used to track costs, token usage, and similar. The evaluator model is the LLM Galtea uses to perform the judgment for non-deterministic Metrics; it’s selected by name on the Metric, not on the Version. See Models vs. Evaluator Models on the Model page for the full distinction.
  • User Groups for human evaluation. A User Group routes Human Evaluation Metrics to a specific set of annotators, so only the right reviewers see those Evaluations on the dashboard.
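The relationships in these notes can be sketched as a small data model. This is an illustrative sketch only — the class and field names below are hypothetical and are not the Galtea SDK's API:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TestCase:
    test_id: str          # a Test Case always belongs to a Test
    challenge: str

@dataclass
class InferenceResult:
    turn_index: int       # a single turn within a Session
    output: str

@dataclass
class Session:
    product_id: str
    version_id: str
    # None for production Sessions built from real user interactions
    test_case: Optional[TestCase] = None
    turns: list[InferenceResult] = field(default_factory=list)

@dataclass
class Evaluation:
    session: Session      # every Evaluation belongs to a Session
    metric_name: str
    # set only for per-turn Evaluations; otherwise the whole Session is scored
    inference_result: Optional[InferenceResult] = None

# The same Test Case can appear across many Sessions, e.g. when
# re-running a Test against multiple Versions:
tc = TestCase(test_id="t1", challenge="Ask for a refund politely")
s_v1 = Session(product_id="p1", version_id="v1", test_case=tc)
s_v2 = Session(product_id="p1", version_id="v2", test_case=tc)

# A production Session has no Test Case:
prod = Session(product_id="p1", version_id="v2")
```

Note how a per-turn Evaluation still carries its `session` reference and only *additionally* points at one `inference_result`, matching the per-turn behavior described above.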

All Concepts

Product

A functionality or service being evaluated

Specification

A testable behavioral expectation for a product

Version

A specific iteration of a product

Endpoint Connection

A reusable connection to an external endpoint for AI product evaluation

Test

A set of test cases for evaluating product performance

Test Case

A single challenge within a test, used to evaluate product performance

Session

A full conversation between a user and an AI system

Inference Result

A single turn in a conversation between a user and the AI

Trace

A record of the internal operations of your AI agents

Evaluation

The assessment of a Session or Inference Result using a specific metric’s criteria

Metric

A way to evaluate and score product performance

Model

A way to track your models’ costs and token usage