Introduction

Galtea is the evaluation platform for AI products. Test accuracy, safety, and behavior — from RAG pipelines to conversational agents to security testing.

Get Started

Integrate the SDK

Install, authenticate, and run your first evaluation in Python.

Run your first evaluation

Quickstart: register a product, run tests, and view results.

Understand the platform

Products, specs, tests, metrics, and evaluations — the full model.

How It Works

Galtea helps you evaluate AI products through a repeatable test-measure-iterate cycle:

Define Specifications

Write specifications: testable behavioral expectations for your product (capabilities, inabilities, policies).

Generate Metrics & Tests

Galtea generates metrics and tests from your specifications, or you create them manually.

Create Version

Define a new Version of your product to track changes over time.

Run Evaluations

Call evaluations.run(), which resolves specs, tests, and metrics automatically.

Analyze & Iterate

Review results in the Analytics dashboard, then iterate with new versions to track improvements.
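The cycle above can be sketched in a few lines of plain Python. This is a toy illustration, not the Galtea SDK: `Specification`, `Version`, `refuses`, and `run_evaluation` are hypothetical names invented here, and the real `evaluations.run()` signature is not shown because this page does not document it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# All names below are hypothetical stand-ins for illustration; they are not the Galtea SDK.


@dataclass
class Specification:
    description: str                      # a testable behavioral expectation
    test_inputs: List[str]                # inputs that exercise the expectation
    metric: Callable[[str, str], float]   # scores (input, output) between 0 and 1


@dataclass
class Version:
    name: str
    agent: Callable[[str], str]           # the product under test


def refuses(_prompt: str, output: str) -> float:
    """Toy metric: 1.0 if the output reads as a refusal, else 0.0."""
    return 1.0 if "cannot" in output.lower() else 0.0


def run_evaluation(version: Version, specs: List[Specification]) -> Dict[str, float]:
    """Run every test input of every spec and return the mean score per spec."""
    results = {}
    for spec in specs:
        scores = [spec.metric(x, version.agent(x)) for x in spec.test_inputs]
        results[spec.description] = sum(scores) / len(scores)
    return results


# 1. Define a specification (here, a policy) with its tests and metric.
spec = Specification(
    description="Declines to give medical advice",
    test_inputs=["What dose of ibuprofen should I take?", "Diagnose my rash."],
    metric=refuses,
)

# 2. Create two versions of the product to compare.
v1 = Version("v1", agent=lambda p: "Take 400 mg." if "dose" in p else "I cannot diagnose conditions.")
v2 = Version("v2", agent=lambda p: "I cannot give medical advice.")

# 3. Run the evaluation for each version and compare the scores.
for version in (v1, v2):
    print(version.name, run_evaluation(version, [spec]))
```

Running the same specification against each version is what makes the iteration measurable: the score for a given spec is directly comparable from one version to the next.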

Platform Access

You can interact with Galtea through multiple channels:

Web Platform

Manage your products and access insights via the dashboard.

Python SDK

Integrate Galtea into your own code with the Python SDK.

CLI

Drive Galtea from the terminal with the galtea binary.

Agent Skill

Let Claude Code, Cursor, and other AI coding assistants drive Galtea on your behalf.

REST API

Explore the full API reference — every endpoint, parameter, and response schema.

GitHub Actions

Automate your workflows by integrating with GitHub Actions.

Core Concepts

Galtea is built around several key concepts that work together to provide comprehensive evaluation of AI products. For a diagram of how they all connect, start with the Concepts overview.

Concepts overview

How Galtea’s concepts connect — diagram + per-entity quick reference.

Product

A functionality or service being evaluated

Specification

A testable behavioral expectation for a product

Version

A specific iteration of a product

Test

A set of test cases for evaluating product performance

Session

A full conversation between a user and an AI system.

Inference Result

A single turn in a conversation between a user and the AI.

Evaluation

The assessment of a product's output using a specific metric's criteria

Metric

Ways to evaluate and score product performance

Model

A way to keep track of your models' costs
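As a rough mental model, the conversational concepts above can be sketched as plain Python types. The class and field names are assumptions made for illustration and do not reflect the actual Galtea schema; only the relationships stated in the definitions (a session is a full conversation, an inference result is a single turn, an evaluation applies a metric's criteria) are modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical types for illustration only; field names are assumptions, not the Galtea schema.


@dataclass
class InferenceResult:
    user_input: str   # what the user said in this turn
    output: str       # what the AI answered


@dataclass
class Session:
    turns: List[InferenceResult] = field(default_factory=list)  # the full conversation


@dataclass
class Metric:
    name: str
    score: Callable[[InferenceResult], float]  # the metric's scoring criteria


@dataclass
class Evaluation:
    metric_name: str
    score: float


def evaluate_session(session: Session, metric: Metric) -> List[Evaluation]:
    """Produce one Evaluation per turn by applying the metric's criteria."""
    return [Evaluation(metric.name, metric.score(turn)) for turn in session.turns]


session = Session([
    InferenceResult("Hi", "Hello! How can I help?"),
    InferenceResult("Summarize this.", ""),
])
non_empty = Metric("non_empty_answer", lambda turn: 1.0 if turn.output.strip() else 0.0)

evaluations = evaluate_session(session, non_empty)
print([e.score for e in evaluations])
```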
