Every metric in Galtea has an Evaluation Type that determines how responses are scored. Choose the type that matches your scoring needs.

Self-Hosted

This method is for deterministic metrics that you score locally. No evaluation prompt is sent to Galtea. Instead, your custom logic runs on your infrastructure, and the resulting score is uploaded to the platform for tracking and analysis. You can provide the score in two ways via the SDK:
  • Pre-calculated score: Pass a float value directly in the score field of a MetricInput dictionary. This is the simplest method. Example: {'name': 'my-custom-metric', 'score': 0.85}
  • Dynamic score calculation: Use the SDK’s CustomScoreEvaluationMetric class to encapsulate your scoring logic, which will be executed at runtime. Example: {'score': MyCustomMetric(name='my-custom-metric')}
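As a sketch of the pre-calculated approach, the score can be computed by any deterministic logic on your infrastructure before upload. The word-overlap heuristic and metric name below are illustrative, not part of the SDK:

```python
def word_overlap_score(actual_output: str, expected_output: str) -> float:
    """Fraction of expected words that appear in the actual output."""
    expected = set(expected_output.lower().split())
    if not expected:
        return 1.0
    actual = set(actual_output.lower().split())
    return len(expected & actual) / len(expected)

# Compute the score locally, then pass it as a pre-calculated value
# in the MetricInput dictionary.
score = word_overlap_score(
    actual_output="Paris is the capital of France",
    expected_output="The capital of France is Paris",
)
metric_input = {"name": "word-overlap", "score": score}
```

For scoring logic that must run at evaluation time rather than up front, the `CustomScoreEvaluationMetric` approach from the second bullet is the better fit.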

Custom Scores Tutorial

Learn how to implement and use custom scoring metrics with the SDK.

AI Evaluation

An LLM scores responses using a scoring rubric you define. You provide the core evaluation logic (your criteria and rubric) in the judge_prompt, then select the necessary data fields (like input, context, etc.) from the Evaluation Parameters list. Galtea dynamically constructs the final prompt by prepending the selected data to the content of your judge_prompt, ensuring a consistent structure for the evaluator model.
These non-deterministic metrics are powered by Large Language Models acting as judges. Galtea uses the evaluator models that have performed best in our internal benchmarks and testing, and we continuously update them to maintain assessment quality over time.

How It Works

  1. You write the evaluation criteria (what to check for)
  2. You define the scoring rubrics (how to score)
  3. You select which evaluation parameters to include
  4. Galtea automatically constructs the complete evaluation prompt
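The assembly in step 4 can be sketched roughly as follows. The exact format Galtea uses internally may differ; the field labels and function name here are illustrative:

```python
def build_evaluation_prompt(judge_prompt: str, params: dict[str, str]) -> str:
    """Prepend the selected evaluation parameters to the judge prompt."""
    sections = [f"{name.upper()}:\n{value}" for name, value in params.items()]
    return "\n\n".join(sections + [judge_prompt])

prompt = build_evaluation_prompt(
    judge_prompt="**Evaluation Criteria:**\nCheck factual accuracy.",
    params={
        "input": "What is the capital of France?",
        "actual_output": "Paris",
    },
)
```

Because the selected data is always prepended in a consistent structure, your judge_prompt only needs to reference the parameter names (INPUT, ACTUAL_OUTPUT, etc.), not repeat the data itself.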

Example Judge Prompt


judge_prompt = """
**Evaluation Criteria:**
Check if the ACTUAL_OUTPUT is good by comparing it to what was expected. Focus on:
1. Factual accuracy and correctness
2. Completeness of the ACTUAL_OUTPUT, regarding the user INPUT
3. Appropriate use of provided CONTEXT information to answer the user INPUT
4. Overall helpfulness and relevance to the user INPUT

**Rubric:**
Score 1 (Good): The ACTUAL_OUTPUT is accurate, complete, uses information properly, and truly helps the user.
Score 0 (Bad): The ACTUAL_OUTPUT has major errors, missing parts, ignores important info, or doesn't help the user.
"""
When designing judge prompts, be specific about your scoring criteria and reference the evaluation parameters explicitly. This ensures consistent and reliable evaluations.

Best Suited For

Custom AI Evaluation metrics are ideal when standard built-in metrics don’t cover your specific needs:
  • Behavioral Evaluation — Ensuring model behavior aligns with defined product guidelines or safety constraints
  • Policy Adherence — Checking compliance with brand tone, content rules, or moderation policies
  • Security Testing — Validating that responses do not cross ethical, privacy, or safety boundaries
  • Retrieval Use — Evaluating whether the model made appropriate use of retrieved context

AI-Generated Metrics

You can also generate metrics directly from your Specifications. Select one or more specifications, and the AI produces ready-to-use metric candidates — complete with judge prompt, evaluation parameters, tags, and evaluator model. See AI Metric Generation for the full workflow.

Human Evaluation

Human annotators score responses using the evaluation criteria you define. The metric’s judge_prompt serves as the evaluation rubric for annotators, and evaluation_params determines which data fields are displayed to the evaluator. When an evaluation is created for a human-evaluated metric, it enters a PENDING_HUMAN status and waits for a qualified human evaluator to provide a score.

Key Features

  • User Groups: Link user groups to a metric to control which users can annotate evaluations for it.
  • Group Membership: Users in a linked group see pending evaluations for that metric in the Human Evaluations page.
  • Claim & Submit: Evaluators can claim pending evaluations to prevent concurrent editing, then submit their score (0-100 in UI, normalized to 0-1) with an optional reason.
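The normalization in the last bullet is a simple linear mapping; a minimal sketch (the function name is illustrative):

```python
def normalize_ui_score(ui_score: int) -> float:
    """Map an annotator's 0-100 UI score to the 0-1 range stored by the platform."""
    if not 0 <= ui_score <= 100:
        raise ValueError("UI score must be between 0 and 100")
    return ui_score / 100
```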

Use Cases

  • Subjective quality assessments that require human judgment
  • Edge cases where LLM evaluation may be unreliable
  • Building gold standard datasets for training custom evaluators
  • Compliance and audit requirements that mandate human review

SDK Example

# "galtea" is an initialized Galtea SDK client.
quality_reviewers_group = galtea.user_groups.get_by_name(quality_reviewers_group_name)

metric = galtea.metrics.create(
    name=metric_name,
    source="human_evaluation",  # routes evaluations to human annotators
    judge_prompt="Evaluate the quality of the response based on accuracy, completeness, and clarity.",
    evaluation_params=["input", "actual_output", "expected_output"],  # fields shown to annotators
    user_group_ids=[quality_reviewers_group.id],  # only members of this group can annotate
)

Human Evaluation Tutorial

Step-by-step guide to setting up human evaluation with user groups, from metric creation to annotation.

Evaluation Parameters

Full reference of data fields available to evaluators.

Metrics Overview

Browse all available metrics and understand the two metric families.