Self-Hosted
This method is for deterministic metrics that you score locally. No evaluation prompt is sent to Galtea. Instead, your custom logic runs on your infrastructure, and the resulting score is uploaded to the platform for tracking and analysis. You can provide the score in two ways via the SDK:
- Pre-calculated score: Pass a `float` value directly in the `score` field of a `MetricInput` dictionary. This is the simplest method. Example: `{'name': 'my-custom-metric', 'score': 0.85}`
- Dynamic score calculation: Use the SDK’s `CustomScoreEvaluationMetric` class to encapsulate your scoring logic, which is executed at runtime. Example: `{'score': MyCustomMetric(name='my-custom-metric')}`
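The two options can be sketched as follows. The `MetricInput`-style dictionaries come from the docs above; the exact-match scoring logic is a hypothetical example, and with the real SDK you would wrap such logic in a `CustomScoreEvaluationMetric` subclass rather than calling it inline (the subclass interface is not shown here):

```python
# Sketch of the two ways to supply a self-hosted score.
# The exact-match scorer below is a hypothetical example of
# deterministic logic you might run on your own infrastructure.

def exact_match(expected: str, actual: str) -> float:
    """Hypothetical deterministic scorer: 1.0 on a case-insensitive match."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0

# Option 1: pre-calculated score passed directly as a float.
precalculated = {'name': 'my-custom-metric', 'score': 0.85}

# Option 2: compute the score locally at runtime.
# (With the real SDK, this logic would live inside a
# CustomScoreEvaluationMetric subclass executed at runtime.)
dynamic = {'name': 'exact-match', 'score': exact_match('Paris', ' paris ')}
```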
Custom Scores Tutorial
Learn how to implement and use custom scoring metrics with the SDK.
AI Evaluation
An LLM scores responses using a scoring rubric you define. You provide the core evaluation logic (your criteria and rubric) in the `judge_prompt`, then select the necessary data fields (like input, context, etc.) from the Evaluation Parameters list. Galtea dynamically constructs the final prompt by prepending the selected data to the content of your `judge_prompt`, ensuring a consistent structure for the evaluator model.
These non-deterministic metrics are powered by Large Language Models acting as judges. Galtea uses the evaluator models that have performed best in its internal benchmarks and testing, and continuously updates them to maintain high-quality assessments over time.
How It Works
- You write the evaluation criteria (what to check for)
- You define the scoring rubrics (how to score)
- You select which evaluation parameters to include
- Galtea automatically constructs the complete evaluation prompt
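The assembly step above can be illustrated with a short sketch. The idea of prepending selected parameters to the judge prompt comes from the docs; the section labels and layout here are assumptions, not Galtea’s exact internal format:

```python
# Illustrative sketch of how a final evaluator prompt could be
# assembled: the selected evaluation parameters are prepended to
# the user-supplied judge prompt. Labels and layout are assumptions.

def build_evaluator_prompt(judge_prompt: str, params: dict) -> str:
    # Render each selected parameter as a labeled section.
    sections = [f"{name.upper()}:\n{value}" for name, value in params.items()]
    return "\n\n".join(sections + [judge_prompt])

prompt = build_evaluator_prompt(
    "Score the answer from 0 to 1 for factual accuracy.",
    {"input": "What is the capital of France?", "output": "Paris."},
)
print(prompt)
```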
Example Judge Prompt
When designing judge prompts, be specific about your scoring criteria and reference the evaluation parameters explicitly. This ensures consistent and reliable evaluations.
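For instance, a judge prompt for a brand-tone policy check might look like the following. This is an illustrative example, not an official template; the criteria and scale are placeholders you would replace with your own:

```text
You are evaluating whether the response follows our brand tone policy.

Criteria:
- The response is polite and professional.
- The response never promises refunds or discounts.

Scoring rubric:
- 1.0: Fully compliant with both criteria.
- 0.5: Minor tone issues, but no policy violation.
- 0.0: Violates the policy.

Use the INPUT and OUTPUT provided above. Return only the score.
```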
Best Suited For
Custom AI Evaluation metrics are ideal when standard built-in metrics don’t cover your specific needs:
- Behavioral Evaluation — Ensuring model behavior aligns with defined product guidelines or safety constraints
- Policy Adherence — Checking compliance with brand tone, content rules, or moderation policies
- Security Testing — Validating that responses do not cross ethical, privacy, or safety boundaries
- Retrieval Use — Evaluating whether the model made appropriate use of retrieved context
AI-Generated Metrics
You can also generate metrics directly from your Specifications. Select one or more specifications, and the AI produces ready-to-use metric candidates — complete with judge prompt, evaluation parameters, tags, and evaluator model. See AI Metric Generation for the full workflow.
Human Evaluation
Human annotators score responses using the evaluation criteria you define. The metric’s `judge_prompt` serves as the evaluation rubric for annotators, and `evaluation_params` determines which data fields are displayed to the evaluator.
When an evaluation is created for a human-evaluated metric, it enters a `PENDING_HUMAN` status and waits for a qualified human evaluator to provide a score.
Key Features
- User Groups: Link user groups to a metric to control which users can annotate evaluations for it.
- Group Membership: Users in a linked group see pending evaluations for that metric in the Human Evaluations page.
- Claim & Submit: Evaluators can claim pending evaluations to prevent concurrent editing, then submit their score (0–100 in the UI, normalized to 0–1) with an optional reason.
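The UI-to-platform score conversion mentioned above is a straight division by 100; a minimal sketch, with the function name being ours rather than the SDK’s:

```python
def normalize_ui_score(ui_score: int) -> float:
    """Convert a 0-100 UI slider value to the platform's 0-1 scale."""
    if not 0 <= ui_score <= 100:
        raise ValueError("UI score must be between 0 and 100")
    return ui_score / 100

normalize_ui_score(85)  # 0.85
```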
Use Cases
- Subjective quality assessments that require human judgment
- Edge cases where LLM evaluation may be unreliable
- Building gold standard datasets for training custom evaluators
- Compliance and audit requirements that mandate human review
SDK Example
Human Evaluation Tutorial
Step-by-step guide to setting up human evaluation with user groups, from metric creation to annotation.
Related
Evaluation Parameters
Full reference of data fields available to evaluators.
Metrics Overview
Browse all available metrics and understand the two metric families.