What are Metrics?
Metrics define how Galtea scores your product’s outputs during evaluations. Each metric captures one specific quality dimension — factual accuracy, tone, security resilience, or a custom criterion you define.

Metrics are organization-wide and can be reused across multiple products. You can link metrics to Specifications for structured evaluation workflows.
Two Families of Metrics
- Deterministic metrics apply rule-based logic — string matching, numerical checks, bounding box overlap. Results are consistent and reproducible. Built-in examples: BLEU, ROUGE, Text Similarity, Tool Correctness. You can also define your own using the SDK’s `CustomScoreEvaluationMetric` class — see our tutorial.
- Non-deterministic metrics use an LLM-as-a-judge to assess open-ended qualities like factual accuracy, misuse resilience, and task completion. Galtea’s judges are human-aligned and optimized for each evaluation type. You can also create your own AI Evaluation metrics with a custom scoring rubric.
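To illustrate what rule-based scoring means in practice, here is a minimal, self-contained sketch of a deterministic metric in plain Python, using character-level fuzzy matching similar in spirit to the built-in Text Similarity metric. The function name and implementation are our own illustration, not the SDK’s `CustomScoreEvaluationMetric` API:

```python
from difflib import SequenceMatcher

def text_similarity_score(actual_output: str, expected_output: str) -> float:
    """Deterministic score in [0, 1] from character-level fuzzy matching.

    The same inputs always produce the same score, which is what makes
    a rule-based metric reproducible.
    """
    return SequenceMatcher(None, actual_output, expected_output).ratio()

score = text_similarity_score(
    "The capital of France is Paris.",
    "Paris is the capital of France.",
)
```

Because no model is involved, re-running the evaluation on the same outputs always yields the same scores.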
How Metrics Are Evaluated
Every metric has an Evaluation Type that determines how scoring happens:
- Self-Hosted: your own logic runs locally; only the score is uploaded.
- AI Evaluation: an LLM scores responses using a rubric you define.
- Human Evaluation: human annotators score responses using defined criteria.
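For the Self-Hosted type, the key point is the separation of concerns: the scoring logic never leaves your environment. A minimal sketch, assuming a simple exact-match rule (the upload call itself is SDK-specific and omitted here):

```python
def self_hosted_score(actual_output: str, expected_output: str) -> float:
    """Local, self-hosted scoring rule: exact match after trimming whitespace.

    Only the returned float would be sent to the platform;
    the outputs themselves stay in your environment.
    """
    return 1.0 if actual_output.strip() == expected_output.strip() else 0.0

score = self_hosted_score("Paris", "Paris ")  # trailing space trimmed -> 1.0
```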
Available Metrics
The following table lists the default metrics available in the Galtea platform. You can also create custom metrics or generate metrics from specifications.
| Metric | Category | Description |
|---|---|---|
| Factual Accuracy | RAG | Evaluates whether the actual_output factually aligns with the expected_output. |
| Resilience To Noise | RAG | Evaluates whether the output is resilient to noisy input (typos, OCR/ASR errors). |
| Answer Relevancy | RAG | Evaluates how relevant the actual_output is compared to the provided input. |
| Faithfulness | RAG | Evaluates whether the actual_output factually aligns with the retrieval_context. |
| Contextual Precision | RAG | Evaluates whether relevant retrieval_context nodes are ranked higher than irrelevant ones. |
| Contextual Recall | RAG | Evaluates whether the retrieval_context sufficiently covers the expected_output. |
| Contextual Relevancy | RAG | Evaluates the overall relevance of the retrieval_context for a given input. |
| BLEU | Deterministic | Measures n-gram overlap between actual and expected output. |
| ROUGE | Deterministic | Measures longest common subsequence between actual and expected output. |
| METEOR | Deterministic | Aligns words using exact matches, stems, or synonyms. |
| Text Similarity | Deterministic | Quantifies textual resemblance using character-level fuzzy matching. |
| Text Match (deprecated) | Deterministic | Binary match using character-level fuzzy matching with a threshold. Use Text Similarity instead. |
| IOU | Deterministic | Measures spatial overlap between predicted and reference bounding boxes. |
| Spatial Match | Deterministic | Binary evaluation of spatial alignment using IoU score. |
| URL Validation | Deterministic | Checks if all URLs in the response are valid and safe. |
| Tool Correctness | Deterministic | Compares tools used by the agent against expected tools. |
| JSON Field Match | Deterministic | Compares JSON objects field by field, returning the fraction of expected fields that match. |
| Role Adherence | Conversational | Evaluates whether the chatbot adheres to its given role throughout a conversation. |
| Conversation Completeness | Conversational | Evaluates whether the chatbot satisfies user needs throughout a conversation. |
| Conversation Relevancy | Conversational | Evaluates whether the chatbot generates relevant responses throughout a conversation. |
| Knowledge Retention | Conversational | Assesses factual information retention throughout a conversation. |
| User Satisfaction | Conversational | Evaluates user satisfaction with the chatbot interaction. |
| User Objective Accomplished | Conversational | Evaluates whether the chatbot fulfilled the user’s stated objective. |
| Non-Toxic | Security & Safety | Evaluates whether responses are free of toxic language. |
| Unbiased | Security & Safety | Evaluates whether output is free of gender, racial, or political bias. |
| Misuse Resilience | Security & Safety | Evaluates resilience to misuse and alignment with product description. |
| Data Leakage | Security & Safety | Evaluates whether the LLM returns sensitive information. |
| Jailbreak Resilience | Security & Safety | Evaluates resistance to adversarial prompt manipulation. |
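To make one of the deterministic entries concrete, here is a sketch of the logic the JSON Field Match description implies: the fraction of expected fields whose values match in the actual output. The function name and matching rules are illustrative, not the platform’s implementation:

```python
import json

def json_field_match(actual_json: str, expected_json: str) -> float:
    """Fraction of expected top-level fields whose values match the actual output."""
    actual = json.loads(actual_json)
    expected = json.loads(expected_json)
    if not expected:
        return 1.0  # nothing expected, trivially satisfied
    matched = sum(
        1 for key, value in expected.items()
        if key in actual and actual[key] == value
    )
    return matched / len(expected)

score = json_field_match(
    '{"name": "Ada", "age": 36, "city": "London"}',
    '{"name": "Ada", "age": 37}',
)
# "name" matches, "age" does not -> 0.5
```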
SDK Integration
Metrics Service SDK
Manage metrics using the Python SDK
Metric Properties
- Name: The name of the metric. Example: “Factual Accuracy”
- Description: A brief description of what the metric evaluates.
- Model: The model used to score the metric. Does not apply to deterministic metrics. Example: “GPT-4.1”
- Tags: Tags for categorization. Example: [“RAG”, “Conversational”]
- Evaluation Type: How outputs are scored: AI Evaluation, Human Evaluation, or Self-Hosted. See Evaluation Types for full details.
- User Groups: User groups for Human Evaluation metrics. Controls which annotators can score evaluations for this metric.
- Evaluation Parameters: The data fields available to the evaluator. See Evaluation Parameters for the full reference. Not applicable for Self-Hosted metrics.
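Put together, a metric definition might look like the following sketch. The field names here are our shorthand for the properties above, not necessarily the SDK’s exact parameter names:

```python
# Illustrative only: these keys mirror the documented metric properties,
# not necessarily the SDK's exact parameter names.
metric_definition = {
    "name": "Factual Accuracy",
    "description": "Evaluates whether the actual_output factually aligns "
                   "with the expected_output.",
    "model": "GPT-4.1",  # not applicable to deterministic metrics
    "tags": ["RAG"],
    "evaluation_type": "AI Evaluation",  # or "Human Evaluation" / "Self-Hosted"
    "evaluation_parameters": ["input", "actual_output", "expected_output"],
}
```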
Related
Evaluation
The assessment of a product’s outputs against a specific metric’s criteria
Create Custom Metrics
Write your own judge prompt and scoring rubric.