What are Metrics?

Metrics define how Galtea scores your product’s outputs during evaluations. Each metric captures one specific quality dimension — factual accuracy, tone, security resilience, or a custom criterion you define.
Metrics are organization-wide and can be reused across multiple products. You can link metrics to Specifications for structured evaluation workflows.
You can create, view and manage your metrics on the Galtea dashboard or programmatically using the Galtea SDK.

Two Families of Metrics

  • Deterministic metrics apply rule-based logic — string matching, numerical checks, bounding box overlap. Results are consistent and reproducible. Built-in examples: BLEU, ROUGE, Text Similarity, Tool Correctness. You can also define your own using the SDK’s CustomScoreEvaluationMetric class — see our tutorial.
  • Non-deterministic metrics use an LLM-as-a-judge to assess open-ended qualities like factual accuracy, misuse resilience, and task completion. Galtea’s judges are human-aligned and optimized for each evaluation type. You can also create your own AI Evaluation metrics with a custom scoring rubric.
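To make the deterministic family concrete, here is a minimal, standalone sketch of the kind of rule-based scoring logic such a metric encapsulates: character-level fuzzy matching, as used by the built-in Text Similarity metric. The exact `CustomScoreEvaluationMetric` interface is covered in the SDK tutorial; this sketch keeps the scoring logic as a plain function so it runs on its own.

```python
from difflib import SequenceMatcher


def text_similarity_score(actual: str, expected: str) -> float:
    """Deterministic score in [0, 1] from character-level fuzzy matching.

    The same inputs always produce the same score, which is what makes
    this a deterministic metric: no model, no randomness, fully reproducible.
    """
    return SequenceMatcher(None, actual, expected).ratio()
```

Identical strings score 1.0, completely different strings score 0.0, and near-misses land in between, so a threshold can turn this into a binary pass/fail check.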

How Metrics Are Evaluated

Every metric has an Evaluation Type that determines how scoring happens:

Self-Hosted

Your own logic runs locally; only the score is uploaded.

AI Evaluation

An LLM scores responses using a rubric you define.

Human Evaluation

Human annotators score responses using defined criteria.

Available Metrics

The following table lists the default metrics available in the Galtea platform. You can also create custom metrics or generate metrics from specifications.
| Metric | Category | Description |
| --- | --- | --- |
| Factual Accuracy | RAG | Evaluates whether the actual_output factually aligns with the expected_output. |
| Resilience To Noise | RAG | Evaluates whether the output is resilient to noisy input (typos, OCR/ASR errors). |
| Answer Relevancy | RAG | Evaluates how relevant the actual_output is compared to the provided input. |
| Faithfulness | RAG | Evaluates whether the actual_output factually aligns with the retrieval_context. |
| Contextual Precision | RAG | Evaluates whether relevant retrieval_context nodes are ranked higher than irrelevant ones. |
| Contextual Recall | RAG | Evaluates whether the retrieval_context sufficiently covers the expected_output. |
| Contextual Relevancy | RAG | Evaluates the overall relevance of the retrieval_context for a given input. |
| BLEU | Deterministic | Measures n-gram overlap between actual and expected output. |
| ROUGE | Deterministic | Measures longest common subsequence between actual and expected output. |
| METEOR | Deterministic | Aligns words using exact matches, stems, or synonyms. |
| Text Similarity | Deterministic | Quantifies textual resemblance using character-level fuzzy matching. |
| Text Match (deprecated) | Deterministic | Binary match using character-level fuzzy matching with a threshold. Use Text Similarity instead. |
| IOU | Deterministic | Measures spatial overlap between predicted and reference bounding boxes. |
| Spatial Match | Deterministic | Binary evaluation of spatial alignment using IoU score. |
| URL Validation | Deterministic | Checks if all URLs in the response are valid and safe. |
| Tool Correctness | Deterministic | Compares tools used by the agent against expected tools. |
| JSON Field Match | Deterministic | Compares JSON objects field by field, returning the fraction of expected fields that match. |
| Role Adherence | Conversational | Evaluates whether the chatbot adheres to its given role throughout a conversation. |
| Conversation Completeness | Conversational | Evaluates whether the chatbot satisfies user needs throughout a conversation. |
| Conversation Relevancy | Conversational | Evaluates whether the chatbot generates relevant responses throughout a conversation. |
| Knowledge Retention | Conversational | Assesses factual information retention throughout a conversation. |
| User Satisfaction | Conversational | Evaluates user satisfaction with the chatbot interaction. |
| User Objective Accomplished | Conversational | Evaluates whether the chatbot fulfilled the user’s stated objective. |
| Non-Toxic | Security & Safety | Evaluates whether responses are free of toxic language. |
| Unbiased | Security & Safety | Evaluates whether output is free of gender, racial, or political bias. |
| Misuse Resilience | Security & Safety | Evaluates resilience to misuse and alignment with product description. |
| Data Leakage | Security & Safety | Evaluates whether the LLM returns sensitive information. |
| Jailbreak Resilience | Security & Safety | Evaluates resistance to adversarial prompt manipulation. |
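The deterministic spatial metrics in the table above (IOU and Spatial Match) boil down to a short, reproducible computation. A minimal sketch, using `(x_min, y_min, x_max, y_max)` boxes; the 0.5 threshold is an illustrative assumption, not necessarily Galtea's default:

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap extent along each axis, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def spatial_match(box_a, box_b, threshold=0.5):
    """Binary spatial alignment: True when IoU meets the threshold."""
    return iou(box_a, box_b) >= threshold
```

Because both functions are pure arithmetic, the same predicted and reference boxes always yield the same score, which is the defining property of the deterministic family.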

SDK Integration

Metrics Service SDK

Manage metrics using the Python SDK

Metric Properties

  • Name (Text, required) — The name of the metric. Example: “Factual Accuracy”
  • Description (Text) — A brief description of what the metric evaluates.
  • Evaluator Model (Text) — The model used to score the metric. Does not apply to deterministic metrics. Example: “GPT-4.1”
  • Tags (Text List) — Tags for categorization. Example: [“RAG”, “Conversational”]
  • Evaluation Type (Enum, required) — How outputs are scored: AI Evaluation, Human Evaluation, or Self-Hosted. See Evaluation Types for full details.
  • User Groups (List[String]) — User groups for Human Evaluation metrics. Controls which annotators can score evaluations for this metric.
  • Evaluation Parameters (List[Enum], required) — The data fields available to the evaluator. See Evaluation Parameters for the full reference. Not applicable for Self-Hosted metrics.
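The properties above can be pictured as a metric-creation payload. A minimal sketch, assuming a dict whose keys mirror the property names (the SDK's actual method signature may differ; see the Metrics Service SDK page):

```python
REQUIRED_FIELDS = {"name", "evaluation_type"}


def validate_metric_payload(payload: dict) -> list:
    """Return a list of problems; an empty list means the payload is usable.

    Mirrors the rules above: name and evaluation_type are always required,
    and evaluation_parameters is required unless the metric is Self-Hosted.
    """
    problems = [f"missing required field: {f}"
                for f in sorted(REQUIRED_FIELDS - payload.keys())]
    if (payload.get("evaluation_type") != "Self-Hosted"
            and not payload.get("evaluation_parameters")):
        problems.append("evaluation_parameters is required"
                        " for non-Self-Hosted metrics")
    return problems


metric = {
    "name": "Factual Accuracy",
    "description": "Checks that actual_output aligns with expected_output.",
    "evaluator_model": "GPT-4.1",
    "tags": ["RAG"],
    "evaluation_type": "AI Evaluation",
    "evaluation_parameters": ["input", "actual_output", "expected_output"],
}
```

Field names here are taken from the property table; the validation helper is purely illustrative and not part of the SDK.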

Evaluation

The assessment of a product’s output against a specific metric’s criteria

Create Custom Metrics

Write your own judge prompt and scoring rubric.