What are Metric Types?
Metric types in Galtea define the specific criteria and methods used to evaluate the performance of your product. They determine how outputs are scored during evaluation tasks, ensuring consistent and meaningful assessment. Metric types are organization-wide and can be reused across multiple products.
Conceptual categories
At Galtea, we use two types of metrics to evaluate large language model (LLM) outputs: deterministic and non-deterministic.
- Deterministic metrics are rule-based and computed using strict logic, such as SQL queries or structural checks. Their results are consistent and reproducible. Common examples of deterministic checks include:
- Answer format validation (e.g., ensuring the output is a valid JSON or follows a specific template)
- Presence of required fields (e.g., checking if all necessary information is included in the response)
- Exact string matches (e.g., verifying if a specific keyword or phrase is present)
- Numerical range checks (e.g., confirming a value falls within an acceptable range)
- Boolean condition checks (e.g., ensuring a specific condition evaluates to true or false as expected)
The platform cannot automatically evaluate deterministic metrics, as it lacks the necessary information. However, by extending the `CustomScoreMetric` base class in the SDK and passing an instance of it to the evaluation task creation method, you can execute your custom logic locally and have the scores seamlessly uploaded to the platform for visualization. See more in our example.
- Non-deterministic metrics are powered by the G-Eval evaluation framework, an LLM-as-a-judge methodology originally introduced in the paper “NLG Evaluation using GPT-4 with Better Human Alignment”. G-Eval follows a structured two-step process (chain-of-thought generation and score computation) to rate outputs. These metrics assess aspects such as factuality, coherence, helpfulness, and tone.
Galtea metrics are versatile and can be applied to any type of output, including strings, numbers, and boolean values.
- For deterministic metrics: Evaluating numerical or boolean outputs is straightforward. For example, you can check if a returned numerical value is within a valid range or if a boolean output matches a specific condition (see the sketch after this list).
- For non-deterministic metrics: While typically used for open-ended text, they can also assess the reasoning or justification behind a numerical or boolean value when needed. For instance, verifying whether a model’s numeric prediction aligns with the provided context or input data.
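Building on the range-check example above, here is a minimal, illustrative sketch of such a deterministic check in plain Python. In practice you would wrap logic like this in a `CustomScoreMetric` subclass so the score can be uploaded to the platform; the base-class interface itself is documented in the SDK and is not shown here.

```python
# Illustrative deterministic check (plain Python, not tied to a specific SDK interface).
# In practice, logic like this would live inside a CustomScoreMetric subclass.

def numeric_range_score(actual_output: str, low: float = 0.0, high: float = 100.0) -> float:
    """Return 1.0 if the model's output parses as a number within [low, high], else 0.0."""
    try:
        value = float(actual_output.strip())
    except ValueError:
        return 0.0  # output is not numeric at all, so the check fails
    return 1.0 if low <= value <= high else 0.0


print(numeric_range_score("42.5"))   # 1.0 (inside the range)
print(numeric_range_score("150"))    # 0.0 (outside the range)
print(numeric_range_score("cheap"))  # 0.0 (not a number)
```

Because the logic is strict and local, the same input always yields the same score, which is exactly what makes deterministic metrics reproducible.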
List of metrics available in the Galtea Platform
The following table provides a summary of the default metrics available in the Galtea platform. You can also create custom metrics tailored to your specific needs.
Metric | Category | Description |
---|---|---|
Factual Accuracy | RAG | Measures the quality of your RAG pipeline’s generator by evaluating whether the actual_output factually aligns with the expected_output. |
Resilience To Noise | RAG | Evaluates whether the generated output is resilient to noisy input, such as typos, OCR/ASR errors, and distracting content. |
Answer Relevancy | RAG | Measures the quality of your RAG pipeline’s generator by evaluating how relevant the actual_output of your LLM application is compared to the provided input. |
Faithfulness | RAG | Measures the quality of your RAG pipeline’s generator by evaluating whether the actual_output factually aligns with the contents of your retrieval_context. |
Contextual Precision | RAG | Measures your RAG pipeline’s retriever by evaluating whether nodes in your retrieval_context that are relevant to the given input are ranked higher than irrelevant ones. |
Contextual Recall | RAG | Measures the quality of your RAG pipeline’s retriever by evaluating the extent to which the retrieval_context aligns with the expected_output. |
Contextual Relevancy | RAG | Measures the quality of your RAG pipeline’s retriever by evaluating the overall relevance of the information presented in your retrieval_context for a given input. |
Custom Judge | RAG | A flexible LLM-as-a-judge evaluation metric designed to assess model outputs based on customizable product rubrics, domain-specific behavior expectations, and grounded context. |
BLEU | Deterministic | Measures how many n-grams in the actual output overlap with those in a set of expected outputs. |
ROUGE | Deterministic | Evaluates automatic summarization by measuring the longest common subsequence that preserves the word order between actual output and expected output summaries. |
METEOR | Deterministic | Evaluates translation, summarization, and paraphrasing by aligning words using exact matches, stems, or synonyms. |
Text Similarity | Deterministic | Quantifies the overall textual resemblance between a generated summary and a reference summary by using character-level fuzzy matching. |
Text Match | Deterministic | Determines whether generated text matches a reference with high character-level similarity using fuzzy matching, returning a binary outcome based on a threshold. |
IOU | Deterministic | Measures the spatial overlap between a predicted bounding box and one or more reference boxes to quantify alignment in object detection and layout tasks. |
Spatial Match | Deterministic | Performs a binary evaluation of the spatial alignment between a predicted bounding box and reference boxes using the best Intersection over Union (IoU) score to return a pass/fail signal. |
Role Adherence | Conversational | Determines whether your LLM chatbot is able to adhere to its given role throughout a conversation. |
Conversation Completeness | Conversational | Determines whether your LLM chatbot is able to complete an end-to-end conversation by satisfying user needs throughout a conversation. |
Conversation Relevancy | Conversational | Determines whether your LLM chatbot is able to consistently generate relevant responses throughout a conversation. |
Knowledge Retention | Conversational | Assesses the ability of your LLM chatbot to retain factual information presented throughout a conversation. |
Non-Toxic | Red Teaming | Determines whether the responses of your LLM-based product are free of toxic language. |
Unbiased | Red Teaming | Determines whether your LLM output is free of gender, racial, or political bias. |
Misuse Resilience | Red Teaming | Evaluates whether the generated output is resilient to misuse and remains aligned with the product description. |
Data Leakage | Red Teaming | Evaluates whether the LLM returns content that may include sensitive information. |
Jailbreak Resilience | Red Teaming | A previous version of the Jailbreak Resilience v2 Metric Type. |
Jailbreak Resilience v2 | Red Teaming | Evaluates the ability of an LLM-based product to resist attempts at breaking or manipulating its intended behavior. |
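To make the deterministic text metrics above more concrete, the following is a small conceptual sketch of a threshold-based fuzzy match, in the spirit of the Text Match metric. It is illustrative only and not the platform’s actual implementation.

```python
# Conceptual sketch of a threshold-based fuzzy text match (not Galtea's implementation).
from difflib import SequenceMatcher


def text_match(generated: str, reference: str, threshold: float = 0.9) -> bool:
    """Binary outcome: character-level similarity ratio compared against a threshold."""
    ratio = SequenceMatcher(None, generated.strip().lower(), reference.strip().lower()).ratio()
    return ratio >= threshold


print(text_match("Paris is the capital of France.", "Paris is the capital of France"))  # True
print(text_match("I cannot help with that.", "Paris is the capital of France."))        # False
```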
Metric Type Properties
- Name: The name of the metric type. Example: “Factual Accuracy”.
- Description: A brief description of what the metric type evaluates.
- Documentation URL: A URL pointing to more detailed documentation about the metric type.
- Evaluator model: The name of the model used to evaluate the metric. This model assesses the quality of outputs against the metric’s criteria. Example: “GPT-4.1”. This does not apply to deterministic metrics, since they do not require a model for evaluation.
- Tags: Tags that allow the metric type to be easily identified and categorized. Example: [“RAG”, “Conversational”].
- Validation method: The method used to evaluate the metric. For non-deterministic metrics this is G-Eval (LLM-as-a-judge); with “Custom”, no validation method is applied, meaning the metric type is not automatically evaluated by Galtea and is intended for deterministic metrics.
- Evaluation parameters: The parameters the evaluator can reference, listed in the table below. These must not be provided if the validation method is set to “Custom”.
Parameter | Description | Availability |
---|---|---|
input | The prompt or query sent to the model. (Always required in the list). | All metrics |
actual_output | The actual output generated by the model. | All metrics |
expected_output | The ideal answer for the given input. | All metrics |
context | Additional background information provided to the model alongside the input. | All metrics |
retrieval_context | The context retrieved by your RAG system before sending the user query to your LLM. | All metrics |
product_description | The description of the product. | Custom Judge only |
product_capabilities | The capabilities of the product. | Custom Judge only |
product_inabilities | The product’s known inabilities or restrictions. | Custom Judge only |
product_security_boundaries | The security boundaries of the product. | Custom Judge only |
The `input` parameter (lowercase) must always be included in this list. You can directly reference these parameters in your criteria or evaluation steps. For example: “Evaluate if the Actual Output contains factually correct information that aligns with verified sources in the Retrieval Context.”
To ensure accurate evaluation results, include only parameters that you’ve explicitly referenced in your criteria or evaluation steps.
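As a rough illustration of how the parameter list and the evaluation text relate, the snippet below shows both as plain Python values. How they are actually passed to the platform depends on the SDK, so treat the variable names as placeholders.

```python
# Illustrative only: evaluation parameters and a criteria string that references them.
evaluation_params = ["input", "actual_output", "retrieval_context"]  # "input" is always required

criteria = (
    "Evaluate if the Actual Output contains factually correct information "
    "that aligns with verified sources in the Retrieval Context."
)
# Only parameters that the criteria (or evaluation steps) explicitly reference are listed,
# which keeps the evaluation focused and the results accurate.
```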
G-Eval
The non-deterministic metrics, powered by LLM-as-a-judge (G-Eval), utilize models that have demonstrated the best performance in our internal benchmarks and testing. We are committed to continuously evolving and improving these evaluator models to ensure the highest quality assessments over time.
Evaluation Criteria vs. Evaluation Steps
When using non-deterministic metrics, you can choose between two approaches: evaluation criteria and evaluation steps. Both methods are designed to assess the quality of LLM outputs, but they differ in their focus and structure.
- Evaluation Criteria: What matters in a response, defining the high-level qualities or standards.
- Evaluation Steps: How to measure a response’s quality, providing specific assessment actions.
Evaluation Criteria
Evaluation criteria are high-level qualities or standards that define what makes a response good or bad. They outline fundamental aspects that should be assessed without specifying exactly how to measure them.
Purpose
- Ensure evaluations align with user needs and expectations
- Establish a framework for selecting appropriate metrics
- Guide AI model development by focusing on specific areas for improvement
Examples
- Accuracy: The response must be factually correct and based on verified knowledge from the “context”
- Professionalism: The language should be respectful, clear, and aligned with professional communication standards
- Relevance: The response must directly address the user’s “input” without unnecessary information
- Conciseness: The summary should be brief but still capture essential information
- Completeness: All key details from the original document (“context”) must be included
- Fluency: The summary should be grammatically correct and easy to read
Evaluation criteria define what matters in a response, serving as the foundation for meaningful assessment.
Evaluation Steps
Evaluation steps are the specific actions taken to measure how well a response meets the evaluation criteria. These steps break down the assessment into concrete, structured processes that reference evaluation parameters.
Purpose
- Provide a systematic method for evaluating responses
- Reduce subjectivity by defining clear measurement techniques
- Allow customization based on the priorities of an AI system
- Explicitly reference evaluation parameters to ensure consistent assessment
Example: Accuracy Steps
- Check if the Actual Output contains facts that align with verified sources provided in the Retrieval Context
- Identify any contradictions between the Actual Output and established knowledge in the Context
- Compare the Actual Output against the Expected Output for factual consistency
- Penalize statements in the Actual Output that introduce speculation without citing a credible source
Example: Completeness Steps
- Compare the key points of the original document in Context with the Actual Output
- Identify any missing crucial information from the Context that would affect the meaning
- Compare the Actual Output against the Expected Output for coverage of essential points
- Score the Actual Output based on how many critical details from the Input and Context are retained
Evaluation steps define how to measure a response’s quality based on the evaluation criteria, making explicit reference to specific evaluation parameters like Input, Actual Output, Expected Output, Retrieval Context, and Context.
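For instance, the accuracy steps above can be written down as an explicit list of strings. The snippet is plain Python; how the list is attached to a metric type depends on the SDK.

```python
# Explicit evaluation steps, mirroring the "Accuracy Steps" example above.
evaluation_steps = [
    "Check if the Actual Output contains facts that align with verified sources in the Retrieval Context.",
    "Identify any contradictions between the Actual Output and established knowledge in the Context.",
    "Compare the Actual Output against the Expected Output for factual consistency.",
    "Penalize statements in the Actual Output that introduce speculation without citing a credible source.",
]
```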
When you supply explicit `evaluation_steps`, G-Eval skips auto-generation, yielding more reliable, reproducible scores that are ideal for production.
Comparing Evaluation Approaches
The following table highlights the key differences between evaluation criteria and evaluation steps:
Aspect | Evaluation Criteria | Evaluation Steps |
---|---|---|
Definition | High-level qualities that define what makes a response good or bad | Step-by-step actions to measure a response’s quality |
Purpose | Establish broad goals for evaluation | Provide a systematic method to assess responses |
Focus | What should be measured | How to measure it |
Examples | Accuracy, conciseness, relevance, fluency | Compare facts, check for contradictions, assess completeness |
Flexibility | General principles that apply across many use cases | Specific steps that vary depending on the system |
Evaluation Details
Two-step process:
- Chain-of-Thought Generation (optional): If you do not provide `evaluation_steps`, G-Eval auto-generates them from your `criteria` via a chain-of-thought prompt, making setup fast but introducing some variability.
- Score Computation: G-Eval prompts the LLM with the steps and test-case parameters (`input`, `actual_output`, etc.) to rate on a 1–5 scale, then normalizes the result to 0–1 via token-probability weighting.
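As a hedged sketch of that score-computation step, the arithmetic below shows how a probability-weighted 1–5 rating could be normalized to the 0–1 range. The probability values are made up purely for illustration.

```python
# Hedged sketch of G-Eval-style score normalization (illustrative, not Galtea's internal code).
# Assume the judge LLM assigns these probabilities to the integer scores 1-5 (made-up values).
score_token_probs = {1: 0.02, 2: 0.03, 3: 0.10, 4: 0.45, 5: 0.40}

# Probability-weighted expected score on the 1-5 scale.
weighted = sum(score * prob for score, prob in score_token_probs.items())

# Normalize from the [1, 5] range to [0, 1].
normalized = (weighted - 1) / (5 - 1)

print(f"weighted={weighted:.3f}, normalized={normalized:.3f}")  # roughly 4.18 and 0.795
```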
Custom Judge
You can also provide a custom judge by passing a prompt that gives you even finer control over how the evaluation is performed by our evaluator.
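For illustration, a custom judge prompt might look something like the following. The wording is purely illustrative, and the exact way the prompt is supplied to the evaluator depends on the platform and SDK.

```python
# Illustrative custom judge prompt (example wording, not a Galtea-provided template).
custom_judge_prompt = """
You are a strict evaluator for a customer-support assistant.
Given the Input, Actual Output, and Retrieval Context, rate the response from 1 to 5:
- 5: fully grounded in the Retrieval Context and within the product's capabilities.
- 3: mostly correct, but includes unsupported or out-of-scope details.
- 1: contradicts the Retrieval Context or violates the product's security boundaries.
Return only the numeric score.
"""
```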
SDK Integration
The Galtea SDK allows you to create, view, and manage metric types programmatically.
Metrics Service SDK: Manage metric types using the Python SDK.
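For orientation only, programmatic creation might look roughly like the sketch below. The class name, method name, and parameters (`Galtea`, `metrics.create`, `evaluation_steps`, and so on) are assumptions rather than confirmed signatures, so check the Metrics Service SDK page for the actual API.

```python
# Hedged sketch only: names and parameters are assumptions, not the verified SDK signature.
# Refer to the Metrics Service SDK documentation for the real API.
from galtea import Galtea  # assumed import path

galtea = Galtea(api_key="YOUR_API_KEY")

metric_type = galtea.metrics.create(  # assumed service and method name
    name="factual-accuracy-v1",
    evaluation_steps=[
        "Check if the Actual Output contains facts that align with verified sources in the Retrieval Context.",
        "Compare the Actual Output against the Expected Output for factual consistency.",
    ],
    evaluation_params=["input", "actual_output", "expected_output", "retrieval_context"],
    tags=["RAG"],
)
print(metric_type)
```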