Metric

What are Metrics?

Metrics in Galtea define the specific criteria and methods used to evaluate the performance of your product. They determine how outputs are scored during evaluations, ensuring consistent and meaningful assessment.

Metrics are organization-wide and can be reused across multiple products.

You can create, view and manage your metrics on the Galtea dashboard or programmatically using the Galtea SDK.

Conceptual categories

At Galtea, we use two types of metrics to evaluate large language model (LLM) outputs: deterministic and non-deterministic.

Deterministic metrics are rule-based and computed using strict logic, such as SQL queries or structural checks. These include things like answer format validation, presence of required fields, or exact string matches. Their results are consistent and reproducible. Common examples of deterministic checks include:
- Answer format validation (e.g., ensuring the output is a valid JSON or follows a specific template)
- Presence of required fields (e.g., checking if all necessary information is included in the response)
- Exact string matches (e.g., verifying if a specific keyword or phrase is present)
- Numerical range checks (e.g., confirming a value falls within an acceptable range)
- Boolean condition checks (e.g., ensuring a specific condition evaluates to true or false as expected)
Currently the deterministic metrics accessible directly from the platform are BLEU, ROUGE, METEOR, Text Similarity, Text Match, IOU and Spatial Match and URL Validation. Additionally you can define your own custom deterministic metrics. In order to do that, you can implement a class that inherits from the CustomScoreEvaluationMetric base class and pass an instance of it to the evaluation creation method. See more in our example.
The platform cannot automatically evaluate deterministic metrics, as it lacks the necessary information. However, by using the CustomScoreEvaluationMetric class in the SDK, you can execute your custom logic locally and have the scores seamlessly uploaded to the platform for visualization.
Non-deterministic metrics are powered by the LLM-as-a-judge methodology. The tested and deployed Galtea judges are human-aligned and optimized for the type of evaluation they are designed for. These metrics assess aspects like factual accuracy, misuse resilience, and correct task completion. Additionally you can create your own metrics using a template from the platform, assessing any aspect of your AI product you desire.

Galtea metrics are versatile and can be applied to any type of output, including strings, numbers, and boolean values.

For deterministic metrics: Evaluating numerical or boolean outputs is straightforward. For example, you can check if a returned numerical value is within a valid range or if a boolean output matches a specific condition.
For non-deterministic metrics: While typically used for open-ended text, they can also assess the reasoning or justification behind a numerical or boolean value when needed. For instance, verifying whether a model’s numeric prediction aligns with the provided context or input data.

List of metrics available in the Galtea Platform

The following table provides a summary of the default metrics available in the Galtea platform. You can also create custom metrics tailored to your specific needs.

Metric	Category	Description
Factual Accuracy	RAG	Measures the quality of your RAG pipeline’s generator by evaluating whether the actual_output factually aligns with the expected_output.
Resilience To Noise	RAG	Evaluates whether the generated output is resilient to noisy input, such as typos, OCR/ASR errors, and distracting content.
Answer Relevancy	RAG	Measures the quality of your RAG pipeline’s generator by evaluating how relevant the actual_output of your LLM application is compared to the provided input.
Faithfulness	RAG	Measures the quality of your RAG pipeline’s generator by evaluating whether the actual_output factually aligns with the contents of your retrieval_context.
Contextual Precision	RAG	Measures your RAG pipeline’s retriever by evaluating whether nodes in your retrieval_context that are relevant to the given input are ranked higher than irrelevant ones.
Contextual Recall	RAG	Measures the quality of your RAG pipeline’s retriever by evaluating the extent of which the retrieval_context aligns with the expected_output.
Contextual Relevancy	RAG	Measures the quality of your RAG pipeline’s retriever by evaluating the overall relevance of the information presented in your retrieval_context for a given input.
BLEU	Deterministic	Measures how many n-grams in the actual output overlap with those in a set of expected output.
ROUGE	Deterministic	Evaluates automatic summarization by measuring the longest common subsequence that preserves the word order between actual output and expected output summaries.
METEOR	Deterministic	Evaluates translation, summarization, and paraphrasing by aligning words using exact matches, stems, or synonyms.
Text Similarity	Deterministic	Quantifies the overall textual resemblance between a generated summary and a reference summary by using character-level fuzzy matching.
Text Match	Deterministic	Determines whether generated text matches a reference with high character-level similarity using fuzzy matching, returning a binary outcome based on a threshold.
IOU	Deterministic	Measures the spatial overlap between a predicted bounding box and one or more reference boxes to quantify alignment in object detection and layout tasks.
Spatial Match	Deterministic	Performs a binary evaluation of the spatial alignment between a predicted bounding box and reference boxes using the best Intersection over Union (IoU) score to return a pass/fail signal.
URL Validation	Deterministic	Performs a binary evaluation to check if all the URLs present in the model response are valid and safe.
Role Adherence	Conversational	Determines whether your LLM chatbot is able to adhere to its given role throughout a conversation.
Conversation Completeness	Conversational	Determines whether your LLM chatbot is able to complete an end-to-end conversation by satisfying user needs throughout a conversation.
Conversation Relevancy	Conversational	Determines whether your LLM chatbot is able to consistently generate relevant responses throughout a conversation.
Knowledge Retention	Conversational	Assesses the ability of your LLM chatbot to retain factual information presented throughout a conversation.
User Satisfaction	Conversational	Evaluates how satisfied the user was with the chatbot interaction, focusing on efficiency and user sentiment.
User Objective Accomplished	Conversational	Evaluates whether the chatbot successfully and correctly fulfilled the user’s stated objective, optionally verifying against an expected_output.
Non-Toxic	Red Teaming	Determines whether the responses of your LLM based product responds are free of toxic language.
Unbiased	Red Teaming	Determine whether your LLM output is free of gender, racial, or political bias.
Misuse Resilience	Red Teaming	Evaluates whether the generated output is resilient to misuse and remains aligned with the product description.
Data Leakage	Red Teaming	Evaluates whether the LLM returns content that may include sensitive information.
Jailbreak Resilience	Red Teaming	Evaluates the ability of an LLM-based product to resist attempts at breaking or manipulating its intended behavior.

Metric Properties

Name

Text

required

The name of the metric. Example: “Factual Accuracy”

Description

Text

A brief description of what the metric evaluates.

Documentation URL

URL

A URL pointing to more detailed documentation about the metric.

Legacy At

Timestamp

A timestamp indicating when the metric was marked as legacy or deprecated. This field will be null for active metrics.

Evaluator Model

Text

The name of the model used to evaluate the metric. This model will be used to assess the quality of the outputs based on the metric’s criteria. Example: “GPT-4.1”.

It does not apply to deterministic metrics since it does not require a model for evaluation.

Parameter	Description	Availability
input	The prompt or query sent to the model.	Quality Testing, Red Teaming, and User Scenarios (Partial Prompt only)
actual_output	The actual output generated by the model.	Quality Testing, Red Teaming, and User Scenarios (Partial Prompt only)
expected_output	The ideal answer for the given input.	Quality Testing and Red Teaming
context	Additional background information provided to the model alongside the input.	All metrics
retrieval_context	The context retrieved by your RAG system before sending the user query to your LLM.	Quality Testing, Red Teaming, and User Scenarios (Partial Prompt only)
product_description	The description of the product.	All metrics
product_capabilities	The capabilities of the product.	All metrics
product_inabilities	The product’s known inabilities or restrictions.	All metrics
product_security_boundaries	The security boundaries of the product.	All metrics
user_persona	Information about the user interacting with the agent.	Scenarios tests
goal	The user’s objective in the conversation.	Scenarios tests
scenario	The context or situation for the conversation.	Scenarios tests
stopping_criterias	List of criteria that define when a conversation should end.	Scenarios tests
conversation_turns	All turns in a conversation, including user and assistant messages.	User Scenarios tests (Full Prompt only)

Creating Custom Metrics via Partial Prompt (LLM)

The non-deterministic metrics, powered by Large Language Models that act as Judges, utilize models that have demonstrated the best performance in our internal benchmarks and testing. We are committed to continuously evolving and improving these evaluator models to ensure the highest quality assessments over time.

Custom Judge metrics use custom judge prompts to evaluate LLM outputs based on your specific requirements. The Partial Prompt method simplifies metric creation by focusing on the core evaluation logic.

How Partial Prompt (LLM) Work

Partial Prompt (LLM) is a simplified approach where you provide only the core evaluation criteria and scoring rubrics. Unlike Full Prompt where you write the complete evaluation template with placeholders like {input} and {actual_output}, Partial Prompt lets Galtea automatically construct the final prompt by combining your criteria with the selected evaluation parameters. The process:

You write the evaluation criteria (what to check for)
You define the scoring rubrics (how to score)
You select which evaluation parameters to include from the list below
Galtea automatically constructs the complete evaluation prompt

Evaluation Parameters

When creating a Partial Prompt metric, you select which of these parameters are relevant for your evaluation:

input - The original user input or prompt
actual_output - The response generated by your model
expected_output - The ideal or reference response
context - Additional background information
retrieval_context - RAG-retrieved context
product_description - Your product’s description
product_capabilities - What your product can do
product_inabilities - What your product cannot or should not do
product_security_boundaries - Security restrictions
user_persona - Information about the user interacting with the agent
goal - The user’s goal
scenario - The scenario in which the user is operating
stopping_criterias - List of criteria that define when the conversation should end

Example Judge Prompt

Here’s an example of what you provide for a Partial Prompt metric:

judge_prompt = """
**Evaluation Criteria:**
Check if the ACTUAL_OUTPUT is good by comparing it to what was expected. Focus on:
1. Factual accuracy and correctness
2. Completeness of the ACTUAL_OUTPUT, regarding the user INPUT
3. Appropriate use of provided CONTEXT information to answer the user INPUT
4. Overall helpfulness and relevance to the user INPUT

**Rubric:**
Score 1 (Good): The ACTUAL_OUTPUT is accurate, complete, uses information properly, and truly helps the user.
Score 0 (Bad): The ACTUAL_OUTPUT has major errors, missing parts, ignores important info, or doesn't help the user.
"""

Creating Custom Metrics via Full Prompt (LLM)

Custom Judge metrics use custom judge prompts to evaluate LLM outputs based on your specific requirements. The primary way to create these metrics is by providing a judge_prompt that defines the evaluation logic.

How Full Prompt (LLM) Work

Full Prompt (LLM) are templates that tell the evaluator model how to assess your outputs with the highest level of customization possible. You can include placeholders for various parameters that will be automatically filled in during evaluation:

{input} - The original user input or prompt
{actual_output} - The response generated by your model
{expected_output} - The ideal or reference response
{context} - Additional background information
{retrieval_context} - RAG-retrieved context
{product_description} - Your product’s description
{product_capabilities} - What your product can do
{product_inabilities} - What your product cannot or should not do
{product_security_boundaries} - Security restrictions
{user_persona} - Information about the user interacting with the agent
{goal} - The user’s goal
{scenario} - The scenario in which the user is operating
{stopping_criterias} - List of criteria that define when the conversation should end
{conversation_turns} - All turns in the conversation

Evaluation Process

Two-step Galtea judge process:

Assessment: The evaluator model analyzes the inputs according to your judge prompt
Score Computation: The model assigns a score based on the given scale of the rubrics, which is then normalized to 0–1 via token-probability weighting

Example Judge Prompt

judge_prompt = """
You are an expert evaluator. Evaluate the factual accuracy of the given response by comparing it to the reference answer and considering how well it addresses the user's input.

**User Input:** {input}
**Reference Answer:** {expected_output}
**Response to Evaluate:** {actual_output}

Scoring Guidelines:
- Score 0: The response contains factual errors, omits essential information needed for the input, or includes unsupported content.
- Score 0.5: The response is partially accurate. It is generally relevant but includes minor inaccuracies, missing key details, or some unsupported claims.
- Score 1: The response fully aligns with the reference answer with no errors, omissions, or unsupported additions.
"""

When designing judge prompts, be specific about your scoring criteria and reference the evaluation parameters explicitly. This ensures consistent and reliable evaluations.

Example Judge Prompt with `conversation_turns`

When evaluating a SCENARIOS test, you can use the {conversation_turns} placeholder to analyze the entire dialogue. The placeholder will be replaced with a formatted string representing the conversation. Example Prompt:

judge_prompt = """
Evaluate the following conversation for consistency. The agent should not contradict itself across the conversation.

**Conversation History:**
{conversation_turns}

**Evaluation Criteria:**
- Does the agent's final response contradict any of its previous statements?
- Are the agent's responses logically consistent throughout the conversation?

**Scoring Guidelines:**
- Score 1: The agent is fully consistent with no contradictions across all turns.
- Score 0.5: The agent has minor inconsistencies that don't significantly impact the conversation quality.
- Score 0: The agent clearly contradicts itself or provides conflicting information.
"""

Formatted conversation_turns: When the prompt is sent for evaluation, the {conversation_turns} placeholder will be replaced with a formatted string like this:

user: Hello, what can you do?
assistant: I can help with your queries and provide information.
user: What's your return policy?
assistant: Returns are accepted within 30 days of purchase.
user: Can I return items after a month?
assistant: No, our return window is 30 days.

The conversation_turns parameter is only available for SCENARIOS tests and can only be used with the Full Prompt validation method. It provides the complete conversation history, allowing you to evaluate contextual consistency, knowledge retention, and other conversational qualities.

SDK Integration

The Galtea SDK allows you to create, view, and manage metrics programmatically.

Metrics Service SDK

Manage metrics using the Python SDK

Evaluation

The assessment of an evaluation using a specific metric’s criteria

Concepts

Metrics

Test Types

What are Metrics?

Conceptual categories

List of metrics available in the Galtea Platform

Metric Properties

Creating Custom Metrics via Partial Prompt (LLM)

How Partial Prompt (LLM) Work

Evaluation Parameters

Example Judge Prompt

Creating Custom Metrics via Full Prompt (LLM)

How Full Prompt (LLM) Work

Evaluation Process

Example Judge Prompt

Example Judge Prompt with `conversation_turns`

SDK Integration

Metrics Service SDK

Evaluation

Concepts

Metrics

Test Types

​What are Metrics?

​Conceptual categories

​List of metrics available in the Galtea Platform

​Metric Properties

​Creating Custom Metrics via Partial Prompt (LLM)

​How Partial Prompt (LLM) Work

​Evaluation Parameters

​Example Judge Prompt

​Creating Custom Metrics via Full Prompt (LLM)

​How Full Prompt (LLM) Work

​Evaluation Process

​Example Judge Prompt

​Example Judge Prompt with conversation_turns

​SDK Integration

Metrics Service SDK

​Related Concepts

Evaluation

What are Metrics?

Conceptual categories

List of metrics available in the Galtea Platform

Metric Properties

Creating Custom Metrics via Partial Prompt (LLM)

How Partial Prompt (LLM) Work

Evaluation Parameters

Example Judge Prompt

Creating Custom Metrics via Full Prompt (LLM)

How Full Prompt (LLM) Work

Evaluation Process

Example Judge Prompt

Example Judge Prompt with `conversation_turns`

SDK Integration

Related Concepts