Ways to evaluate and score product performance

To create a custom metric, subclass the CustomScoreMetric base class and pass an instance of it to the evaluation task creation method. See more in our example. With the CustomScoreMetric class in the SDK, you can execute your custom logic locally and have the scores seamlessly uploaded to the platform for visualization.

Metric | Category | Description |
---|---|---|
Factual Accuracy | RAG | Measures the quality of your RAG pipeline’s generator by evaluating whether the actual_output factually aligns with the expected_output. |
Resilience To Noise | RAG | Evaluates whether the generated output is resilient to noisy input, such as typos, OCR/ASR errors, and distracting content. |
Answer Relevancy | RAG | Measures the quality of your RAG pipeline’s generator by evaluating how relevant the actual_output of your LLM application is compared to the provided input. |
Faithfulness | RAG | Measures the quality of your RAG pipeline’s generator by evaluating whether the actual_output factually aligns with the contents of your retrieval_context. |
Contextual Precision | RAG | Measures your RAG pipeline’s retriever by evaluating whether nodes in your retrieval_context that are relevant to the given input are ranked higher than irrelevant ones. |
Contextual Recall | RAG | Measures the quality of your RAG pipeline’s retriever by evaluating the extent to which the retrieval_context aligns with the expected_output. |
Contextual Relevancy | RAG | Measures the quality of your RAG pipeline’s retriever by evaluating the overall relevance of the information presented in your retrieval_context for a given input. |
Custom Judge | RAG | A flexible LLM-as-a-judge evaluation metric designed to assess model outputs based on customizable product rubrics, domain-specific behavior expectations, and grounded context. |
BLEU | Deterministic | Measures how many n-grams in the actual output overlap with those in the expected output. |
ROUGE | Deterministic | Evaluates automatic summarization by measuring the longest common subsequence that preserves the word order between actual output and expected output summaries. |
METEOR | Deterministic | Evaluates translation, summarization, and paraphrasing by aligning words using exact matches, stems, or synonyms. |
Text Similarity | Deterministic | Quantifies the overall textual resemblance between a generated summary and a reference summary by using character-level fuzzy matching. |
Text Match | Deterministic | Determines whether generated text matches a reference with high character-level similarity using fuzzy matching, returning a binary outcome based on a threshold. |
IOU | Deterministic | Measures the spatial overlap between a predicted bounding box and one or more reference boxes to quantify alignment in object detection and layout tasks. |
Spatial Match | Deterministic | Performs a binary evaluation of the spatial alignment between a predicted bounding box and reference boxes using the best Intersection over Union (IoU) score to return a pass/fail signal. |
Role Adherence | Conversational | Determines whether your LLM chatbot is able to adhere to its given role throughout a conversation. |
Conversation Completeness | Conversational | Determines whether your LLM chatbot is able to complete an end-to-end conversation by satisfying user needs throughout a conversation. |
Conversation Relevancy | Conversational | Determines whether your LLM chatbot is able to consistently generate relevant responses throughout a conversation. |
Knowledge Retention | Conversational | Assesses the ability of your LLM chatbot to retain factual information presented throughout a conversation. |
Non-Toxic | Red Teaming | Determines whether the responses of your LLM-based product are free of toxic language. |
Unbiased | Red Teaming | Determines whether your LLM output is free of gender, racial, or political bias. |
Misuse Resilience | Red Teaming | Evaluates whether the generated output is resilient to misuse and remains aligned with the product description. |
Data Leakage | Red Teaming | Evaluates whether the LLM returns content that may include sensitive information. |
Jailbreak Resilience | Red Teaming | The previous version of the Jailbreak Resilience v2 metric type. |
Jailbreak Resilience v2 | Red Teaming | Evaluates the ability of an LLM-based product to resist attempts at breaking or manipulating its intended behavior. |
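The custom-metric flow described above can be sketched as follows. The base-class name CustomScoreMetric comes from the SDK, but the method name and signature here are assumptions for illustration, not the SDK's actual API:

```python
# Sketch only: the real SDK base class may expose a different method
# name and signature than `score` shown here.
class CustomScoreMetric:
    """Stand-in for the SDK's CustomScoreMetric base class."""

    def score(self, input: str, actual_output: str) -> float:
        raise NotImplementedError


class KeywordCoverageMetric(CustomScoreMetric):
    """Example custom metric: fraction of expected keywords found
    in the model's output (case-insensitive)."""

    def __init__(self, keywords):
        self.keywords = [k.lower() for k in keywords]

    def score(self, input: str, actual_output: str) -> float:
        text = actual_output.lower()
        hits = sum(1 for k in self.keywords if k in text)
        return hits / len(self.keywords) if self.keywords else 0.0
```

An instance of such a subclass would then be passed to the evaluation task creation method, with the locally computed scores uploaded to the platform.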
Parameter | Description | Availability |
---|---|---|
input | The prompt or query sent to the model. Always required. | All metrics |
actual_output | The actual output generated by the model. | All metrics |
expected_output | The ideal answer for the given input. | All metrics |
context | Additional background information provided to the model alongside the input. | All metrics |
retrieval_context | The context retrieved by your RAG system before sending the user query to your LLM. | All metrics |
product_description | The description of the product. | Custom Judge only |
product_capabilities | The capabilities of the product. | Custom Judge only |
product_inabilities | The product’s known inabilities or restrictions. | Custom Judge only |
product_security_boundaries | The security boundaries of the product. | Custom Judge only |
Note: input (lowercase) must always be included in this list.
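A test case built from these parameters might look like the sketch below. The field names mirror the parameter table above, but the concrete container (plain dict vs. a dataclass or builder) is an assumption, not the SDK's actual schema:

```python
# Illustrative only: the real SDK may use a typed test-case object
# rather than a plain dict.
test_case = {
    "input": "What is the capital of France?",      # always required
    "actual_output": "Paris is the capital of France.",
    "expected_output": "Paris",
    "retrieval_context": ["France's capital is Paris."],
}

# `input` (lowercase) must always be present.
assert "input" in test_case
```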
When you provide your own evaluation_steps, G-Eval skips auto-generation, yielding more reliable, reproducible scores ideal for production.

Aspect | Evaluation Criteria | Evaluation Steps |
---|---|---|
Definition | High-level qualities that define what makes a response good or bad | Step-by-step actions to measure a response’s quality |
Purpose | Establish broad goals for evaluation | Provide a systematic method to assess responses |
Focus | What should be measured | How to measure it |
Examples | Accuracy, conciseness, relevance, fluency | Compare facts, check for contradictions, assess completeness |
Flexibility | General principles that apply across many use cases | Specific steps that vary depending on the system |
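To make the criteria-versus-steps distinction concrete, an accuracy criterion might be pinned down into explicit evaluation steps like the sketch below. The wording of these steps is illustrative, and the variable names are not part of any SDK schema:

```python
# Illustrative only: one high-level criterion versus the explicit,
# ordered steps that would be passed as `evaluation_steps`.
accuracy_criteria = "The response should be factually accurate and complete."

accuracy_steps = [
    "Extract the factual claims made in the actual output.",
    "Check each claim against the expected output for contradictions.",
    "Verify that no key fact from the expected output is missing.",
    "Lower the score for each contradiction or omission found.",
]
```

Supplying an explicit list like accuracy_steps is what makes G-Eval skip auto-generation and produce reproducible scores.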
When you omit evaluation_steps, G-Eval auto-generates them from your criteria via a chain-of-thought prompt, making setup fast but introducing some variability. To produce a score, G-Eval presents your test case parameters (input, actual_output, etc.) to the judge LLM, asks it to rate on a 1–5 scale, then normalizes the result to 0–1 via token-probability weighting.
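The normalization step can be sketched as follows, assuming the weighting is the probability-weighted expected rating mapped linearly from [1, 5] to [0, 1]; the exact scheme used by the platform may differ:

```python
def normalize_geval_score(token_probs):
    """Weight each 1-5 rating by its token probability, then map the
    expected rating linearly onto [0, 1].

    token_probs: dict mapping rating (1-5) -> probability; the
    probabilities should sum to ~1.
    """
    expected = sum(rating * p for rating, p in token_probs.items())  # in [1, 5]
    return (expected - 1) / 4  # linear map [1, 5] -> [0, 1]

# e.g. the judge puts 70% mass on rating 4 and 30% on rating 5:
normalize_geval_score({4: 0.7, 5: 0.3})  # -> ~0.825
```

Weighting by token probabilities smooths the score: instead of a hard 4 or 5, the metric reflects how confident the judge LLM was in each rating.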