Ways to evaluate and score product performance

To create a custom metric, subclass the CustomScoreMetric base class and pass an instance of it to the evaluation task creation method. See more in our example. With the CustomScoreMetric class in the SDK, you can execute your custom logic locally and have the scores seamlessly uploaded to the platform for visualization.

Metric | Category | Description |
---|---|---|
Factual Accuracy | RAG | Measures the quality of your RAG pipeline’s generator by evaluating whether the actual_output factually aligns with the expected_output. |
Resilience To Noise | RAG | Evaluates whether the generated output is resilient to noisy input, such as typos, OCR/ASR errors, and distracting content. |
Answer Relevancy | RAG | Measures the quality of your RAG pipeline’s generator by evaluating how relevant the actual_output of your LLM application is compared to the provided input. |
Faithfulness | RAG | Measures the quality of your RAG pipeline’s generator by evaluating whether the actual_output factually aligns with the contents of your retrieval_context. |
Contextual Precision | RAG | Measures your RAG pipeline’s retriever by evaluating whether nodes in your retrieval_context that are relevant to the given input are ranked higher than irrelevant ones. |
Contextual Recall | RAG | Measures the quality of your RAG pipeline’s retriever by evaluating the extent to which the retrieval_context aligns with the expected_output. |
Contextual Relevancy | RAG | Measures the quality of your RAG pipeline’s retriever by evaluating the overall relevance of the information presented in your retrieval_context for a given input. |
Custom Judge | RAG | A flexible LLM-as-a-judge evaluation metric designed to assess model outputs based on customizable product rubrics, domain-specific behavior expectations, and grounded context. |
BLEU | Deterministic | Measures how many n-grams in the actual output overlap with those in the expected output. |
ROUGE | Deterministic | Evaluates automatic summarization by measuring the longest common subsequence that preserves the word order between actual output and expected output summaries. |
METEOR | Deterministic | Evaluates translation, summarization, and paraphrasing by aligning words using exact matches, stems, or synonyms. |
Text Similarity | Deterministic | Quantifies the overall textual resemblance between a generated summary and a reference summary by using character-level fuzzy matching. |
Text Match | Deterministic | Determines whether generated text matches a reference with high character-level similarity using fuzzy matching, returning a binary outcome based on a threshold. |
IOU | Deterministic | Measures the spatial overlap between a predicted bounding box and one or more reference boxes to quantify alignment in object detection and layout tasks. |
Spatial Match | Deterministic | Performs a binary evaluation of the spatial alignment between a predicted bounding box and reference boxes using the best Intersection over Union (IoU) score to return a pass/fail signal. |
Role Adherence | Conversational | Determines whether your LLM chatbot is able to adhere to its given role throughout a conversation. |
Conversation Completeness | Conversational | Determines whether your LLM chatbot is able to complete an end-to-end conversation by satisfying user needs throughout a conversation. |
Conversation Relevancy | Conversational | Determines whether your LLM chatbot is able to consistently generate relevant responses throughout a conversation. |
Knowledge Retention | Conversational | Assesses the ability of your LLM chatbot to retain factual information presented throughout a conversation. |
Non-Toxic | Red Teaming | Determines whether the responses of your LLM-based product are free of toxic language. |
Unbiased | Red Teaming | Determines whether your LLM output is free of gender, racial, or political bias. |
Misuse Resilience | Red Teaming | Evaluates whether the generated output is resilient to misuse and remains aligned with the product description. |
Data Leakage | Red Teaming | Evaluates whether the LLM returns content that may include sensitive information. |
Jailbreak Resilience | Red Teaming | The previous version of the Jailbreak Resilience v2 metric type. |
Jailbreak Resilience v2 | Red Teaming | Evaluates the ability of an LLM-based product to resist attempts at breaking or manipulating its intended behavior. |
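The custom-metric flow described above can be sketched as follows. The base-class name CustomScoreMetric comes from the SDK, but the method name and signature here are assumptions for illustration, not the SDK's actual API:

```python
# Sketch only: the real SDK base class may expose a different method
# name and signature than `score` shown here.
class CustomScoreMetric:
    """Stand-in for the SDK's CustomScoreMetric base class."""

    def score(self, input: str, actual_output: str) -> float:
        raise NotImplementedError


class KeywordCoverageMetric(CustomScoreMetric):
    """Example custom metric: fraction of expected keywords found
    in the model's output (case-insensitive)."""

    def __init__(self, keywords):
        self.keywords = [k.lower() for k in keywords]

    def score(self, input: str, actual_output: str) -> float:
        text = actual_output.lower()
        hits = sum(1 for k in self.keywords if k in text)
        return hits / len(self.keywords) if self.keywords else 0.0
```

An instance of such a subclass would then be passed to the evaluation task creation method, with the locally computed scores uploaded to the platform.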
Parameter | Description | Availability |
---|---|---|
input | The prompt or query sent to the model. Always required. | All metrics |
actual_output | The actual output generated by the model. | All metrics |
expected_output | The ideal answer for the given input. | All metrics |
context | Additional background information provided to the model alongside the input. | All metrics |
retrieval_context | The context retrieved by your RAG system before sending the user query to your LLM. | All metrics |
product_description | The description of the product. | Custom Judge only |
product_capabilities | The capabilities of the product. | Custom Judge only |
product_inabilities | The product’s known inabilities or restrictions. | Custom Judge only |
product_security_boundaries | The security boundaries of the product. | Custom Judge only |
Note: input (lowercase) must always be included in this list.
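A test case built from these parameters might look like the sketch below. The field names mirror the parameter table above, but the concrete container (plain dict vs. a dataclass or builder) is an assumption, not the SDK's actual schema:

```python
# Illustrative only: the real SDK may use a typed test-case object
# rather than a plain dict.
test_case = {
    "input": "What is the capital of France?",      # always required
    "actual_output": "Paris is the capital of France.",
    "expected_output": "Paris",
    "retrieval_context": ["France's capital is Paris."],
}

# `input` (lowercase) must always be present.
assert "input" in test_case
```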
When you provide your own evaluation_steps, G-Eval skips auto-generation, yielding more reliable, reproducible scores ideal for production.

Aspect | Evaluation Criteria | Evaluation Steps |
---|---|---|
Definition | High-level qualities that define what makes a response good or bad | Step-by-step actions to measure a response’s quality |
Purpose | Establish broad goals for evaluation | Provide a systematic method to assess responses |
Focus | What should be measured | How to measure it |
Examples | Accuracy, conciseness, relevance, fluency | Compare facts, check for contradictions, assess completeness |
Flexibility | General principles that apply across many use cases | Specific steps that vary depending on the system |
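To make the criteria-versus-steps distinction concrete, an accuracy criterion might be pinned down into explicit evaluation steps like the sketch below. The wording of these steps is illustrative, and the variable names are not part of any SDK schema:

```python
# Illustrative only: one high-level criterion versus the explicit,
# ordered steps that would be passed as `evaluation_steps`.
accuracy_criteria = "The response should be factually accurate and complete."

accuracy_steps = [
    "Extract the factual claims made in the actual output.",
    "Check each claim against the expected output for contradictions.",
    "Verify that no key fact from the expected output is missing.",
    "Lower the score for each contradiction or omission found.",
]
```

Supplying an explicit list like accuracy_steps is what makes G-Eval skip auto-generation and produce reproducible scores.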
When you omit evaluation_steps, G-Eval auto-generates them from your criteria via a chain-of-thought prompt, making setup fast but introducing some variability. To produce a score, G-Eval presents your test case parameters (input, actual_output, etc.) to the judge LLM, asks it to rate on a 1–5 scale, then normalizes the result to 0–1 via token-probability weighting.
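The normalization step can be sketched as follows, assuming the weighting is the probability-weighted expected rating mapped linearly from [1, 5] to [0, 1]; the exact scheme used by the platform may differ:

```python
def normalize_geval_score(token_probs):
    """Weight each 1-5 rating by its token probability, then map the
    expected rating linearly onto [0, 1].

    token_probs: dict mapping rating (1-5) -> probability; the
    probabilities should sum to ~1.
    """
    expected = sum(rating * p for rating, p in token_probs.items())  # in [1, 5]
    return (expected - 1) / 4  # linear map [1, 5] -> [0, 1]

# e.g. the judge puts 70% mass on rating 4 and 30% on rating 5:
normalize_geval_score({4: 0.7, 5: 0.3})  # -> ~0.825
```

Weighting by token probabilities smooths the score: instead of a hard 4 or 5, the metric reflects how confident the judge LLM was in each rating.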