Factual Accuracy
Evaluates whether the generated output is factually accurate and adequately addresses the user’s input when compared to the reference answer.
The Factual Accuracy metric is a non-deterministic Metric Type used by Galtea to evaluate the performance of your AI products, especially in Retrieval-Augmented Generation (RAG) and question answering systems. It measures whether the information in the model’s output is factually correct, complete, and appropriately addresses the user’s input when compared to a trusted reference answer.
This metric helps ensure that your LLM-generated responses are not only relevant and factually accurate, but also comprehensive enough to adequately answer the user’s question, reducing the risk of hallucinations, misinformation, or incomplete responses in your product’s outputs.
Evaluation Parameters
To compute the factual_accuracy metric, the following parameters are required:
- input: The original user query or question that prompted the response.
- expected_output: The reference or ground truth answer that the model's output should be compared against.
- actual_output: The response generated by your LLM application.
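For illustration, the snippet below shows how these three parameters might be assembled for a single test case. This is a minimal sketch: the evaluation_params name and the commented-out evaluate_factual_accuracy call are hypothetical placeholders that show how the parameters fit together, not the actual Galtea SDK API.

```python
# Minimal sketch of the three evaluation parameters for one test case.
evaluation_params = {
    "input": "In what year did Apollo 11 land on the Moon?",
    "expected_output": "Apollo 11 landed on the Moon in 1969.",
    "actual_output": "The Apollo 11 mission landed on the Moon in 1969.",
}

# Hypothetical call shape (not the actual SDK API): the metric receives
# all three parameters and returns a score of 1, 0.5, or 0.
# score = evaluate_factual_accuracy(**evaluation_params)
```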
How Is It Calculated?
Factual Accuracy is computed using an LLM-as-a-judge process that evaluates the response on three key dimensions: factual correctness, completeness relative to the user’s input, and absence of unsupported claims. The evaluation process involves:
1. Input Contextualization: The LLM considers the user's original input to understand what information is essential for an adequate response.
2. Fact Verification: Each factual claim in the actual_output is checked for correctness against the expected_output.
3. Completeness Assessment: The response is evaluated to ensure it contains all essential information needed to properly address the user's input.
4. Unsupported Content Detection: The LLM identifies any claims or information in the response that are not supported by the reference answer.
5. Score Assignment: The final score is determined based on a three-tier scale:
- Score 1: Response fully aligns with the reference answer with no errors, omissions, or unsupported additions
- Score 0.5: Response is generally relevant but includes minor inaccuracies, missing key details, or some unsupported claims
- Score 0: Response contains factual errors, omits essential information needed for the input, or includes unsupported content
The scoring system provides granular feedback on response quality, allowing you to identify and address different types of factual accuracy issues in your AI system.
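To make the process above more concrete, here is a minimal Python sketch of an LLM-as-a-judge evaluation. The prompt wording, the verdict labels, and the call_your_llm helper are illustrative assumptions chosen for this example; they show the shape of the computation rather than Galtea's actual prompt or implementation.

```python
# Illustrative judge prompt; the wording is an assumption for this sketch,
# not Galtea's actual prompt.
JUDGE_PROMPT_TEMPLATE = """You are evaluating an answer for factual accuracy.

User input:
{input}

Reference answer (ground truth):
{expected_output}

Model answer:
{actual_output}

Check every factual claim in the model answer against the reference answer,
verify that all information essential to answering the user input is present,
and flag any claims not supported by the reference answer.
Reply with exactly one label: FULLY_ALIGNED, MINOR_ISSUES, or MAJOR_ISSUES."""


def assign_factual_accuracy_score(verdict: str) -> float:
    """Map the judge's verdict onto the three-tier scale described above."""
    return {
        "FULLY_ALIGNED": 1.0,  # no errors, omissions, or unsupported additions
        "MINOR_ISSUES": 0.5,   # minor inaccuracies, missing details, or some unsupported claims
        "MAJOR_ISSUES": 0.0,   # factual errors, missing essentials, or unsupported content
    }[verdict]


# Example usage (with the evaluation_params dictionary from the earlier sketch):
# prompt = JUDGE_PROMPT_TEMPLATE.format(**evaluation_params)
# verdict = call_your_llm(prompt)                  # hypothetical LLM call
# score = assign_factual_accuracy_score(verdict)   # -> 1.0, 0.5, or 0.0
```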