The Factual Accuracy metric is a non-deterministic Metric Type used by Galtea to evaluate the performance of your AI products, especially Retrieval-Augmented Generation (RAG) and question-answering systems. It measures whether the information in the model’s output is factually correct when compared to a trusted reference answer.

This metric helps ensure that your LLM-generated responses are not only relevant, but also factually accurate, reducing the risk of hallucinations or misinformation in your product’s outputs.

Evaluation Parameters

To compute the factual_accuracy metric, the following parameters are required:

  • expected_output: The reference or ground truth answer that the model’s output should be compared against.
  • actual_output: The response generated by your LLM application.
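As a minimal, self-contained sketch, the snippet below shows the shape of these two parameters for a single test case. It is illustrative only; the example values are invented, and no Galtea client call is shown because the exact SDK surface is not covered in this section.

```python
# A minimal sketch of the two required evaluation parameters.
# The values are hypothetical examples, not real test data.
evaluation_params = {
    # Trusted reference (ground truth) the output is judged against.
    "expected_output": "The Eiffel Tower is 330 metres tall and was completed in 1889.",
    # Response generated by the LLM application under test.
    "actual_output": "Completed in 1889, the Eiffel Tower stands about 330 metres high.",
}
```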

How Is It Calculated?

Factual Accuracy is computed using an LLM-as-a-judge process. The LLM compares the actual_output to the expected_output and determines whether the generated response is factually correct, complete, and free of hallucinations. The process typically involves:

  1. Fact Extraction: The LLM identifies key facts or statements in both the expected_output and actual_output.
  2. Fact Comparison: Each fact in the actual_output is checked for correctness against the expected_output.
  3. Score Calculation: The final score reflects the proportion of facts in the actual_output that are accurate and supported by the expected_output.

A higher score indicates that the model’s output is more factually aligned with the reference answer.
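To make the scoring arithmetic concrete, here is a minimal sketch of the three steps above. In a real pipeline a judge LLM performs the fact extraction and the semantic comparison; both are stubbed here with hand-written fact lists and exact set membership so the score calculation is easy to follow. Nothing below is Galtea’s actual implementation.

```python
# Step 1 (fact extraction): facts a judge LLM might pull from each output.
actual_facts = [
    "The Eiffel Tower was completed in 1889",
    "The Eiffel Tower is 330 metres tall",
    "The Eiffel Tower is in Berlin",  # a hallucinated fact
]
expected_facts = {
    "The Eiffel Tower was completed in 1889",
    "The Eiffel Tower is 330 metres tall",
    "The Eiffel Tower is in Paris",
}

# Step 2 (fact comparison): in practice the judge LLM decides whether each
# fact is semantically supported; exact set membership stands in for that.
supported = [fact for fact in actual_facts if fact in expected_facts]

# Step 3 (score calculation): proportion of facts in the actual_output
# that are supported by the expected_output.
score = len(supported) / len(actual_facts)
print(f"factual_accuracy = {score:.2f}")  # 2 of 3 facts supported -> 0.67
```

The hallucinated third fact drags the score down even though the other two are correct, which is exactly how unsupported claims reduce factual alignment with the reference answer.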

This metric is inspired by best practices in the open-source community and is implemented natively in the Galtea platform. For more on factual accuracy evaluation, see the deepeval documentation.