Evaluation Parameters
To compute the factual_accuracy metric, the following parameters are required:
- input: The original user query or question that prompted the response.
- expected_output: The reference or ground truth answer that the model’s output is compared against.
- actual_output: The response generated by your LLM application.
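For illustration, here is a minimal sketch of how these three parameters might be bundled for an evaluation run. The FactualAccuracyCase class and the example strings are hypothetical, not part of the Galtea API:

```python
from dataclasses import dataclass

@dataclass
class FactualAccuracyCase:
    """Bundles the three required parameters (hypothetical helper class)."""
    input: str            # the original user query or question
    expected_output: str  # the reference (ground truth) answer
    actual_output: str    # the response generated by the LLM application

case = FactualAccuracyCase(
    input="When was the Eiffel Tower completed?",
    expected_output="The Eiffel Tower was completed in 1889.",
    actual_output="It was finished in 1889, in time for the World's Fair.",
)
```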
How Is It Calculated?
Factual Accuracy is computed using an LLM-as-a-judge process that evaluates the response on three key dimensions: factual correctness, completeness relative to the user’s input, and absence of unsupported claims. The evaluation involves the following steps (a sketch of the flow follows the list):
- Input Contextualization: The LLM considers the user’s original input to understand what information is essential for an adequate response.
- Fact Verification: Each factual claim in the actual_output is checked for correctness against the expected_output.
- Completeness Assessment: The response is evaluated to ensure it contains all essential information needed to properly address the user’s input.
- Unsupported Content Detection: The LLM identifies any claims or information in the response that are not supported by the reference answer.
- Score Assignment: The final score is assigned on a three-tier scale:
- Score 1: Response fully aligns with the reference answer with no errors, omissions, or unsupported additions
- Score 0.5: Response is generally relevant but includes minor inaccuracies, missing key details, or some unsupported claims
- Score 0: Response contains factual errors, omits essential information needed for the input, or includes unsupported content
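To make the flow concrete, here is a minimal sketch of one way the judge step and the three-tier mapping could be wired together. The prompt wording, the call_llm callable, and the score parsing are assumptions for illustration, not the exact prompt or code Galtea uses:

```python
from typing import Callable

# Illustrative judge prompt covering fact verification, completeness,
# and unsupported-content detection (wording is an assumption).
JUDGE_PROMPT = """\
You are grading the factual accuracy of a response.

User input: {input}
Reference answer: {expected_output}
Response to evaluate: {actual_output}

Check each factual claim in the response against the reference answer,
verify that all information essential to the user input is present, and
flag any claims the reference does not support.

Answer with exactly one score:
1 (fully aligned: no errors, omissions, or unsupported additions),
0.5 (generally relevant: minor inaccuracies, missing key details, or
some unsupported claims), or
0 (factual errors, missing essential information, or unsupported content).
"""

VALID_SCORES = {"0", "0.5", "1"}

def parse_score(judge_reply: str) -> float:
    """Map the judge's verdict onto the three-tier 0 / 0.5 / 1 scale."""
    for token in judge_reply.replace(",", " ").split():
        if token in VALID_SCORES:
            return float(token)
    raise ValueError(f"No valid score in judge reply: {judge_reply!r}")

def factual_accuracy(
    input: str,
    expected_output: str,
    actual_output: str,
    call_llm: Callable[[str], str],  # any function that sends a prompt to an LLM
) -> float:
    """Run the LLM-as-a-judge once and return the final score."""
    prompt = JUDGE_PROMPT.format(
        input=input,
        expected_output=expected_output,
        actual_output=actual_output,
    )
    return parse_score(call_llm(prompt))
```

In practice the judge reply comes from a real model call; parse_score simply enforces that the verdict lands on one of the three allowed values.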
This metric is inspired by best practices in the open-source community and is implemented natively in the Galtea platform.