Evaluation Parameters
To compute the rouge metric, the following parameters are required:
- actual_output: The model’s generated summary.
- expected_output: The reference (or gold) summary to compare against.
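For instance, a single evaluation record might pair the two fields like this (the plain dictionary wrapper and the sample sentences below are purely illustrative assumptions; only the two field names come from this metric's definition):

```python
# Illustrative only: the dict wrapper and the sentences are assumptions;
# only the actual_output / expected_output field names come from the metric.
test_case = {
    "actual_output": "The central bank raised rates to curb inflation.",
    "expected_output": "Interest rates were raised by the central bank to fight inflation.",
}
```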
How Is It Calculated?
This implementation uses ROUGE-L, which focuses on the Longest Common Subsequence (LCS) between the candidate and reference:
- Longest Common Subsequence: Identifies the longest sequence of words that appears in both the candidate and the reference (not necessarily contiguous, but in the same order).
- Precision & Recall:
  - Precision (P) = LCS length / candidate length
  - Recall (R) = LCS length / reference length
- F1 Score: Combines precision and recall as F1 = (2 × P × R) / (P + R); a code sketch of the full calculation follows the score guide below.

The resulting F1 score can be interpreted roughly as follows:
- ≥ 0.5 – Strong overlap with the reference summary.
- 0.3 – 0.5 – Moderate overlap; acceptable for abstractive summarization.
- < 0.3 – Weak overlap; likely missing key content.
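As a rough sketch of how these three numbers come together, the standalone Python snippet below computes ROUGE-L precision, recall, and F1 using naive whitespace tokenization. It is not any particular library's code; real implementations typically add stemming and proper tokenization.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the Longest Common Subsequence, via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, tok_a in enumerate(a, start=1):
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


def rouge_l(actual_output: str, expected_output: str) -> dict:
    """ROUGE-L precision, recall, and F1 between a candidate and a reference."""
    candidate = actual_output.split()   # naive whitespace tokenization (assumption)
    reference = expected_output.split()
    lcs = lcs_length(candidate, reference)
    precision = lcs / len(candidate) if candidate else 0.0
    recall = lcs / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


scores = rouge_l(
    actual_output="The central bank raised rates to curb inflation.",
    expected_output="Interest rates were raised by the central bank to fight inflation.",
)
print(scores)
```

The dynamic-programming LCS runs in O(m × n) time for token counts m and n, which is more than fast enough for sentence- or paragraph-length summaries.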
Suggested Test Case Types
Use ROUGE when evaluating:
- Abstractive or extractive summaries.
- Headline generation where recall of important tokens matters.
- Content coverage tests for long text generation.