Evaluation Parameters
To compute the METEOR metric, the following parameters are required:
- `actual_output`: The model's generated text.
- `expected_output`: The reference (or gold) text to compare against.
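To reproduce the metric locally, the snippet below is a minimal sketch using NLTK's `meteor_score`; the variable names `actual_output` and `expected_output` simply mirror the parameters above, and the example strings are illustrative.

```python
import nltk
from nltk.translate.meteor_score import meteor_score

# WordNet drives the synonym-matching stage; download once per environment.
nltk.download("wordnet", quiet=True)

expected_output = "The cat sat on the mat."       # reference (gold) text
actual_output = "A cat was sitting on the mat."   # model's generated text

# NLTK expects pre-tokenized input: a list of reference token lists
# plus a single hypothesis token list.
score = meteor_score([expected_output.split()], actual_output.split())
print(f"METEOR: {score:.3f}")
```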
How Is It Calculated?
METEOR improves upon BLEU by considering semantic and morphological matches:
- **Alignment**: Tokens are matched between candidate and reference using:
  - Exact matches
  - Stems (e.g., “run” vs. “running”)
  - Synonyms (e.g., “big” vs. “large”)
- **Precision & Recall**: Both are calculated from the aligned tokens.
- **Fragmentation Penalty**: A penalty is applied if matches are scattered (fragmented alignment).
- **Final Score**: The score is computed as

  METEOR = F_mean × (1 − Penalty)

  where F_mean is the harmonic mean of precision and recall (weighted toward recall in the original formulation) and the penalty grows as the aligned matches fragment into more chunks.
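As a concrete sketch of how these quantities combine, the function below uses the original METEOR parameterization (recall weighted 9:1 inside F_mean, penalty coefficient 0.5 with exponent 3); `meteor_from_counts` is a hypothetical helper for illustration, not part of any library.

```python
def meteor_from_counts(matches: int, hyp_len: int, ref_len: int, chunks: int) -> float:
    """Combine alignment statistics into a METEOR score
    using the original parameter settings."""
    if matches == 0:
        return 0.0
    precision = matches / hyp_len   # matched tokens / candidate length
    recall = matches / ref_len      # matched tokens / reference length
    # Weighted harmonic mean, favoring recall 9:1.
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # Penalty grows as the same matches split into more chunks.
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)

# Example: 6 matched tokens in 2 contiguous chunks, a 7-token
# candidate against an 8-token reference -> roughly 0.75.
print(meteor_from_counts(matches=6, hyp_len=7, ref_len=8, chunks=2))
```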
Score Interpretation
- ≥ 0.6 – High-quality translation/summary with semantic fidelity.
- 0.3 – 0.6 – Moderate quality; some paraphrasing or structural divergence.
- < 0.3 – Low-quality or semantically incorrect output.
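These thresholds can be applied mechanically when triaging results; `interpret_meteor` below is a hypothetical convenience wrapper around the bands above, not a library function.

```python
def interpret_meteor(score: float) -> str:
    """Map a METEOR score onto the quality bands described above."""
    if score >= 0.6:
        return "high quality: semantic fidelity"
    if score >= 0.3:
        return "moderate quality: paraphrasing or structural divergence"
    return "low quality or semantically incorrect"

print(interpret_meteor(0.74))  # high quality: semantic fidelity
```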
Suggested Test Case Types
Use METEOR when evaluating (see the sketch after this list):
- Machine Translation with varied phrasings.
- Abstractive Summarization where synonyms are common.
- Paraphrase Detection with semantic variation.
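To see why METEOR fits these paraphrase-heavy tasks, the sketch below (again using NLTK's implementation, with the WordNet data installed as shown earlier) scores a candidate that swaps “big” for its synonym “large”; an exact-overlap metric would miss that match, while METEOR still aligns it.

```python
from nltk.translate.meteor_score import meteor_score

reference = "the big dog ran home".split()
candidate = "the large dog ran home".split()

# "large" aligns with "big" through WordNet synonymy, so the score
# stays high even though the surface tokens differ.
print(f"{meteor_score([reference], candidate):.3f}")
```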