The METEOR (Metric for Evaluation of Translation with Explicit ORdering) metric is one of the Classic Metric Types Galtea exposes for evaluating machine translation, summarization, and paraphrasing tasks. It aims to correlate more closely with human judgments than BLEU does.

Evaluation Parameters

To compute the METEOR metric, the following parameters are required (a usage sketch follows the list):
  • actual_output: The model’s generated text.
  • expected_output: The reference (or gold) text to compare against.
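
If you want to reproduce the score outside Galtea, the snippet below is a minimal sketch using NLTK's meteor_score; this is an assumption for illustration, and Galtea's internal scorer may differ. The variable names mirror the parameters above.

```python
# A minimal sketch, assuming NLTK's meteor_score implementation;
# Galtea's internal scorer may differ in weights and preprocessing.
import nltk
from nltk.translate.meteor_score import meteor_score

# WordNet data is needed for the synonym-matching stage.
nltk.download("wordnet", quiet=True)

expected_output = "The quick brown fox jumps over the lazy dog."
actual_output = "A fast brown fox leaps over the lazy dog."

# Recent NLTK versions require pre-tokenized input; the reference is
# wrapped in a list because METEOR can score against multiple gold texts.
score = meteor_score([expected_output.split()], actual_output.split())
print(f"METEOR: {score:.3f}")
```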

How Is It Calculated?

METEOR improves upon BLEU by considering semantic and morphological matches (a worked example follows the steps):
  1. Alignment
    Tokens are matched between candidate and reference using:
    • Exact matches
    • Stems (e.g., “run” vs. “running”)
    • Synonyms (e.g., “big” vs. “large”)
  2. Precision & Recall
    Both are calculated from the aligned tokens.
  3. Fragmentation Penalty
    A penalty is applied if matches are scattered (fragmented alignment).
  4. Final Score
    The score is computed as:
    $METEOR = F_{mean} \times (1 - Penalty)$, where $F_{mean}$ is the harmonic mean of precision and recall.
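
To make steps 2–4 concrete, the toy function below computes the final score from an alignment, assuming the original Banerjee and Lavie (2005) parameterization: recall is weighted 9:1 against precision in the harmonic mean, and the penalty is 0.5 × (chunks / matches)³. Tuned METEOR variants use different weights.

```python
# A toy sketch of steps 2-4, assuming the original METEOR weights;
# `matches` counts aligned tokens, `chunks` counts contiguous runs of matches.
def meteor_from_alignment(matches: int, cand_len: int, ref_len: int, chunks: int) -> float:
    precision = matches / cand_len            # matched tokens / candidate length
    recall = matches / ref_len                # matched tokens / reference length
    f_mean = (10 * precision * recall) / (recall + 9 * precision)
    penalty = 0.5 * (chunks / matches) ** 3   # fragmentation penalty
    return f_mean * (1 - penalty)

# 6 aligned tokens in a 7-token candidate vs. an 8-token reference,
# grouped into 2 contiguous chunks:
print(round(meteor_from_alignment(matches=6, cand_len=7, ref_len=8, chunks=2), 3))  # 0.745
```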
METEOR returns a score between 0 and 1 (a helper applying these bands follows the list):
  • ≥ 0.6 – High-quality translation/summary with semantic fidelity.
  • 0.3 – 0.6 – Moderate quality; some paraphrasing or structural divergence.
  • < 0.3 – Low-quality or semantically incorrect output.
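
As a convenience, the bands above can be applied mechanically; note that the thresholds are the heuristics from this guide, not part of the metric definition.

```python
# A small helper mapping a METEOR score to the heuristic bands above.
def interpret_meteor(score: float) -> str:
    if score >= 0.6:
        return "high quality: strong semantic fidelity"
    if score >= 0.3:
        return "moderate quality: paraphrasing or structural divergence"
    return "low quality or semantically incorrect"

print(interpret_meteor(0.745))  # high quality: strong semantic fidelity
```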

Suggested Test Case Types

Use METEOR when evaluating:
  • Machine Translation with varied phrasings.
  • Abstractive Summarization where synonyms are common.
  • Paraphrase Detection with semantic variation.