Evaluation Parameters
To compute the `bleu` metric, the following parameters are required:
- `actual_output`: The model's generated text.
- `expected_output`: The reference (or gold) text to compare against.
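As a quick illustration of how these two inputs feed the score, here is a minimal sketch using NLTK's `sentence_bleu`. This is only an assumption for demonstration purposes; the metric's own backend, tokenization, and smoothing may differ.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Illustrative inputs; the variable names mirror the metric's parameters.
actual_output = "The cat sits on the mat."
expected_output = "The cat is sitting on the mat."

# NLTK expects a list of tokenized references and one tokenized hypothesis.
reference_tokens = [expected_output.lower().split()]
candidate_tokens = actual_output.lower().split()

# Smoothing avoids a hard zero when a higher-order n-gram has no match.
score = sentence_bleu(
    reference_tokens,
    candidate_tokens,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(f"BLEU: {score:.3f}")  # value in [0, 1]
```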
How Is It Calculated?
Conceptually, BLEU is computed in the following steps (a from-scratch sketch follows the list):

- **N-gram Precision**: Compute modified precision for n-grams of sizes 1, 2, 3, and 4: how many of the candidate's n-grams also appear in the reference(s), clipped by the maximum count in the references.
- **Geometric Mean of Precisions**: Combine the n-gram precisions with a geometric mean to balance contributions from different n-gram sizes.
- **Brevity Penalty (BP)**: Penalize overly short candidates:

  $$BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}$$

  where $c$ is the candidate length and $r$ is the reference length.
- **Final Score**:

  $$\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{4} w_n \log p_n\right)$$

  where $p_n$ are the modified n-gram precisions and $w_n$ are their weights (uniform, $w_n = 1/4$).
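The steps above can be made concrete with a small from-scratch sketch. It assumes simple whitespace tokenization, a single reference, and no smoothing; production implementations such as NLTK or sacrebleu handle tokenization, multiple references, and edge cases more carefully.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of the token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: candidate counts capped by reference counts."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = max(sum(cand_counts.values()), 1)
    return clipped / total

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    cand = candidate.lower().split()
    ref = reference.lower().split()

    # 1) Modified precisions p_1 .. p_4.
    precisions = [modified_precision(cand, ref, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # any zero precision collapses the geometric mean (no smoothing)

    # 2) Geometric mean with uniform weights w_n = 1/4.
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)

    # 3) Brevity penalty: BP = 1 if c > r, else exp(1 - r/c).
    c, r = len(cand), len(ref)
    bp = 1.0 if c > r else math.exp(1 - r / c)

    # 4) Final score.
    return bp * geo_mean

print(bleu("the cat is sitting on the mat today",
           "the cat is sitting on the mat"))  # ≈ 0.84
```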
As a rough guide (scores on a 0–1 scale):

- ≥ 0.6 – Very strong overlap / near-reference quality.
- 0.3–0.6 – Moderate overlap; acceptable for some machine translation tasks.
- < 0.3 – Low overlap; likely poor or very different phrasing.
Suggested Test Case Types
Use BLEU when you have deterministic, reference-style outputs, such as:
- Machine Translation (primary use case); see the corpus-level example after this list.
- Data-to-Text Generation where the wording is tightly constrained.
- Template-like Summaries or Headline Generation where near-exact phrasing matters.
- Paraphrase Detection (strict) when you want to reward lexical overlap (with caution).
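For the machine-translation case, a corpus-level score over a whole test set is usually more informative than per-sentence scores. Below is a sketch using the sacrebleu package (an assumption; the metric here may use a different backend). Note that sacrebleu reports BLEU on a 0–100 scale, so divide by 100 to compare against the thresholds above.

```python
import sacrebleu

# Hypothetical evaluation set: model outputs and matching reference translations.
actual_outputs = [
    "The weather is nice today.",
    "He bought three apples at the market.",
]
expected_outputs = [
    "The weather is lovely today.",
    "He bought three apples from the market.",
]

# corpus_bleu takes the system outputs and a list of reference streams
# (one stream per reference set; here a single reference per sentence).
result = sacrebleu.corpus_bleu(actual_outputs, [expected_outputs])

print(result.score)        # 0-100 scale
print(result.score / 100)  # rescaled to match the 0-1 thresholds above
```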