BLEU is primarily used for machine translation evaluation; it measures how many n‑grams (phrases of n words) in the candidate translation overlap with those in a set of reference translations.
The BLEU (bilingual evaluation understudy) metric is one of the Deterministic Metrics Galtea exposes to objectively compare a model's output against ground‑truth references. It is most appropriate when you expect the model to reproduce wording very close to the reference (e.g., machine translation or tightly constrained generation).
N‑gram Precision
Compute modified precision for n‑grams of sizes 1, 2, 3, and 4: how many of the candidate’s n‑grams also appear in the reference(s), clipped by the maximum count in the references.
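As an illustration of the clipped counting, here is a minimal Python sketch; the `ngrams` and `modified_precision` helpers are illustrative names, not part of Galtea's API:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as many
    times as it appears in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # The clip for each n-gram is its maximum count across all references.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

candidate = "the cat sat on the mat".split()
references = ["the cat is on the mat".split()]
print(modified_precision(candidate, references, 1))  # unigram precision: 5/6 ≈ 0.83
```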
Geometric Mean of Precisions
Combine the n‑gram precisions with a geometric mean to balance contributions from different n‑gram sizes.
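For illustration, the combination step can be sketched as below; note that without smoothing, any zero precision collapses the geometric mean to zero:

```python
import math

def geometric_mean(precisions):
    """Geometric mean of the n-gram precisions; zero if any precision is zero."""
    if any(p == 0 for p in precisions):
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / len(precisions))

print(geometric_mean([0.83, 0.6, 0.5, 0.4]))  # combines p1..p4 into a single value
```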
Brevity Penalty (BP)
Penalize overly short candidates:

$$
BP =
\begin{cases}
1 & \text{if } c > r \\
e^{\,(1 - r/c)} & \text{if } c \le r
\end{cases}
$$

where $c$ is the candidate length and $r$ is the reference length.
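A small sketch of the penalty, assuming $c$ and $r$ are token counts:

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """BP = 1 if the candidate is longer than the reference, else exp(1 - r/c)."""
    if candidate_len > reference_len:
        return 1.0
    if candidate_len == 0:
        return 0.0
    return math.exp(1 - reference_len / candidate_len)

print(brevity_penalty(5, 6))  # ≈ 0.82: the short candidate is penalized
```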
Final Score

$$
\text{BLEU} = BP \cdot \exp\!\left( \sum_{n=1}^{4} w_n \log p_n \right)
$$

where $p_n$ are the modified n‑gram precisions and $w_n$ are their weights (uniform, $w_n = \tfrac{1}{4}$).
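Putting the pieces together, a self-contained sentence-level sketch of the formula above (uniform weights, no smoothing) might look like the following; it is illustrative only, not Galtea's internal implementation:

```python
import math
from collections import Counter

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU sketch with uniform weights w_n = 1/max_n."""
    weights = [1.0 / max_n] * max_n
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        max_ref_counts = Counter()
        for ref in references:
            for gram, count in Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1)).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # no smoothing: any zero precision zeroes the score
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: r is the closest reference length (ties go to the shorter reference).
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1 - r / c)

    return bp * math.exp(sum(w * lp for w, lp in zip(weights, log_precisions)))

candidate = "the cat is on the mat".split()
references = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(bleu(candidate, references))  # identical to one reference -> 1.0
```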
BLEU returns a value between 0 and 1 in this implementation.
≥ 0.6 – Very strong overlap / near-reference quality.
0.3 – 0.6 – Moderate overlap; acceptable for some machine translation tasks.
< 0.3 – Low overlap; likely poor or very different phrasing.
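As a quick sanity check of these bands, a reference implementation such as NLTK's `sentence_bleu` (shown here purely for illustration; it is not necessarily the library behind this metric) behaves as expected on near-copies versus paraphrases:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Smoothing avoids a hard zero when a higher-order n-gram is missing.
smooth = SmoothingFunction().method1

reference = ["the quick brown fox jumps over the lazy dog".split()]
near_copy = "the quick brown fox jumped over the lazy dog".split()
paraphrase = "a fast auburn fox leaps above a sleepy dog".split()

print(sentence_bleu(reference, near_copy, smoothing_function=smooth))   # strong overlap -> high score
print(sentence_bleu(reference, paraphrase, smoothing_function=smooth))  # different wording -> low score
```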