The BLEU (bilingual evaluation understudy) metric is one of the Deterministic Metrics Galtea exposes to objectively compare a model’s output against ground‑truth references. It is most appropriate when you expect the model to reproduce wording very close to the reference (e.g., machine translation or tightly constrained generation).
Evaluation Parameters
To compute the bleu metric, the following parameters are required:
- actual_output: The model’s generated text.
- expected_output: The reference (or gold) text to compare against.
How Is It Calculated?
Conceptually, BLEU is computed as:
- N‑gram Precision: Compute modified precision for n‑grams of sizes 1, 2, 3, and 4: how many of the candidate’s n‑grams also appear in the reference(s), with each n‑gram’s count clipped at its maximum count in any single reference.
- Geometric Mean of Precisions: Combine the n‑gram precisions with a geometric mean to balance contributions from different n‑gram sizes.
- Brevity Penalty (BP): Penalize overly short candidates:
  $$\mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}$$
  where $c$ is candidate length and $r$ is reference length.
- Final Score:
  $$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{4} w_n \log p_n \right)$$
  where $p_n$ are the modified n‑gram precisions and $w_n$ are their weights (uniform, $w_n = 1/4$).
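The steps above can be sketched in plain Python. This is a minimal illustration, not Galtea's implementation: it assumes whitespace tokenization, applies no smoothing (so any n‑gram order with zero matches drives the score to 0), and picks the reference length closest to the candidate length when several references are given. The function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of the candidate against one or more references."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    # For each n-gram, the clip limit is its maximum count in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(c, r):
    """BP = 1 if c > r, else exp(1 - r / c)."""
    if c == 0:
        return 0.0
    return 1.0 if c > r else math.exp(1.0 - r / c)


def bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU with uniform weights 1 / max_n (assumption)."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0  # log(0) is undefined; without smoothing the score collapses to 0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c = len(candidate)
    # Effective reference length: the reference length closest to the candidate length.
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    return brevity_penalty(c, r) * math.exp(log_mean)


# Whitespace tokenization is an assumption made for this sketch.
reference = "the cat is on the mat".split()
candidate = "the cat is on mat".split()
print(round(bleu(candidate, [reference]), 3))  # prints 0.579
```

In this example the precisions are 1, 3/4, 2/3, and 1/2, whose geometric mean is about 0.707. The candidate is one token shorter than the reference, so the brevity penalty is exp(1 - 6/5), about 0.819, giving a score of roughly 0.58.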
BLEU scores range from 0 to 1. As a rough guide:
- ≥ 0.6 – Very strong overlap / near-reference quality.
- 0.3 – 0.6 – Moderate overlap; acceptable for some machine translation tasks.
- < 0.3 – Low overlap; likely poor or very different phrasing.
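If you want to apply these bands programmatically, for example to label test results, a small helper is enough. The function name and the returned labels below are hypothetical and not part of any Galtea API; the thresholds are the ones listed above, with a score of exactly 0.6 counted in the top band and exactly 0.3 in the middle band.

```python
def interpret_bleu(score):
    """Map a BLEU score in [0, 1] to one of the rough interpretation bands."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("BLEU score must be between 0 and 1")
    if score >= 0.6:
        return "very strong overlap"
    if score >= 0.3:
        return "moderate overlap"
    return "low overlap"
```

Treat the result as a rough signal only: what counts as an acceptable score depends on the task and on how many references are available.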
Suggested Test Case Types
Use BLEU when you have deterministic, reference-style outputs, such as:
- Machine Translation (primary use case).
- Data-to-Text Generation where the wording is tightly constrained.
- Template-like Summaries or Headline Generation where near-exact phrasing matters.
- Paraphrase Detection (strict) when you want to reward lexical overlap (with caution).