BLEU is primarily used for machine translation evaluation; it measures how many n‑grams (phrases of n words) in the candidate translation overlap with those in a set of reference translations.
The BLEU (bilingual evaluation understudy) metric is one of the Deterministic Metrics Galtea exposes to objectively compare a model's output against ground‑truth references. It is most appropriate when you expect the model to reproduce wording very close to the reference (e.g., machine translation or tightly constrained generation).
N‑gram Precision
Compute modified precision for n‑grams of sizes 1, 2, 3, and 4: how many of the candidate’s n‑grams also appear in the reference(s), clipped by the maximum count in the references.
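As an illustration of the clipped counting, here is a minimal Python sketch; the `ngrams` and `modified_precision` helpers are illustrative names, not part of Galtea's API:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as many
    times as it appears in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # The clip for each n-gram is its maximum count across all references.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

candidate = "the cat sat on the mat".split()
references = ["the cat is on the mat".split()]
print(modified_precision(candidate, references, 1))  # unigram precision: 5/6 ≈ 0.83
```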
Geometric Mean of Precisions
Combine the n‑gram precisions with a geometric mean to balance contributions from different n‑gram sizes.
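For illustration, the combination step can be sketched as below; note that without smoothing, any zero precision collapses the geometric mean to zero:

```python
import math

def geometric_mean(precisions):
    """Geometric mean of the n-gram precisions; zero if any precision is zero."""
    if any(p == 0 for p in precisions):
        return 0.0
    return math.exp(sum(math.log(p) for p in precisions) / len(precisions))

print(geometric_mean([0.83, 0.6, 0.5, 0.4]))  # combines p1..p4 into a single value
```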
Brevity Penalty (BP)
Penalize overly short candidates:

$$
BP =
\begin{cases}
1 & \text{if } c > r \\
e^{\,(1 - r/c)} & \text{if } c \le r
\end{cases}
$$

where $c$ is the candidate length and $r$ is the reference length.
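A small sketch of the penalty, assuming $c$ and $r$ are token counts:

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """BP = 1 if the candidate is longer than the reference, else exp(1 - r/c)."""
    if candidate_len > reference_len:
        return 1.0
    if candidate_len == 0:
        return 0.0
    return math.exp(1 - reference_len / candidate_len)

print(brevity_penalty(5, 6))  # ≈ 0.82: the short candidate is penalized
```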
Final Score

$$
\text{BLEU} = BP \cdot \exp\!\left( \sum_{n=1}^{4} w_n \log p_n \right)
$$

where $p_n$ are the modified n‑gram precisions and $w_n$ are their weights (uniform, $w_n = \tfrac{1}{4}$).
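Putting the pieces together, a self-contained sentence-level sketch of the formula above (uniform weights, no smoothing) might look like the following; it is illustrative only, not Galtea's internal implementation:

```python
import math
from collections import Counter

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU sketch with uniform weights w_n = 1/max_n."""
    weights = [1.0 / max_n] * max_n
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        max_ref_counts = Counter()
        for ref in references:
            for gram, count in Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1)).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(c, max_ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # no smoothing: any zero precision zeroes the score
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty: r is the closest reference length (ties go to the shorter reference).
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    bp = 1.0 if c > r else math.exp(1 - r / c)

    return bp * math.exp(sum(w * lp for w, lp in zip(weights, log_precisions)))

candidate = "the cat is on the mat".split()
references = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(bleu(candidate, references))  # identical to one reference -> 1.0
```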
BLEU returns a value between 0 and 1 in this implementation.
≥ 0.6 – Very strong overlap / near-reference quality.
0.3 – 0.6 – Moderate overlap; acceptable for some machine translation tasks.
< 0.3 – Low overlap; likely poor or very different phrasing.
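As a quick sanity check of these bands, a reference implementation such as NLTK's `sentence_bleu` (shown here purely for illustration; it is not necessarily the library behind this metric) behaves as expected on near-copies versus paraphrases:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Smoothing avoids a hard zero when a higher-order n-gram is missing.
smooth = SmoothingFunction().method1

reference = ["the quick brown fox jumps over the lazy dog".split()]
near_copy = "the quick brown fox jumped over the lazy dog".split()
paraphrase = "a fast auburn fox leaps above a sleepy dog".split()

print(sentence_bleu(reference, near_copy, smoothing_function=smooth))   # strong overlap -> high score
print(sentence_bleu(reference, paraphrase, smoothing_function=smooth))  # different wording -> low score
```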