The IOU (Intersection Over Union) metric is part of the Deterministic Metric options in Galtea. It evaluates the degree of spatial overlap between one or more predicted bounding boxes and a single ground truth box. This is especially useful for tasks involving layout analysis, bounding box prediction, or image-based text extraction, where location matters as much as content.

Evaluation Parameters

To compute the IOU metric, the following parameters must be provided:
  • actual_output: A list of predicted bounding boxes. Each box must be in [x1, y1, x2, y2] format (coordinates of top-left and bottom-right corners). Accepts two formats:
    • JSON array: "[[10, 10, 50, 50], [100, 100, 120, 120]]"
    • JSON object with "bboxes" key: '{"bboxes": [[10, 10, 50, 50], [100, 100, 120, 120]]}'
  • expected_output: A single ground truth bounding box in [x1, y1, x2, y2] format. Accepts two formats:
    • JSON array: "[10, 10, 50, 50]"
    • JSON object with "bbox" key (singular): '{"bbox": [10, 10, 50, 50]}'
Important: Both actual_output and expected_output must be valid JSON. Truncated or malformed JSON will cause evaluation failures.
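
For illustration, the snippet below shows one way to construct valid JSON for both parameters. This is a minimal sketch; the variable names are hypothetical and not part of the Galtea API, and serializing with json.dumps simply guarantees well-formed JSON:

```python
import json

# Two predicted boxes in [x1, y1, x2, y2] format.
predicted_boxes = [[10, 10, 50, 50], [100, 100, 120, 120]]
# A single ground truth box.
ground_truth_box = [10, 10, 50, 50]

# Either accepted format works for actual_output:
actual_output = json.dumps(predicted_boxes)                 # JSON array form
# actual_output = json.dumps({"bboxes": predicted_boxes})  # JSON object form

# Likewise for expected_output:
expected_output = json.dumps(ground_truth_box)              # JSON array form
# expected_output = json.dumps({"bbox": ground_truth_box}) # JSON object form (singular key)
```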

How Is It Calculated?

  1. The ground truth box is compared against each predicted bounding box.
  2. For each comparison, the IoU is calculated as:
$$\text{IoU} = \frac{\text{Area of Intersection}}{\text{Area of Union}}$$
  3. The maximum IoU value across all comparisons is returned as the final score.
This method is particularly useful when the prediction contains multiple bounding boxes for a single answer (e.g., one box per word in a multi-word phrase).
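
The following is a minimal Python sketch of this procedure, given as an illustrative reimplementation of the formula above rather than Galtea's internal code:

```python
def iou(box_a, box_b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    # Intersection rectangle (width/height clamped at 0 when boxes don't overlap).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    intersection = max(0, ix2 - ix1) * max(0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0

def iou_score(predicted_boxes, ground_truth_box):
    """Maximum IoU between the ground truth box and any predicted box."""
    return max((iou(box, ground_truth_box) for box in predicted_boxes), default=0.0)

# Example: the first predicted box matches the ground truth exactly, so the score is 1.0.
print(iou_score([[10, 10, 50, 50], [100, 100, 120, 120]], [10, 10, 50, 50]))
```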

Interpretation of Scores

  • ≥ 0.7 – Strong spatial alignment.
  • 0.4–0.7 – Moderate overlap; may require refinement.
  • < 0.4 – Poor alignment; predicted box diverges from reference.
Note: These thresholds may be adjusted based on task-specific precision requirements.

Suggested Test Case Types

Use IoU when evaluating:
  • Layout-aware predictions, such as bounding boxes in OCR or form extraction.
  • Visual document understanding, where spatial positioning is essential.
  • Object or text region detection in image or PDF-based tasks.