
What is Human Evaluation?

Human Evaluation is a Galtea feature that lets your team manually review and score AI outputs. Instead of being scored by an LLM judge, evaluations enter PENDING_HUMAN status and wait for a human annotator to submit a score through the dashboard. It is especially useful when you need subjective judgment, domain expertise, or a human-in-the-loop quality gate that automated scoring cannot provide.
Human Evaluation requires at least one user group with assigned members. User groups control which annotators can review evaluations for a given metric. You can manage groups in the Groups tab of your organization, or via the SDK’s galtea.user_groups service.
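To make the gating rule concrete, here is a minimal in-memory sketch of how user groups control annotation access. The real `galtea.user_groups` service and the dashboard handle this for you; every class and method name below is an assumption for illustration, not the documented SDK surface.

```python
class UserGroup:
    """Illustrative stand-in for an annotator group (not the real SDK class)."""
    def __init__(self, name):
        self.name = name
        self.members = set()

    def add_member(self, user_email):
        self.members.add(user_email)


def can_annotate(metric_groups, user_email):
    """A user may annotate a metric's evaluations only if they belong
    to at least one user group linked to that metric."""
    return any(user_email in g.members for g in metric_groups)


# Usage: a metric linked to one group with a single member.
reviewers = UserGroup("medical-reviewers")
reviewers.add_member("ana@example.com")

print(can_annotate([reviewers], "ana@example.com"))  # member of a linked group
print(can_annotate([reviewers], "bob@example.com"))  # not in any linked group
```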

How It Works

You create a metric with the Human Evaluation type and assign one or more user groups. When evaluations run against that metric, the platform sets their status to PENDING_HUMAN instead of invoking an LLM judge. Annotators in the linked user groups then review each output in the dashboard, submit a score with an optional reason, and the evaluation moves to completed status.
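The lifecycle above can be sketched as a small state machine. PENDING_HUMAN comes from the source; the terminal status name and the method shown are illustrative assumptions:

```python
class Evaluation:
    """Illustrative model of the human-evaluation lifecycle."""
    def __init__(self):
        # A Human Evaluation metric skips the LLM judge entirely:
        # the evaluation is created directly in PENDING_HUMAN.
        self.status = "PENDING_HUMAN"
        self.score = None
        self.reason = None

    def submit_annotation(self, score, reason=None):
        """Models an annotator scoring the output in the dashboard."""
        if self.status != "PENDING_HUMAN":
            raise ValueError("evaluation is not awaiting human review")
        self.score = score
        self.reason = reason          # the reason is optional
        self.status = "COMPLETED"     # illustrative name for the completed state


ev = Evaluation()
ev.submit_annotation(0.8, reason="Accurate but slightly verbose")
print(ev.status, ev.score)  # COMPLETED 0.8
```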

How to Use Human Evaluation

Step 1: Create a Human Evaluation metric

Go to the Metrics creation form and configure the following:
  • Evaluation Type — Select Human Evaluation so evaluations are routed to human annotators instead of an LLM judge.
  • User Groups — Assign one or more user groups to control which users can annotate evaluations for this metric. A user group organizes evaluators within your organization — only members of the linked groups will see pending evaluations in their dashboard. If you don’t have a group yet, you can create one directly from the metric form.
  • Evaluation Parameters — Choose which parameters (Input, Expected Output, Actual Output, etc.) annotators will see when reviewing.
  • Evaluation Guidelines — Write the criteria and instructions annotators should follow when scoring. You can also select a template as a starting point.
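The form fields above map naturally onto a small config object. This sketch simply mirrors the form; the class name, field names, and example values are assumptions for illustration, not the SDK's actual metric model:

```python
from dataclasses import dataclass


@dataclass
class HumanEvaluationMetric:
    """Illustrative mirror of the metric creation form."""
    name: str
    user_groups: list        # at least one group must be assigned
    evaluation_params: list  # which parameters annotators see when reviewing
    guidelines: str          # scoring criteria and instructions for annotators

    def __post_init__(self):
        if not self.user_groups:
            raise ValueError("Human Evaluation requires at least one user group")


metric = HumanEvaluationMetric(
    name="clinical-accuracy",
    user_groups=["medical-reviewers"],
    evaluation_params=["Input", "Expected Output", "Actual Output"],
    guidelines="Score 1 if the answer is clinically accurate, 0 otherwise.",
)
```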
Step 2: Run Evaluations

Run evaluations using the SDK or from the dashboard just like any other metric. The only difference is that each evaluation will enter PENDING_HUMAN status instead of being scored by an LLM. See Run Test-Based Evaluations, Evaluating Conversations, or Direct Inferences and Evaluations from the Platform for step-by-step instructions.
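From the caller's side, the only observable difference is the resulting status. The client below is a stub standing in for the real SDK (the class and method names are assumptions), included only to show the shape of the flow:

```python
class StubGalteaClient:
    """Stand-in for the real SDK client, for illustration only."""

    def create_evaluation(self, metric_name, actual_output):
        # For a Human Evaluation metric, no LLM judge is invoked;
        # the evaluation is created waiting for an annotator.
        return {
            "metric": metric_name,
            "output": actual_output,
            "status": "PENDING_HUMAN",
        }


client = StubGalteaClient()
evaluation = client.create_evaluation("clinical-accuracy", "Take 500 mg twice daily.")
print(evaluation["status"])  # PENDING_HUMAN
```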
Step 3: Annotate in the Dashboard

Navigate to the Human Evaluations page in the sidebar. Click Start Evaluating to open the annotation dialog. For each evaluation, review the conversation and context, submit a score, and optionally add a reason.

(Screenshots: Human Evaluations pending view; annotation dialog)

See Also