What is Human Evaluation?
Human Evaluation is a Galtea feature that lets your team manually review and score AI outputs. Instead of being scored by an LLM judge, evaluations enter a PENDING_HUMAN status and wait for a human annotator to submit a score through the dashboard.
It is especially useful when you need subjective judgment, domain expertise, or a human-in-the-loop quality gate that automated scoring cannot provide.
Human Evaluation requires at least one user group with assigned members. User groups control which annotators can review evaluations for a given metric. You can manage groups in the Groups tab of your organization, or via the SDK's galtea.user_groups service.

How It Works
You create a metric with the Human Evaluation type and assign one or more user groups. When evaluations run against that metric, the platform sets their status to PENDING_HUMAN instead of invoking an LLM judge. Annotators in the linked user groups then review each output in the dashboard and submit a score with an optional reason, after which the evaluation moves to completed status.
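The lifecycle above can be sketched as a small state model. This is a minimal illustration of the described flow (PENDING_HUMAN until an annotator submits a score with an optional reason, then completed), not the Galtea SDK's actual classes; all names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Status values described in the docs; the real platform manages these server-side.
PENDING_HUMAN = "PENDING_HUMAN"
COMPLETED = "COMPLETED"

@dataclass
class HumanEvaluation:
    """Hypothetical model of one evaluation awaiting human review."""
    metric: str
    output: str
    status: str = PENDING_HUMAN      # set instead of invoking an LLM judge
    score: Optional[float] = None
    reason: Optional[str] = None

    def submit_score(self, score: float, reason: Optional[str] = None) -> None:
        """An annotator submits a score (with an optional reason) via the dashboard."""
        if self.status != PENDING_HUMAN:
            raise ValueError("evaluation is not awaiting human review")
        self.score = score
        self.reason = reason
        self.status = COMPLETED

ev = HumanEvaluation(metric="helpfulness", output="The capital of France is Paris.")
ev.submit_score(0.9, reason="Correct and concise")
print(ev.status)  # COMPLETED
```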
How to Use Human Evaluation
Create a Human Evaluation metric
Go to the Metrics creation form and configure the following:
- Evaluation Type — Select Human Evaluation so evaluations are routed to human annotators instead of an LLM judge.
- User Groups — Assign one or more user groups to control which users can annotate evaluations for this metric. A user group organizes evaluators within your organization — only members of the linked groups will see pending evaluations in their dashboard. If you don’t have a group yet, you can create one directly from the metric form.
- Evaluation Parameters — Choose which parameters (Input, Expected Output, Actual Output, etc.) annotators will see when reviewing.
- Evaluation Guidelines — Write the criteria and instructions annotators should follow when scoring. You can also select a template as a starting point.
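The configuration above can be summarized as a simple data structure. This is an illustrative sketch of the form's fields, not the Galtea API; the class and field names are assumptions, but the rule it enforces (Human Evaluation needs at least one user group) comes from the docs.

```python
from dataclasses import dataclass

@dataclass
class HumanEvaluationMetric:
    """Hypothetical representation of the metric-creation form fields."""
    name: str
    evaluation_type: str               # "Human Evaluation" routes to annotators, not an LLM judge
    user_groups: list[str]             # only members of these groups see pending evaluations
    evaluation_parameters: list[str]   # what annotators see when reviewing (Input, Actual Output, ...)
    guidelines: str                    # scoring criteria and instructions

    def __post_init__(self) -> None:
        # The platform requires at least one user group for Human Evaluation metrics.
        if self.evaluation_type == "Human Evaluation" and not self.user_groups:
            raise ValueError("Human Evaluation requires at least one user group")

metric = HumanEvaluationMetric(
    name="clinical-accuracy",
    evaluation_type="Human Evaluation",
    user_groups=["medical-reviewers"],
    evaluation_parameters=["Input", "Expected Output", "Actual Output"],
    guidelines="Score 1 if the answer is clinically accurate, 0 otherwise.",
)
```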

Run Evaluations
Run evaluations using the SDK or from the dashboard, just like any other metric. The only difference is that each evaluation will enter PENDING_HUMAN status instead of being scored by an LLM. See Run Test-Based Evaluations, Evaluating Conversations, or Direct Inferences and Evaluations from the Platform for step-by-step instructions.

See Also
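To make the routing concrete, here is a sketch of the selection rule the dashboard applies: an annotator sees an evaluation only if it is PENDING_HUMAN and its metric is linked to one of the annotator's user groups. The platform does this server-side; this helper and its data shapes are illustrative assumptions, not Galtea code.

```python
def pending_for_annotator(evaluations, metric_groups, user_groups):
    """Return pending evaluations the annotator is allowed to score.

    evaluations:   list of dicts with "id", "metric", and "status" keys (hypothetical shape)
    metric_groups: mapping of metric name -> set of linked user-group names
    user_groups:   set of group names the annotator belongs to
    """
    return [
        ev for ev in evaluations
        if ev["status"] == "PENDING_HUMAN"
        and metric_groups.get(ev["metric"], set()) & user_groups  # group overlap required
    ]

evals = [
    {"id": 1, "metric": "tone", "status": "PENDING_HUMAN"},
    {"id": 2, "metric": "tone", "status": "COMPLETED"},
    {"id": 3, "metric": "safety", "status": "PENDING_HUMAN"},
]
groups = {"tone": {"support-team"}, "safety": {"trust-and-safety"}}
print(pending_for_annotator(evals, groups, {"support-team"}))  # only evaluation 1: still pending, group matches
```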
- Metrics — Learn about metrics and evaluation types in Galtea
- User Groups — Manage who can annotate human evaluations
- Evaluations — Understand how evaluations work
- Run Test-Based Evaluations — Run evaluations for single-turn workflows
- Evaluating Conversations — Evaluate multi-turn conversations
- Direct Inferences and Evaluations from the Platform — Run evaluations from the dashboard without SDK code

