Evaluation Parameters
To compute thecustom_judge
metric, the following parameters can be considered, depending on your needs:
input
: The original user message.expected_output
: The ideal or target answer your product should have generated.actual_output
: The response actually produced by the model.context
: Any additional internal context passed to the model (e.g. memory, conversation history).retrieval_context
: External content retrieved during inference, such as from a vector store or knowledge base.product_description
: A high-level overview of what the product is.product_capabilities
: What the product is designed to do.product_inabilities
: What the product cannot or should not do.product_security_boundaries
: Sensitive actions or behaviors that the model must not perform, even if technically capable.
How Is It Calculated?
The Custom Judge evaluation process includes:-
Prompt Construction: You define the evaluation logic by writing your own
judge_prompt
. This prompt can include contextual information, specific behavioral rubrics, product constraints, and any other evaluation criteria relevant to your use case. Galtea fills in the prompt using the parameters you provide at runtime. -
Score and Justification: Based on your custom prompt, Galtea returns a structured output containing:
Example ‘judge_prompt’ template
Suggested Test Case Types
The Custom Judge metric is best suited for:- Behavioral Evaluation: Ensuring model behavior aligns with defined product guidelines or safety constraints.
- Security Testing: Validating that responses do not cross ethical, privacy, or safety boundaries.
- Policy Adherence: Checking compliance with brand tone, content rules, or moderation policies.
- Retrieval Use: Evaluating whether the model made appropriate use of retrieved context.