- Deterministic Metrics: When you have custom, rule-based logic to score outputs (e.g., checking for specific keywords, validating JSON structure).
- Integrating External Models: When you use your own models for evaluation and want to store their scores in Galtea.
Recommended Approach: MetricInput Dictionary
The recommended way to specify metrics in SDK v3.0 is the `MetricInput` dictionary format. For self-hosted metrics, you have two equally valid options for providing scores:
Option 1: Pre-Compute the Score
If you want to calculate the score yourself before creating the evaluation, you can provide the score directly as a float.
Option 2: Use CustomScoreEvaluationMetric Class
If you prefer to encapsulate your scoring logic in a class that is executed dynamically, you can use the `CustomScoreEvaluationMetric` class within the `MetricInput` dictionary.
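A minimal sketch of Option 2. The stub base class below only mimics the interface described in this guide; the real `CustomScoreEvaluationMetric` import path and `measure()` signature should be checked against the Galtea SDK reference, and everything here is illustrative:

```python
# Illustrative stub only: this is NOT the real Galtea class, just the
# shape this guide describes (a name plus a measure() method).
class CustomScoreEvaluationMetric:
    def __init__(self, name: str):
        self.name = name

    def measure(self, inference_results, **kwargs) -> float:
        raise NotImplementedError


class KeywordCoverageMetric(CustomScoreEvaluationMetric):
    """Deterministic metric: fraction of required keywords in the output."""

    REQUIRED = ("refund", "policy", "30 days")

    def measure(self, inference_results, **kwargs) -> float:
        # The real SDK passes InferenceResult objects; plain strings stand
        # in for the model outputs in this sketch.
        text = " ".join(inference_results).lower()
        hits = sum(1 for kw in self.REQUIRED if kw in text)
        return hits / len(self.REQUIRED)


# The metric identifier goes on the instance; the MetricInput dictionary
# itself carries no id or name when a class is used as the score.
metric_input = {"score": KeywordCoverageMetric(name="keyword-coverage")}
```

Subclassing keeps the scoring rule reusable: the same `KeywordCoverageMetric` instance can be passed into any number of evaluations.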
When using `CustomScoreEvaluationMetric` as the `score` field in a `MetricInput` dictionary, do NOT provide `id` or `name` in the dictionary itself. The metric identifier must be specified when initializing the `CustomScoreEvaluationMetric` instance (e.g., `CustomScoreEvaluationMetric(name="my-metric")`).
Choosing Between Options
Both approaches are equally valid and current. Choose based on your preference:
Use Option 1 (Pre-Computed Score) if:
- You prefer a simpler, more declarative style
- Your scoring logic is straightforward and doesn’t require encapsulation
- You want to separate score calculation from the evaluation submission
Use Option 2 (CustomScoreEvaluationMetric Class) if:
- You prefer object-oriented design
- Your scoring logic is complex and benefits from encapsulation
- You want the SDK to handle score calculation automatically
- You need to reuse the same metric logic across multiple evaluations
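As a concrete counterpart, Option 1's pre-computed approach can be sketched with a deterministic rule (JSON-structure validation, one of the use cases listed at the top of this page). The `"name"` key and the overall dictionary shape are assumptions to verify against the SDK reference for your version:

```python
import json


def json_validity_score(output: str) -> float:
    """Deterministic rule-based score: 1.0 if the output parses as a
    JSON object, 0.0 otherwise."""
    try:
        return 1.0 if isinstance(json.loads(output), dict) else 0.0
    except json.JSONDecodeError:
        return 0.0


# Pre-compute the score yourself, then pass it as a plain float in the
# MetricInput dictionary (field names here are illustrative).
output = '{"status": "ok", "items": []}'
metric_input = {"name": "json-structure", "score": json_validity_score(output)}
```

Because the score is just a float by the time it reaches the dictionary, the calculation can live anywhere in your pipeline, which is the separation-of-concerns benefit listed above.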
Multi-Turn Custom Metrics
When using `CustomScoreEvaluationMetric`, your `measure()` method always receives an `inference_results` parameter containing `InferenceResult` objects. For session evaluations (`session_id=...`) this includes all turns; for single inference result evaluations (`inference_result_id=...`) it contains one item. This enables conversation-level scoring such as consistency checks, cross-turn analysis, or aggregated metrics.
The `inference_results` parameter is always a list ordered chronologically. For single-turn evaluations (via `inference_result_id`), it contains one item. For session evaluations, it contains all turns. See Evaluating Conversations for the full session-based workflow.
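A conversation-level consistency check can be sketched as follows. The `InferenceResult` stand-in below is a local stub that only assumes `input` and `output` fields; the real SDK object's attribute names may differ:

```python
from dataclasses import dataclass


@dataclass
class InferenceResult:
    """Local stand-in for the SDK's InferenceResult (fields assumed)."""
    input: str
    output: str


def consistency_score(inference_results: list[InferenceResult]) -> float:
    """Cross-turn metric sketch: fraction of adjacent turn pairs whose
    outputs agree on a simple yes/no polarity check."""

    def polarity(text: str):
        t = text.lower()
        if "yes" in t and "no" not in t:
            return "yes"
        if "no" in t and "yes" not in t:
            return "no"
        return None  # no clear polarity

    pairs = list(zip(inference_results, inference_results[1:]))
    if not pairs:
        return 1.0  # a single turn is trivially consistent
    consistent = sum(
        1
        for a, b in pairs
        if polarity(a.output) is None
        or polarity(b.output) is None
        or polarity(a.output) == polarity(b.output)
    )
    return consistent / len(pairs)
```

Because the list arrives in chronological order, adjacent-pair comparisons like this map directly onto consecutive turns of the conversation; a `measure()` implementation in the class form above would simply wrap this logic.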