A practical guide to evaluating your agentic system: use Galtea’s tracing feature to capture your agent’s traces, then create a custom metric that targets the specific behavior you care about.
Use the @trace decorator to capture every operation your agent performs. Once captured, traces become an evaluation parameter called traces: the full execution path from user input to system output.

Galtea provides built-in metrics like Tool Correctness, but agentic systems are often highly specific. We recommend creating custom judge metrics that check the exact behaviors you care about. Below are practical examples to get you started.
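Before the examples, it can help to picture what a captured trace might contain. The sketch below is a hypothetical stand-in written for illustration only; it is not Galtea's @trace implementation, and the names `record_trace`, `TRACES`, `tool_x`, `tool_y`, and the step fields are all assumptions.

```python
import functools

TRACES = []  # hypothetical in-memory trace store, for illustration only


def record_trace(func):
    """Record the name, inputs, and output of each call (not Galtea's @trace)."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        # Append one step per completed call, in completion order.
        TRACES.append(
            {"name": func.__name__, "args": args, "kwargs": kwargs, "output": result}
        )
        return result

    return wrapper


@record_trace
def tool_x(query):
    # Hypothetical tool: returns a structured output containing field_d.
    return {"field_d": query.upper()}


@record_trace
def tool_y(field_d):
    # Hypothetical tool: consumes field_d from tool_x.
    return f"processed {field_d}"


def run_agent(user_input):
    x_out = tool_x(user_input)
    return tool_y(x_out["field_d"])


answer = run_agent("hello")
```

After `run_agent` returns, `TRACES` holds one entry per tool call in the order the calls completed, which is the kind of execution path a judge metric would inspect.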
Example 1: Is Tool X called before Tool Y?
Example of Judge Prompt:
**Evaluation Criteria:**
Check if the TRACES reflect correct agent behavior by comparing them to what was expected. Focus on:
1. Whether Tool X is invoked at least once in the traces
2. Whether Tool Y is invoked only after Tool X
3. Whether the ordering of tool calls follows the intended system logic

**Rubric:**
Score 1 (Good): Tool X is called before Tool Y, and Tool Y is never called first in the traces.
Score 0 (Bad): Tool Y is called before Tool X, or Tool X is never called.
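For intuition, the ordering rubric above can also be expressed as a deterministic check. This is a minimal sketch, assuming the trace has already been reduced to an ordered list of tool names; that representation and the names `tool_x` and `tool_y` are hypothetical, not Galtea's trace schema. It also assumes a trace where Tool X is called and Tool Y never appears counts as passing, which the rubric does not state explicitly.

```python
def tool_order_score(tool_calls, first="tool_x", second="tool_y"):
    """Return 1 if `first` is called and `second` never precedes it, else 0."""
    if first not in tool_calls:
        return 0  # Tool X is never called
    first_index = tool_calls.index(first)
    # Fail if the earliest call to `second` comes before the earliest `first`.
    if second in tool_calls and tool_calls.index(second) < first_index:
        return 0
    return 1
```

A check like this only covers the literal ordering; whether the ordering "follows the intended system logic" is the part that still calls for a judge.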
Example 2: Were all attributes collected before calling Tool X?
Example of Judge Prompt:
**Evaluation Criteria:**
Check if the TRACES reflect correct agent behavior by comparing them to what was expected. Focus on:
1. Whether attribute_a, attribute_b, and attribute_c appear in the traces before Tool X is called
2. Whether these attributes are explicitly present in the traces
3. Whether the attributes are derived from user input or tool outputs rather than invented

**Rubric:**
Score 1 (Good): All required attributes are explicitly present and correctly sourced before Tool X is invoked.
Score 0 (Bad): Any attribute is missing, implicit, or invented before calling Tool X.
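The first two criteria above can be approximated in code. The following is a sketch under stated assumptions: each trace step is a dict with a `type` (`user_input`, `tool_call`, or `tool_output`), an optional `name`, and a `data` dict. This step layout is hypothetical and not Galtea's trace schema. The sketch also treats a trace in which Tool X is never called as a failure, which is a choice made here rather than something the rubric specifies.

```python
REQUIRED = ("attribute_a", "attribute_b", "attribute_c")


def attributes_collected_score(steps, target_tool="tool_x", required=REQUIRED):
    """Return 1 if every required attribute appears before the first target call."""
    seen = set()
    for step in steps:
        if step["type"] == "tool_call" and step["name"] == target_tool:
            # Decide at the first call to the target tool.
            return 1 if seen.issuperset(required) else 0
        if step["type"] in ("user_input", "tool_output"):
            # Only user input and tool outputs count as legitimate sources.
            seen.update(step.get("data", {}))
    return 0  # assumption: target tool never called is scored as a failure
```

Because only `user_input` and `tool_output` steps feed the `seen` set, an attribute that first appears in the tool call itself is treated as missing, a rough proxy for "invented".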
Example 3: Do attributes come from the correct sources?
Example of Judge Prompt:
**Evaluation Criteria:**
Check if the TRACES reflect correct agent behavior by comparing them to what was expected. Focus on:
1. Whether attribute_b originates from Tool Y output rather than user input
2. Whether attribute_c is recomputed after Tool Z when Tool Z is called
3. Whether attributes are overridden only when justified by a tool output

**Rubric:**
Score 1 (Good): Attributes in the traces originate from the correct tools and are updated appropriately.
Score 0 (Bad): Any attribute is sourced incorrectly, reused improperly, or overridden without justification.
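The first criterion above (where an attribute originates) lends itself to a simple provenance lookup. This sketch assumes the same hypothetical step layout of dicts with `type`, optional `name`, and `data`; it is not Galtea's trace schema, and it does not attempt the recomputation or override criteria.

```python
def first_source(steps, attribute):
    """Return where `attribute` first appears: a tool name, a step type, or None."""
    for step in steps:
        if attribute in step.get("data", {}):
            # Tool outputs are labeled by tool name; other steps by their type.
            return step["name"] if step["type"] == "tool_output" else step["type"]
    return None


def attribute_source_score(steps, expected_sources):
    """Return 1 if every attribute first appears at its expected source, else 0."""
    return int(
        all(first_source(steps, attr) == src for attr, src in expected_sources.items())
    )
```

For example, `attribute_source_score(steps, {"attribute_b": "tool_y"})` passes only when attribute_b first shows up in Tool Y's output rather than in the user input.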
Example 4: Are tools used under required conditions?
Example of Judge Prompt:
**Evaluation Criteria:**
Check if the TRACES reflect correct agent behavior by comparing them to what was expected. Focus on:
1. Whether Tool Z is called only when attribute_a exceeds the required threshold
2. Whether Tool X is skipped when attribute_c is false
3. Whether conditional branching in the traces matches tool results

**Rubric:**
Score 1 (Good): Tools are invoked or skipped strictly according to the required conditions.
Score 0 (Bad): Any tool is called or skipped in violation of the defined conditions.
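The two concrete conditions above can be written as plain boolean checks. This is a minimal sketch, assuming the attribute values and the list of called tool names have already been extracted from the trace; the function name, the tool names, and the default threshold of 10 are hypothetical placeholders.

```python
def conditional_use_score(attribute_a, attribute_c, tool_calls, threshold=10):
    """Return 1 if tool usage respects both conditions, else 0."""
    z_called = "tool_z" in tool_calls
    x_called = "tool_x" in tool_calls
    # Tool Z may only be called when attribute_a exceeds the threshold.
    if z_called and not attribute_a > threshold:
        return 0
    # Tool X must be skipped when attribute_c is false.
    if x_called and attribute_c is False:
        return 0
    return 1
```

Note that this only penalizes calls that violate a condition. It does not require Tool Z to be called when the threshold is exceeded, since the criteria say "only when", not "whenever".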
Example 5: Ensure tool outputs are actually used.
Example of Judge Prompt:
**Evaluation Criteria:**
Check if the TRACES reflect correct agent behavior by comparing them to what was expected. Focus on:
1. Whether field_d from Tool X output is referenced later in the traces
2. Whether Tool Y output influences subsequent reasoning or tool calls
3. Whether tool outputs are respected and not contradicted later in the traces

**Rubric:**
Score 1 (Good): Tool outputs are explicitly used and consistently reflected in later steps.
Score 0 (Bad): Tool outputs are ignored, unused, or contradicted.
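A rough literal version of the first criterion above checks whether the value of field_d is passed to any later tool call. This sketch assumes the hypothetical step layout of dicts with `type`, `name`, and `data`; it is not Galtea's trace schema. It can only detect exact value reuse, so paraphrased use or contradiction in later reasoning would go unnoticed.

```python
def output_used_score(steps, producer="tool_x", field="field_d"):
    """Return 1 if the producer's field value is passed to a later tool call."""
    value = None
    produced = False
    for step in steps:
        is_producer_output = (
            step["type"] == "tool_output"
            and step["name"] == producer
            and field in step["data"]
        )
        if is_producer_output:
            value = step["data"][field]
            produced = True
        elif produced and step["type"] == "tool_call":
            # Exact-match check against the arguments of a later call.
            if value in step["data"].values():
                return 1
    return 0
```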