Evaluation Parameters
To compute the Tool Correctness metric, the following parameters are required:
- `tools_used`: The tools used by the agent. This evaluation parameter is automatically extracted from the `traces` parameter.
- `expected_tools`: The reference (or gold) list of tools to compare against.
How Is It Calculated?
Conceptually, Tool Correctness is computed in three steps:
Parse Inputs
Read `tools_used` (the tools the agent actually called) and `expected_tools` (the tools required by the test case). Typically, `expected_tools` is specified by the user when creating the test case, while `tools_used` is populated automatically from execution traces captured via Galtea's @trace decorator.
Normalize and Dedupe Tool Names
Normalize tool identifiers (e.g., sanitize formatting and strip equivalent prefixes) and deduplicate by canonical name so repeated calls do not affect the result.
Final Score
Tool Correctness returns a value of 0 or 1 in this implementation:
- 1 – The agent called all expected tools and no unnecessary ones (order and duplicates ignored).
- 0 – The agent missed at least one expected tool or called at least one extra tool.
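The steps above can be sketched in plain Python. This is an illustrative sketch, not Galtea's actual implementation: the `canonicalize` helper and the specific prefixes it strips are assumptions standing in for whatever normalization the metric applies.

```python
import re


def canonicalize(name: str) -> str:
    # Hypothetical normalization: trim whitespace, lowercase, and strip a
    # namespace-style prefix such as "tools." or "functions.".
    name = name.strip().lower()
    return re.sub(r"^(tools|functions)\.", "", name)


def tool_correctness(tools_used: list[str], expected_tools: list[str]) -> int:
    # Deduplicate by canonical name so repeated calls count once,
    # then compare as sets: call order is ignored.
    used = {canonicalize(t) for t in tools_used}
    expected = {canonicalize(t) for t in expected_tools}
    # Binary score: 1 only if no expected tool is missing and none is extra.
    return int(used == expected)


tool_correctness(["tools.search", "search", "calculator"],
                 ["calculator", "search"])              # -> 1 (order/duplicates ignored)
tool_correctness(["search"], ["search", "calculator"])  # -> 0 (missing tool)
tool_correctness(["search", "email"], ["search"])       # -> 0 (extra tool)
```

Comparing sets rather than lists is what makes the score insensitive to ordering and repetition while still failing on any missing or extra tool.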
Suggested Test Case Types
Use Tool Correctness when you want to verify tool-calling behavior, such as:
- Tool-Required Tasks where specific tools must be invoked.
- Agent Workflow Validation where using an extra tool is considered incorrect or risky.
- Regression Tests for agents to ensure they keep calling the same required tools after prompt/model changes.
- Safety/Policy-Gated Flows where unexpected tool usage should fail the evaluation.