Measures whether the agent called exactly the expected set of tools for a test case, penalizing both missing required tools and unnecessary ones.
The Tool Correctness metric is one of the Deterministic Metrics Galtea exposes to objectively verify that an agent used the correct tools for a test case. It is most appropriate when you expect the agent to call a specific set of tools (e.g., function/tool-calling workflows) and want to penalize both missing required tools and unnecessary ones.
Parse Inputs
Read tools_used (the tools the agent actually called) and expected_tools (the tools required by the test case). Typically, expected_tools is specified by the user when creating the test case, while tools_used is populated automatically from execution traces captured via Galtea’s @trace decorator.
Normalize and Dedupe Tool Names
Normalize tool identifiers (e.g., sanitize formatting and strip equivalent prefixes) and deduplicate by canonical name so repeated calls do not affect the result.
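As an illustration of this step, here is a minimal sketch of what normalization and deduplication could look like. The exact rules Galtea applies are not specified here, so the prefix list, the lowercasing, and the helper names `normalize_tool_name` and `canonical_tool_set` are all assumptions made for the example.

```python
import re

# Hypothetical prefixes treated as equivalent; the real list is an assumption.
ASSUMED_PREFIXES = ("functions.", "tools.")


def normalize_tool_name(name: str) -> str:
    """Return a canonical form of a tool name (illustrative rules only)."""
    canonical = name.strip().lower()
    # Strip at most one assumed prefix so "functions.search" matches "search".
    for prefix in ASSUMED_PREFIXES:
        if canonical.startswith(prefix):
            canonical = canonical[len(prefix):]
            break
    # Collapse runs of whitespace and hyphens into a single underscore.
    return re.sub(r"[\s\-]+", "_", canonical)


def canonical_tool_set(names: list[str]) -> set[str]:
    """Deduplicate by canonical name; a set discards repeats and order."""
    return {normalize_tool_name(n) for n in names}
```

Under these assumed rules, `canonical_tool_set(["functions.get_weather", "Get-Weather", " get weather "])` collapses to a single entry, so calling the same tool three times under slightly different spellings counts once.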
Final Score
Tool Correctness returns a value of 0 or 1 in this implementation:
1 – The agent called all expected tools and no unnecessary ones (order and duplicates ignored).
0 – The agent missed at least one expected tool or called at least one extra tool.
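The scoring rule amounts to a set-equality check. The sketch below is not Galtea's implementation; the function name `tool_correctness` is hypothetical, and it assumes the tool names passed in have already been normalized.

```python
def tool_correctness(tools_used: list[str], expected_tools: list[str]) -> int:
    """Return 1 if the set of called tools equals the expected set, else 0.

    Assumes names are already in canonical form. Converting to sets
    ignores call order and repeated calls.
    """
    used = set(tools_used)
    expected = set(expected_tools)
    missing = expected - used   # expected tools the agent never called
    extra = used - expected     # tools the agent called unnecessarily
    return 1 if not missing and not extra else 0
```

For example, `tool_correctness(["search", "book", "search"], ["book", "search"])` scores 1 because order and duplicates are ignored, whereas omitting `"book"` or additionally calling a tool such as `"cancel"` scores 0.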