Run an evaluation pipeline for a Version. This resolves all Specifications linked to the version’s product, collects their linked Metrics and Tests, and runs the evaluation.

Two Execution Modes

Endpoint connection (server-side). When no agent is provided, the evaluation is delegated to the server. The API calls your deployed HTTP endpoint for each test case, generates inference results, and evaluates them. This mode requires the version to have a conversation_endpoint_connection_id configured.

Local agent (SDK-side). When an agent is provided, the SDK runs the inference loop locally, invoking your agent for each test case and evaluating the results.

Returns

  • jobId — the inference batch job ID for tracking progress
  • testCaseCount — total number of test cases queued
  • message — a human-readable status message
  • specifications — summary of each specification evaluated

In local agent mode, the result also includes evaluations, the list of evaluations created by the SDK-side loop.

Examples

Endpoint connection (no agent):
    # Run evaluation using your deployed endpoint connection
    result = galtea.evaluations.run(version_id=version_id)
    print(f"Job {result['jobId']} queued {result['testCaseCount']} test cases")
    for spec in result["specifications"]:
        print(f"  Spec {spec['specificationId']}: {spec['testCount']} tests, {spec['metricCount']} metrics")
With specific specifications:
    # Evaluate only specific specifications
    result = galtea.evaluations.run(
        version_id=version_id,
        specification_ids=specification_ids,
    )
With a local agent:
    # Run evaluation with a local agent (SDK-side loop)
    def my_agent(user_message: str) -> str:
        # Replace with your actual agent logic
        return "Agent response"

    result = galtea.evaluations.run(
        version_id=version_id,
        agent=my_agent,
        specification_ids=specification_ids[:1],
    )
    print(f"Processed {result['testCaseCount']} test cases")
    print(f"Created {len(result['evaluations'])} evaluations")

Parameters

version_id
string
required
The ID of the version to evaluate.
agent
AgentType
The agent to execute locally. Accepts:
  • An Agent class instance with a call() method
  • An agent function (sync or async) with one of three signatures: (str) -> str, (list[dict]) -> str, or (AgentInput) -> AgentResponse
When omitted, the server-side endpoint connection pipeline is used.
specification_ids
List[str]
A list of Specification IDs to evaluate. When omitted, all specifications for the product that have linked metrics and tests are used.
Specifications without linked metrics or without linked tests are silently skipped.