Run an evaluation pipeline for a Version. This resolves all Specifications linked to the version’s product, collects their linked Metrics and Tests, and runs the evaluation.

Two Execution Modes

Endpoint connection (server-side). When no agent is provided, the evaluation is delegated to the server. The API calls your deployed HTTP endpoint for each test case, generates inference results, and evaluates them. This mode requires the version to have a conversation_endpoint_connection_id configured.

Local agent (SDK-side). When an agent is provided, the SDK runs the inference loop locally, invoking your agent for each test case and evaluating the results.

Returns

  • jobId — the inference batch job ID for tracking progress
  • testCaseCount — total number of test cases queued
  • message — a human-readable status message
  • specifications — summary of each specification evaluated

In local agent mode, the result also includes evaluations, the list of evaluations created by the SDK-side loop.

Examples

Endpoint connection (no agent):
    # Run evaluation using your deployed endpoint connection
    result = galtea.evaluations.run(version_id=version_id)
    print(f"Job {result['jobId']} queued {result['testCaseCount']} test cases")
    for spec in result["specifications"]:
        print(f"  Spec {spec['specificationId']}: {spec['testCount']} tests, {spec['metricCount']} metrics")
With specific specifications:
    # Evaluate only specific specifications
    result = galtea.evaluations.run(
        version_id=version_id,
        specification_ids=specification_ids,
    )
With a local agent:
    # Run evaluation with a local agent (SDK-side loop)
    def my_agent(user_message: str) -> str:
        # Replace with your actual agent logic
        return "Agent response"

    result = galtea.evaluations.run(
        version_id=version_id,
        agent=my_agent,
        specification_ids=specification_ids[:1],
    )
    print(f"Processed {result['testCaseCount']} test cases")
    print(f"Created {len(result['evaluations'])} evaluations")

Parameters

version_id
string
required
The ID of the version to evaluate.
agent
AgentType
The agent to execute locally. Accepts:
  • An Agent class instance with a call() method
  • An agent function (sync or async) with one of three signatures: (str) -> str, (list[dict]) -> str, or (AgentInput) -> AgentResponse
When omitted, the server-side endpoint connection pipeline is used.
specification_ids
List[str]
A list of Specification IDs to evaluate. When omitted, all specifications for the product that have linked metrics and tests are used.
Specifications without linked metrics or without linked tests are silently skipped.