To evaluate a product version against a predefined Test, you can loop through the Test's Test Cases and run an evaluation task for each one. This workflow is ideal for regression testing and standardized quality checks.

An Evaluation is created implicitly the first time you run an evaluation task for a specific version_id and test_id combination. You do not need to create it manually.
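
In code, this means there is no separate "create evaluation" call. A minimal sketch, assuming an already-initialized galtea client and placeholder IDs (the full example below shows the complete workflow):

# No explicit evaluation-creation step: the first task submitted for a given
# version_id + test_id pair implicitly creates the Evaluation; later tasks for
# the same pair are grouped under it.
galtea.evaluation_tasks.create_single_turn(
    version_id="your_version_id",
    test_case_id="your_test_case_id",
    metrics=["factual-accuracy"],
    actual_output="Your product's response to the test case input",
)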

Workflow

1. Select a Test and Version
   Identify the test_id and version_id you want to evaluate.

2. Iterate Through Test Cases
   Fetch all test cases associated with the test using galtea.test_cases.list().

3. Generate and Evaluate Output
   For each test case, call your product to get its output, then use galtea.evaluation_tasks.create_single_turn() to create an evaluation task.

Example

This example demonstrates how to run an evaluation on all test cases from a specific test.

from galtea import Galtea
import os

# Initialize Galtea SDK
galtea = Galtea(api_key=os.getenv("GALTEA_API_KEY"))

# --- Configuration ---
# Replace with your actual IDs
YOUR_VERSION_ID = "your_version_id"
YOUR_TEST_ID = "your_test_id"
METRICS_TO_EVALUATE = ["factual-accuracy", "coherence"] 

# --- Your Product's Logic (Placeholder) ---
# In a real scenario, this would call your actual AI model or API
def your_product_function(input_prompt: str, context: str | None = None) -> str:
    # Your model's inference logic goes here
    # For this example, we'll return a simple placeholder response
    return f"Model response to: {input_prompt}"

# 1. Fetch the test cases for the specified test
test_cases = galtea.test_cases.list(test_id=YOUR_TEST_ID)
print(f"Found {len(test_cases)} test cases for Test ID: {YOUR_TEST_ID}")

# 2. Loop through each test case and run an evaluation task
for test_case in test_cases:
    # Get your product's actual response to the input
    actual_output = your_product_function(test_case.input)
    
    # Create the evaluation task
    # An evaluation is created implicitly on the first call
    galtea.evaluation_tasks.create_single_turn(
        version_id=YOUR_VERSION_ID,
        test_case_id=test_case.id,
        metrics=METRICS_TO_EVALUATE,
        actual_output=actual_output,
    )

print(f"\nAll evaluation tasks submitted for Version ID: {YOUR_VERSION_ID}")

A session and an evaluation are automatically created behind the scenes to link this version_id and test_id with the provided inference result (the actual_output and the Test Case’s input).
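
When a test has many test cases, you may want a single failing call (for example, a transient network error) to be skipped rather than abort the whole run. The sketch below wraps the same loop in a small helper with per-case error handling. The helper name run_test_evaluation and the skip-on-failure behavior are illustrative assumptions, not part of the Galtea SDK; only the galtea.test_cases.list() and galtea.evaluation_tasks.create_single_turn() calls shown above are used.

from typing import Callable

def run_test_evaluation(
    galtea,
    version_id: str,
    test_id: str,
    metrics: list[str],
    product_fn: Callable[[str], str],
) -> None:
    # Hypothetical helper: submits one evaluation task per test case and keeps
    # going if an individual test case fails. Uses only the SDK calls from the
    # example above.
    test_cases = galtea.test_cases.list(test_id=test_id)
    failed = []

    for test_case in test_cases:
        try:
            # Generate your product's output for this test case's input
            actual_output = product_fn(test_case.input)

            # Submit the evaluation task; the evaluation is created implicitly
            # on the first call for this version_id + test_id pair
            galtea.evaluation_tasks.create_single_turn(
                version_id=version_id,
                test_case_id=test_case.id,
                metrics=metrics,
                actual_output=actual_output,
            )
        except Exception as exc:
            # Assumption: skipping and reporting failed cases is acceptable here
            failed.append((test_case.id, exc))
            print(f"Skipped test case {test_case.id}: {exc}")

    print(f"Submitted {len(test_cases) - len(failed)} of {len(test_cases)} evaluation tasks")

# Usage with the placeholders from the example above:
# run_test_evaluation(galtea, YOUR_VERSION_ID, YOUR_TEST_ID, METRICS_TO_EVALUATE, your_product_function)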