Migrating to Galtea SDK v3.0
Welcome to Galtea SDK v3.0! This version introduces major improvements, including a new session-based evaluation workflow and simplifications to the existing test-based workflow. This guide walks you through the changes needed to update your SDK v2.x integrations and introduces the new features available in v3.0.

Part 1: Migrating Your Existing Workflow (Test-Case-Based Evaluation)
The core workflow for running predefined tests against a version has been streamlined.

Key Changes for Test-Based Evaluations
- New Terminology: The old `EvaluationTask` entity is now simply called `Evaluation`. Similarly, `MetricType` is now called `Metric`. The concept of a parent `Evaluation` container has been removed to simplify the workflow.
- Simplified Workflow: The previous two-step process of creating an `Evaluation` container and then adding `EvaluationTask`s to it is now a single method call.
- New Method `create_single_turn()`: The old methods for creating tasks (e.g., `galtea.evaluation_tasks.create()`) are replaced by `galtea.evaluations.create_single_turn()`. This new method creates evaluations directly, without needing a pre-existing container. Similarly, the old `galtea.metric_types` service is now `galtea.metrics`.
- Updated Parameters: The `create_single_turn()` method now requires a `version_id` instead of an `evaluation_id`, as the direct link is now to the version being tested.
- New `MetricInput` Format: SDK v3.0 introduces a flexible dictionary format for specifying metrics. You can now pass metrics as dictionaries with optional `id`, `name`, and `score` fields. For self-hosted metrics, you can either provide pre-computed scores as floats or use `CustomScoreEvaluationMetric` instances for dynamic score calculation; both approaches are equally valid. The old formats (strings and top-level `CustomScoreEvaluationMetric` objects) are still supported for backward compatibility.
- Simplified Version Creation: The `galtea.versions.create()` method now accepts all properties as direct keyword arguments, removing the need for the `optional_props` dictionary.
- Added Sessions: The new `galtea.sessions.create()` method lets you group multiple inference results (conversation turns) under a single session, making it easier to track multi-turn interactions.
Migration Diff: Test-Case Workflow
Here’s a side-by-side comparison of a typical v2 script and its direct v3 equivalent.

❌ SDK v2.x (Old Way)
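A sketch of what a typical v2.x script looked like. The import path, the `test_cases.list()` helper, and the specific parameter values below are illustrative assumptions, not verified v2.x signatures; only the two-step container-plus-tasks structure is taken from the migration notes above.

```python
from galtea import Galtea  # assumed import path

galtea = Galtea(api_key="YOUR_API_KEY")

# Step 1 (v2.x): create the Evaluation container, now removed in v3.0.
evaluation = galtea.evaluations.create(
    product_id="YOUR_PRODUCT_ID",
    test_id="YOUR_TEST_ID",
)

# Step 2 (v2.x): add one EvaluationTask per test case to that container.
for test_case in galtea.test_cases.list(test_id="YOUR_TEST_ID"):
    galtea.evaluation_tasks.create(
        evaluation_id=evaluation.id,  # link to the container
        test_case_id=test_case.id,
        actual_output="<your model's answer for this test case>",
        metrics=["Role Adherence"],  # legacy string-based metric reference
    )
```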
✅ SDK v3.0 (New Way)
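The same flow in v3.0, sketched under the same assumptions (the import path, `test_cases.list()` helper, and field values are hypothetical; the method names, the `version_id` parameter, direct keyword arguments for versions, and the `MetricInput` dictionary format come from the changes listed above).

```python
from galtea import Galtea  # assumed import path

galtea = Galtea(api_key="YOUR_API_KEY")

# Version properties are now direct keyword arguments (no optional_props dict).
version = galtea.versions.create(
    product_id="YOUR_PRODUCT_ID",
    name="v1.1",
    description="Illustrative version description",
)

# No Evaluation container: create each evaluation directly against the version.
for test_case in galtea.test_cases.list(test_id="YOUR_TEST_ID"):
    galtea.evaluations.create_single_turn(
        version_id=version.id,  # replaces the old evaluation_id parameter
        test_case_id=test_case.id,
        actual_output="<your model's answer for this test case>",
        metrics=[{"name": "Role Adherence"}],  # new MetricInput dict format
    )
```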
This example demonstrates how to create single-turn evaluations for test cases. For production data logging, include the `input` parameter and set `is_production=True`. If you want to evaluate multi-turn conversations, refer to the new session-based workflow in Part 2.

Production Data Logging with Single-Turn Evaluations
For production monitoring, you can also use `create_single_turn()` without a test case. Here’s an example using the new recommended `MetricInput` dictionary format:
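A minimal sketch, assuming an initialized `galtea` client as in the earlier examples; the field values and the self-hosted metric name are hypothetical, while `is_production`, `input`, `version_id`, and the dictionary metric format come from the notes above.

```python
# Production logging: no test_case_id, so pass the real user input directly.
galtea.evaluations.create_single_turn(
    version_id="YOUR_VERSION_ID",
    is_production=True,  # mark this record as production traffic
    input="What is your refund policy?",
    actual_output="<your model's answer to the user>",
    metrics=[
        {"name": "Role Adherence"},              # Galtea-managed metric, by name
        {"name": "latency_check", "score": 0.9},  # self-hosted, pre-computed score (hypothetical metric)
    ],
)
```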
Summary of Actions for Migration
- In `galtea.versions.create()` calls, remove the `optional_props` dictionary and pass its contents as direct keyword arguments.
- Remove all calls to `galtea.evaluations.create()` that were used to create evaluation containers.
- Rename `galtea.evaluation_tasks.create()` calls to `galtea.evaluations.create_single_turn()`, remove the `evaluation_id` parameter, and add the `version_id` parameter.
- Replace all usages of `galtea.metric_types` with `galtea.metrics`. The method names within the service remain largely the same (e.g., `get`, `get_by_name`, `create`).
- Recommended: Update your metric specifications to use the new `MetricInput` dictionary format for better clarity and to leverage the ability to provide pre-computed scores for self-hosted metrics.
New MetricInput Format
SDK v3.0 introduces a powerful new way to specify metrics using dictionaries:

- Clearer intent: Explicitly specify whether you’re referencing a metric by name or by ID.
- Flexibility for self-hosted metrics: Choose between pre-computing scores or using `CustomScoreEvaluationMetric` for dynamic calculation; both are equally valid.
- Better maintainability: More declarative and structured than the legacy formats.

For self-hosted metrics, you have two options:

- Pre-compute scores: Calculate the score yourself and provide it as a float.
- Dynamic scoring: Use a `CustomScoreEvaluationMetric` instance to calculate scores on the fly.

The legacy formats (plain strings such as `"Role Adherence"` and top-level `CustomScoreEvaluationMetric` objects) continue to work for backward compatibility.
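The format can be sketched as plain dictionaries. The metric names and the ID below are made up for illustration; only the `id`, `name`, and `score` keys themselves come from the migration notes.

```python
# Hypothetical examples of the v3.0 MetricInput dictionary format.
metrics = [
    {"name": "Role Adherence"},               # Galtea-managed metric, referenced by name
    {"id": "metric_abc123"},                  # referenced by ID (hypothetical ID)
    {"name": "latency_check", "score": 0.92},  # self-hosted metric with a pre-computed score
]

# Legacy formats are still accepted for backward compatibility:
legacy_metrics = ["Role Adherence"]           # plain metric-name string
```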
Ensure all your code, including any examples you may be referencing from the SDK repository, is updated. The `optional_props` dictionary is fully deprecated in v3.0, and some examples in the codebase may not yet reflect the v3.0 changes.

Part 2: What’s New in v3.0 - The Session-Based Workflow
SDK v3.0 introduces a powerful new way to log and evaluate multi-turn conversations through Sessions and Inference Results. This approach is ideal for monitoring production traffic or evaluating complex, interactive scenarios.

New Concepts
- Session: A container for a sequence of interactions (a conversation) between a user and your AI product. You create a session using `galtea.sessions.create()`.
- Inference Result: A single turn within a session, containing the `input` and `output`. You log these using `galtea.inference_results.create()`.
- `evaluations.create()`: A new way to run evaluations on all inference results within a given session. This allows for easy batch evaluation of entire conversations.
Example of the New Session-Based Workflow
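A minimal sketch of the session-based flow, assuming an initialized `galtea` client as in the earlier examples. The method names (`sessions.create()`, `inference_results.create_batch()`, `evaluations.create()`) and the `is_production` flag come from this guide; the exact parameter names, especially the shape of the batched turns, are assumptions.

```python
# Step 1: open a session; omit test_case_id and set is_production=True
# to log real production traffic.
session = galtea.sessions.create(
    version_id="YOUR_VERSION_ID",
    is_production=True,
)

# Step 2: log every conversation turn in one batched call rather than
# calling create() once per turn.
galtea.inference_results.create_batch(
    session_id=session.id,
    conversation_turns=[  # hypothetical parameter name for the turn list
        {"input": "Hi, I need help with my order.",
         "output": "Of course. Could you share your order number?"},
        {"input": "It's 12345.",
         "output": "Thanks! Order 12345 ships tomorrow."},
    ],
)

# Step 3: evaluate all inference results in the session at once.
galtea.evaluations.create(
    session_id=session.id,
    metrics=[{"name": "Role Adherence"}],
)
```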
This workflow is entirely new and does not directly replace the test-based workflow.

For better performance with multiple conversation turns, always use `create_batch()` instead of calling `create()` in a loop. This reduces network overhead and improves response times.

Benefits of the New Workflow
- Track Full Conversations: Accurately log and analyze multi-turn user interactions.
- Production Monitoring: Easily send production data to Galtea for continuous evaluation by removing the `test_case_id` parameter when creating sessions and adding `is_production=True`.
- Batch Evaluation: Evaluate an entire conversation with a single command.