Migrating to Galtea SDK v3.0
Welcome to Galtea SDK v3.0! This version introduces major improvements, including a new session-based evaluation workflow and simplifications to the existing test-based workflow. This guide walks you through the changes needed to update your SDK v2.x integrations and introduces the new features available in v3.0.

Part 1: Migrating Your Existing Workflow (Test-Case-Based Evaluation)
The core workflow for running predefined tests against a version has been streamlined.

Key Changes for Test-Based Evaluations
- New Terminology: The old `EvaluationTask` entity is now simply called `Evaluation`. Similarly, `MetricType` is now called `Metric`. The concept of a parent `Evaluation` container has been removed to simplify the workflow.
- Simplified Workflow: The previous two-step process of creating an `Evaluation` container and then adding `EvaluationTask`s to it is now a single method call.
- New Method `create_single_turn()`: The old methods for creating tasks (e.g., `galtea.evaluation_tasks.create()`) are replaced by `galtea.evaluations.create_single_turn()`. This new method creates evaluations directly, without needing a pre-existing container. Similarly, the old `galtea.metric_types` service is now `galtea.metrics`.
- Updated Parameters: The `create_single_turn()` method now requires a `version_id` instead of an `evaluation_id`, as the direct link is now to the version being tested.
- New `MetricInput` Format: SDK v3.0 introduces a flexible dictionary format for specifying metrics. You can now pass metrics as dictionaries with optional `id`, `name`, and `score` fields. For self-hosted metrics, you can either provide pre-computed scores as floats or use `CustomScoreEvaluationMetric` instances for dynamic score calculation; both approaches are equally valid. The old formats (strings and top-level `CustomScoreEvaluationMetric` objects) are still supported for backward compatibility.
- Simplified Version Creation: The `galtea.versions.create()` method now accepts all properties as direct keyword arguments, removing the need for the `optional_props` dictionary.
- Added Sessions: The new `galtea.sessions.create()` method lets you group multiple inference results (conversation turns) under a single session, making it easier to track multi-turn interactions.
Migration Diff: Test-Case Workflow
Here’s a side-by-side comparison of a typical v2 script and its direct v3 equivalent.

❌ SDK v2.x (Old Way)
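A sketch of what a typical v2.x script looked like. The import path, the `test_cases.list()` helper, and the specific parameter values below are illustrative assumptions, not verified v2.x signatures; only the two-step container-plus-tasks structure is taken from the migration notes above.

```python
from galtea import Galtea  # assumed import path

galtea = Galtea(api_key="YOUR_API_KEY")

# Step 1 (v2.x): create the Evaluation container, now removed in v3.0.
evaluation = galtea.evaluations.create(
    product_id="YOUR_PRODUCT_ID",
    test_id="YOUR_TEST_ID",
)

# Step 2 (v2.x): add one EvaluationTask per test case to that container.
for test_case in galtea.test_cases.list(test_id="YOUR_TEST_ID"):
    galtea.evaluation_tasks.create(
        evaluation_id=evaluation.id,  # link to the container
        test_case_id=test_case.id,
        actual_output="<your model's answer for this test case>",
        metrics=["Role Adherence"],  # legacy string-based metric reference
    )
```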
✅ SDK v3.0 (New Way)
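The same flow in v3.0, sketched under the same assumptions (the import path, `test_cases.list()` helper, and field values are hypothetical; the method names, the `version_id` parameter, direct keyword arguments for versions, and the `MetricInput` dictionary format come from the changes listed above).

```python
from galtea import Galtea  # assumed import path

galtea = Galtea(api_key="YOUR_API_KEY")

# Version properties are now direct keyword arguments (no optional_props dict).
version = galtea.versions.create(
    product_id="YOUR_PRODUCT_ID",
    name="v1.1",
    description="Illustrative version description",
)

# No Evaluation container: create each evaluation directly against the version.
for test_case in galtea.test_cases.list(test_id="YOUR_TEST_ID"):
    galtea.evaluations.create_single_turn(
        version_id=version.id,  # replaces the old evaluation_id parameter
        test_case_id=test_case.id,
        actual_output="<your model's answer for this test case>",
        metrics=[{"name": "Role Adherence"}],  # new MetricInput dict format
    )
```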
This example demonstrates how to create single-turn evaluations for test cases. For production data logging, include the `input` parameter and set `is_production=True`. If you want to evaluate multi-turn conversations, refer to the new session-based workflow in Part 2.

Production Data Logging with Single-Turn Evaluations
For production monitoring, you can also use `create_single_turn()` without a test case. Here’s an example using the new recommended `MetricInput` dictionary format:
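A minimal sketch, assuming an initialized `galtea` client as in the earlier examples; the field values and the self-hosted metric name are hypothetical, while `is_production`, `input`, `version_id`, and the dictionary metric format come from the notes above.

```python
# Production logging: no test_case_id, so pass the real user input directly.
galtea.evaluations.create_single_turn(
    version_id="YOUR_VERSION_ID",
    is_production=True,  # mark this record as production traffic
    input="What is your refund policy?",
    actual_output="<your model's answer to the user>",
    metrics=[
        {"name": "Role Adherence"},              # Galtea-managed metric, by name
        {"name": "latency_check", "score": 0.9},  # self-hosted, pre-computed score (hypothetical metric)
    ],
)
```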
Summary of Actions for Migration
- In `galtea.versions.create()` calls, remove the `optional_props` dictionary and pass its contents as direct keyword arguments.
- Remove all calls to `galtea.evaluations.create()` that were used to create evaluation containers.
- Rename `galtea.evaluation_tasks.create()` calls to `galtea.evaluations.create_single_turn()`, remove the `evaluation_id` parameter, and add the `version_id` parameter.
- Replace all usages of `galtea.metric_types` with `galtea.metrics`. The method names within the service remain largely the same (e.g., `get`, `get_by_name`, `create`).
- Recommended: Update your metric specifications to use the new `MetricInput` dictionary format for better clarity and to leverage the ability to provide pre-computed scores for self-hosted metrics.
New MetricInput Format
SDK v3.0 introduces a powerful new way to specify metrics using dictionaries:

- Clearer intent: Explicitly specify whether you’re referencing a metric by name or by ID.
- Flexibility for self-hosted metrics: Choose between pre-computing scores or using `CustomScoreEvaluationMetric` for dynamic calculation; both are equally valid.
- Better maintainability: More declarative and structured than the legacy formats.

For self-hosted metrics, you have two options:

- Pre-compute scores: Calculate the score yourself and provide it as a float.
- Dynamic scoring: Use a `CustomScoreEvaluationMetric` instance to calculate scores on the fly.

The legacy formats (plain strings such as `"Role Adherence"` and top-level `CustomScoreEvaluationMetric` objects) continue to work for backward compatibility.
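The format can be sketched as plain dictionaries. The metric names and the ID below are made up for illustration; only the `id`, `name`, and `score` keys themselves come from the migration notes.

```python
# Hypothetical examples of the v3.0 MetricInput dictionary format.
metrics = [
    {"name": "Role Adherence"},               # Galtea-managed metric, referenced by name
    {"id": "metric_abc123"},                  # referenced by ID (hypothetical ID)
    {"name": "latency_check", "score": 0.92},  # self-hosted metric with a pre-computed score
]

# Legacy formats are still accepted for backward compatibility:
legacy_metrics = ["Role Adherence"]           # plain metric-name string
```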
Ensure all your code, including any examples you may be referencing from the SDK repository, is updated. The `optional_props` dictionary is fully deprecated in v3.0, and some examples in the codebase may not yet reflect the v3.0 changes.

Part 2: What’s New in v3.0 - The Session-Based Workflow
SDK v3.0 introduces a powerful new way to log and evaluate multi-turn conversations through Sessions and Inference Results. This approach is ideal for monitoring production traffic or evaluating complex, interactive scenarios.

New Concepts
- Session: A container for a sequence of interactions (a conversation) between a user and your AI product. You create a session using `galtea.sessions.create()`.
- Inference Result: A single turn within a session, containing the `input` and `output`. You log these using `galtea.inference_results.create()`.
- `evaluations.create()`: A new way to run evaluations on all inference results within a given session. This allows for easy batch evaluation of entire conversations.
Example of the New Session-Based Workflow
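A minimal sketch of the session-based flow, assuming an initialized `galtea` client as in the earlier examples. The method names (`sessions.create()`, `inference_results.create_batch()`, `evaluations.create()`) and the `is_production` flag come from this guide; the exact parameter names, especially the shape of the batched turns, are assumptions.

```python
# Step 1: open a session; omit test_case_id and set is_production=True
# to log real production traffic.
session = galtea.sessions.create(
    version_id="YOUR_VERSION_ID",
    is_production=True,
)

# Step 2: log every conversation turn in one batched call rather than
# calling create() once per turn.
galtea.inference_results.create_batch(
    session_id=session.id,
    conversation_turns=[  # hypothetical parameter name for the turn list
        {"input": "Hi, I need help with my order.",
         "output": "Of course. Could you share your order number?"},
        {"input": "It's 12345.",
         "output": "Thanks! Order 12345 ships tomorrow."},
    ],
)

# Step 3: evaluate all inference results in the session at once.
galtea.evaluations.create(
    session_id=session.id,
    metrics=[{"name": "Role Adherence"}],
)
```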
This workflow is entirely new and does not directly replace the test-based workflow.

For better performance with multiple conversation turns, always use `create_batch()` instead of calling `create()` in a loop. This reduces network overhead and improves response times.

Benefits of the New Workflow
- Track Full Conversations: Accurately log and analyze multi-turn user interactions.
- Production Monitoring: Easily send production data to Galtea for continuous evaluation by removing the `test_case_id` parameter when creating sessions and adding `is_production=True`.
- Batch Evaluation: Evaluate an entire conversation with a single command.