This guide will walk you through the essential steps to begin evaluating and monitoring the reliability of your AI products with Galtea.

Evaluation Workflow Overview

Evaluating AI products with Galtea follows this pattern:

1. Create a Product: Define what functionality or service you want to evaluate.

2. Register a Version: Document the specific implementation of your product.

3. Create a Test: Define a set of test cases to evaluate your product's performance.

4. Define Metrics: Select or create criteria to assess outputs.

5. Run Evaluations: Test your product and analyze the results.

Let’s go through each step in more detail.

1. Creating a Product

The first step in tracking the quality and reliability of your AI product is to create a product in the Galtea dashboard.

Navigate to Products > Create New Product and complete the product onboarding form. The product description is particularly important as it may be used during the generation of synthetic test data.

Products can only be created through the web platform, not the SDK. For detailed information about product properties, see the Product documentation.

2. Install the SDK and Connect

After creating your product, we recommend using the Galtea SDK to interact with the platform programmatically.

1. Get your API key: In the Galtea platform, navigate to Settings > Generate API Key.
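2. Install the SDK: Install the Galtea SDK into your Python environment. The package name below is an assumption based on the import name used in the examples that follow; check the installation docs for the exact command:

pip install galtea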

3. Connect to the platform: Using the Galtea SDK object, you can easily connect to the platform:

from galtea import Galtea

# Initialize with your API key
galtea = Galtea(api_key="YOUR_API_KEY")

# List your products to verify connection
products = galtea.products.list()

3. Register a Version

One of the key advantages of the Galtea platform is the ability to track and compare different versions of your AI product. A version captures the specific implementation details such as prompts, model parameters, or RAG configurations.

version = galtea.versions.create(
    name="v1.0",
    product_id="YOUR_PRODUCT_ID",
    optional_props={
        "description": "Initial version"
        # Add other metadata like system_prompt, endpoint, etc.
    }
)
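As a sketch of richer version metadata, drawing only on the fields suggested in the comment above (system_prompt and endpoint; the exact supported keys are described in the Version Service API), a more complete call might look like:

version = galtea.versions.create(
    name="v1.1",
    product_id="YOUR_PRODUCT_ID",
    optional_props={
        "description": "Adds a more detailed system prompt",
        "system_prompt": "You are a helpful assistant for ...",
        "endpoint": "https://api.example.com/chat",
    }
)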

The product's ID can be found on the product's page in the Galtea platform or by listing products with the SDK.
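For example, a small sketch of looking the ID up by listing products with the SDK (assuming listed products expose name and id attributes):

# Hypothetical lookup of a product ID by name; attribute names are assumptions
products = galtea.products.list()
my_product = next((p for p in products if p.name == "My Product"), None)
if my_product:
    print(my_product.id)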

You can create versions using either the SDK (as shown above) or directly through the Galtea platform dashboard.

Version Service API

Learn about all version properties and management capabilities

4. Create a Test

To compare the reliability of different versions, you need to subject each version to the same tests. Galtea supports two test types; the example below creates a QUALITY test.

You can either upload your own test file or have Galtea generate test cases from a knowledge base (ground truth) document:

test = galtea.tests.create(
    name="example-test",
    type="QUALITY",
    product_id="YOUR_PRODUCT_ID",
    test_file_path="path/to/your/test_file.csv"
    # Or use ground_truth_file_path="path/to/knowledge_file.pdf"
)
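If you upload your own test file, see the Create a Custom Test guide for the exact columns Galtea expects. Purely as a hypothetical illustration (these column names are assumptions based on the fields used later in this guide, not a documented schema), a quality test file could pair inputs with expected outputs and optional context:

input,expected_output,context
"What is your refund policy?","Refunds are accepted within 30 days of purchase.","FAQ excerpt describing the refund policy"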

Tests can be created using either the SDK (as shown above) or directly through the Galtea platform dashboard.

Create a Custom Test

See complete examples of creating and uploading tests

5. Define Metrics

Metrics in Galtea define the criteria by which your product’s outputs will be evaluated. These metric types can be reused across different evaluations.

metric = galtea.metrics.create(
    name="accuracy_v1",
    criteria="Determine whether the 'actual output' is equivalent to the 'expected output'."
    evaluation_params=["input", "expected output", "actual output"],
)

Metric types can be created using either the SDK (as shown above) or directly through the Galtea platform dashboard.

Metrics Service API

Learn about creating and managing evaluation metrics

6. Run Evaluations

Finally, you’re ready to launch an evaluation to assess how well your product version performs against the test cases.

For real evaluations, you’ll typically run your AI product on the input and context of each test case and then launch an Evaluation Task for each response.

The platform will asynchronously evaluate responses and make results available through the dashboard and the SDK. For more information, see evaluation tasks.

# Create an evaluation
evaluation = galtea.evaluations.create(
    test_id=test.id,
    version_id=version.id
)

# Load your test cases
test_cases = galtea.test_cases.list(test_id=test.id)

# Evaluate all test cases
for test_case in test_cases:
    # Your product's actual response to the input
    actual_output = your_product_function(test_case.input, test_case.context)
    
    # Run evaluation task
    galtea.evaluation_tasks.create(
        metrics=["accuracy_v1"],
        evaluation_id=evaluation.id,
        test_case_id=test_case.id,
        actual_output=actual_output,
    )
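Here, your_product_function is simply a stand-in for however you call your own system; it is not part of the Galtea SDK. A minimal sketch, assuming an OpenAI-style chat completions client, might look like:

from openai import OpenAI

client = OpenAI()

def your_product_function(input_text, context):
    # Build a prompt from the test case context and input, then call your model
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Answer using this context:\n{context}"},
            {"role": "user", "content": input_text},
        ],
    )
    return response.choices[0].message.content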

Evaluations can be created using either the SDK or the Galtea platform dashboard, but evaluation tasks can only be created through the SDK.

Run Evaluations

See complete examples of running and analyzing evaluations

7. View Results

You can view evaluation results through the SDK or on the Galtea platform:

# Retrieve evaluation tasks
tasks = galtea.evaluation_tasks.list(evaluation.id)
if tasks:
    result = galtea.evaluation_tasks.get(tasks[0].id)
    print(f"Score: {result.score}")
    print(f"Reason: {result.reason}")

For richer analysis and comparisons between versions, visit the Analytics section of your product in the Galtea platform.

Next Steps

Congratulations! You’ve completed your first evaluation with Galtea. For more detailed information, explore the rest of the documentation.

If you have any questions, contact us at support@galtea.ai.