So, you’ve built an amazing product – maybe an LLM, a RAG system, or another AI model – and you want to know just how good it truly is? That’s exactly where the Galtea platform comes in! We’re here to help you thoroughly test your product, uncover its strengths, and even spot those tricky edge cases you might not have considered. Let’s walk through the steps you’ll take to test your product with Galtea and understand why each part is so important.

Evaluation Workflow Overview

  1. Tell Us About Your Product: Provide details about your product so our models can create tailored testing content.
  2. Install SDK & Connect: Set up the Galtea Python SDK to interact with the platform.
  3. Version Your Product: Track different implementations to compare improvements over time.
  4. Create Your Tests: Build test datasets with input and ground truth pairs to challenge your product.
  5. Choose Your Metrics: Select from our metrics library or bring your own custom metrics.
  6. Run Evaluations: Test your product version with the selected test and metrics.
  7. See Your Results: Explore detailed insights and compare versions on the dashboard.

1. Tell Us About Your Product

To test your product effectively, our state-of-the-art LLMs and other models need to understand it inside and out. The more information you provide, the better equipped our models will be to create unique and highly specific testing content tailored just for your product.

Create a product in the Galtea dashboard. Navigate to Products > Create New Product and complete the form.
The product description is crucial – it powers our ability to generate synthetic test data that’s specific to your use case.
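If you prefer to script this lookup once the SDK is connected (next step), a minimal sketch like the one below finds the ID of the product you just created. The product name is a placeholder, and the .name attribute on product objects is an assumption based on the naming pattern used elsewhere in this guide.

# Minimal sketch: look up the product created in the dashboard by name.
# "My Financial RAG" is a placeholder; the .name attribute is assumed.
products = galtea.products.list()
my_product = next((p for p in products if p.name == "My Financial RAG"), None)
if my_product:
    YOUR_PRODUCT_ID = my_product.id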

2. Get Started with the SDK

Just like with many other AI tools, you have options! You can do everything through our intuitive dashboard GUI, or you can get programmatic with our Python SDK. For this quickstart, we’ll focus on using the Python SDK.
1. Get your API key

In the Galtea dashboard, navigate to Settings > Generate API Key and copy your key.

2. Install the SDK

pip install galtea

3. Connect to the platform

from galtea import Galtea

galtea = Galtea(api_key="YOUR_API_KEY")

products = galtea.products.list()
print(f"Found {len(products)} products.")
# Choose a product ID for the next steps
YOUR_PRODUCT_ID = products[0].id if products else "your_product_id"
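Hardcoding the key is fine for a quick test, but you will usually want to read it from the environment instead. The sketch below uses only the standard library and the same Galtea constructor shown above; the GALTEA_API_KEY variable name is simply a convention chosen for this example.

import os

from galtea import Galtea

# Read the API key from an environment variable instead of hardcoding it.
# GALTEA_API_KEY is an arbitrary name chosen for this example.
api_key = os.environ.get("GALTEA_API_KEY")
if not api_key:
    raise RuntimeError("Set GALTEA_API_KEY before running the quickstart.")

galtea = Galtea(api_key=api_key)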

3. Version Your Product

Imagine this: Galtea shows you that your product doesn’t perform perfectly on certain edge cases you hadn’t anticipated. Naturally, you go back, make some tweaks, and now you have a new, improved version! You test the new version on Galtea again. But how do you determine whether the new version is actually better than the previous one? Galtea provides a robust way to version your product – attach a version name, ID, and metadata to each of your product/model versions. This means you can easily compare the results of different versions, filter your findings, and clearly see whether your new version is, in fact, an improvement over the old one.

Create a version to track a specific implementation of your product.
version = galtea.versions.create(
    name="v0.1-quickstart",
    product_id=YOUR_PRODUCT_ID,
    description="Initial version for quickstart evaluation"
)
print(f"Created Version with ID: {version.id}")

4. Create Your Tests

This is where we build the core of your evaluation: the test dataset. This dataset consists of input and ground truth pairs. We’ll feed the input to your product/model, and then compare its prediction against the ground truth to see how well it performed. We offer two main ways to generate this crucial test data:
  1. Using Your Product Description: If, for example, your product is a RAG system designed for financial data, we’ll automatically generate a test dataset covering exactly the kinds of queries such a system should excel at.
  2. Using Your Specific Data: Do you have very particular PDFs or text files that your product absolutely must handle correctly? No problem! You can upload these files, and we’ll generate a test dataset directly from them, ensuring your product is tested on your critical content.
In Galtea, this entire test dataset is simply called a “test,” and each individual row within it is a “test case.” For this quickstart, we’ll use the default “Jailbreak” test, which is a type of Red Teaming Test.
test = galtea.tests.get_by_name(product_id=YOUR_PRODUCT_ID, test_name="Jailbreak")
test_cases = galtea.test_cases.list(test_id=test.id)
print(f"Using test '{test.name}' with {len(test_cases)} test cases.")

5. Choose Your Metrics

Anyone familiar with machine learning knows that after comparing predictions to ground truth, you get a “loss,” but you also need a meaningful metric to interpret that loss. This step is all about selecting the right metrics for your evaluation. You can bring your own custom metrics to the table, or you can choose from our wide array of available metrics (like QA, text-specific metrics, IOU, and many more). To evaluate the “Jailbreak” test, we’ll use the “Jailbreak Resilience” metric.
metric = galtea.metrics.get_by_name(name="Jailbreak Resilience")
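If you want to score your outputs from more than one angle, you can fetch several metric objects in the same way. In the sketch below, “Jailbreak Resilience” comes from this quickstart, while the second name is purely hypothetical and may not exist in your metrics library.

# Sketch: fetch several metrics by name. "Answer Relevancy" is a
# hypothetical metric name used only to illustrate the pattern.
metric_names = ["Jailbreak Resilience", "Answer Relevancy"]
metrics = [galtea.metrics.get_by_name(name=n) for n in metric_names]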

6. Run the Evaluation

This step is exactly what it sounds like! We take each test case from your test, pass its input through your product/model, compare the output to the ground truth, and calculate the metric values. You’ll get detailed metric values for every single test case. Now, run evaluations against your test cases.
In a real scenario, your_product_function would be a call to your actual AI model.
# Placeholder for your actual product/model inference function
def your_product_function(input_prompt):
    if "ignore" in input_prompt.lower():
        return "I am programmed to follow safety guidelines and cannot fulfill this request."
    return f"Of course! I will now {input_prompt}"

# The evaluation run for this version and test is created implicitly
# when the first result is submitted.
# Loop through test cases and create evaluations.
for test_case in test_cases:
    actual_output = your_product_function(test_case.input)
    
    galtea.evaluations.create_single_turn(
        version_id=version.id,
        test_case_id=test_case.id,
        metrics=[metric.name],
        actual_output=actual_output
    )

print(f"Submitted evaluations for version {version.id} using test '{test.name}'.")

7. See Your Results

Once you’ve run evaluations across all your tests and thoroughly tested your product/model, it’s time to dive into the insights! You can view all your results on the Galtea platform in a highly intuitive way. Our dashboard allows you to filter and explore your data however you like – by versions, metrics, specific tests, and more. You can see aggregate scores for an overall picture or drill down into the results of individual test cases. Navigate to your product’s “Analytics” tab to see detailed analysis and compare versions.
print(f"View results at: https://platform.galtea.ai/product/{YOUR_PRODUCT_ID}?tab=1")

Next Steps

Congratulations! You’ve completed your first evaluation with Galtea using default assets. This is just the beginning. Explore the platform’s core concepts (products, versions, tests, test cases, and metrics) to tailor Galtea to your specific needs. If you have any questions or need assistance, contact us at [email protected].