This quickstart walks you through running your first evaluation in Galtea for a conversational product using the Python SDK. It is intended for a technical audience and covers RAG pipelines, conversational scenarios, and security or safety checks.

Evaluation Workflow Overview

1. Tell Us About Your Product: Provide details about your product so our models can create tailored testing content.
2. Install SDK & Connect: Set up the Galtea Python SDK to interact with the platform.
3. Version Your Product: Track different implementations to compare improvements over time.
4. Create Your Tests: Create a test dataset to generate the test cases you’ll evaluate.
5. Choose Your Metrics: Select from our metrics library or bring your own custom metrics.
6. Run Evaluations: Test your product version with the selected test and metrics.
7. See Your Results: Explore detailed insights and compare versions on the dashboard.

1. Tell Us About Your Product

If you’re new to Galtea, start by creating an account and your first product in the dashboard. Then follow the Onboarding checklist on the product page (the guided setup shown inside your product): it prompts you for product details and helps you generate an API key. The product details you provide there are used to tailor generated tests and evaluations.
(Screenshot: AI-assisted product creation form)
Keep your Product ID and API key handy for the SDK steps below.
The product information is crucial – it powers our ability to generate synthetic test data that’s specific to your use case.

2. Get Started with the SDK

You can use Galtea from the dashboard or via the Python SDK. This quickstart uses the Python SDK.
1. Get your API key

In the Galtea dashboard, copy the API key you created during your product page onboarding (or navigate to Settings > Generate API Key).

2. Install the SDK

pip install galtea

3. Connect to the platform

from galtea import Galtea

# Verify the SDK is properly installed by initializing it
# Note: Replace "YOUR_API_KEY" with your actual API key from the Galtea dashboard
galtea = Galtea(api_key="YOUR_API_KEY")
print("Galtea SDK installed successfully!")

3. Version Your Product

Use versions to track iterations of your product and compare evaluation results over time. Each evaluation is tied to a version, so you can see how changes impact metrics. If you completed the Onboarding checklist on your product page, you already have an initial version (created automatically). To find IDs in the dashboard, open your product and click the 3-dot menu on any item, then select Copy ID. Alternatively, list them via the SDK:
products = galtea.products.list()
for p in products:
    print(p.id, p.name)

versions = galtea.versions.list(product_id=products[0].id)
for v in versions:
    print(v.id, v.name)
Then set them once here:
product_id = "your_product_id"
version_id = "your_version_id"

4. Create Your Tests

A single conversational product typically includes multiple components. In Galtea, you evaluate each component by creating a different test type; think of each test as a lens on the same system:
  • RAG pipelines → create an Accuracy test (type="ACCURACY")
  • Security or safety aspects → create a Security & Safety test (type="SECURITY")
  • Conversational scenarios → create a Behavior test (type="BEHAVIOR")
You can create one or multiple test types for the same product, depending on which components you want to evaluate.
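If you do want several test types at once, the helper below is an illustrative sketch, not an authoritative recipe: it assumes the tests.create(...) call accepts the same core arguments (name, type, product_id) for every test type, and that any type-specific parameters (such as a knowledge-base file for Accuracy tests) are added per type. Check the SDK reference before relying on it, since Security and Behavior tests may require different parameters.

```python
# Hypothetical sketch: create one test per component you want to evaluate.
# Assumption: tests.create accepts name/type/product_id for every test type;
# type-specific parameters (e.g. ground_truth_file_path) may be needed.
def create_component_tests(galtea, product_id, test_types):
    """Create one test per (name, type) pair and return them in order."""
    return [
        galtea.tests.create(name=name, type=test_type, product_id=product_id)
        for name, test_type in test_types
    ]

# Example usage:
# tests = create_component_tests(
#     galtea,
#     product_id,
#     [("rag-accuracy", "ACCURACY"), ("safety", "SECURITY"), ("scenarios", "BEHAVIOR")],
# )
```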
After creating a test, give it a moment to finish generating test cases. You can track progress in the dashboard.
If you already created a test during onboarding, you can reuse it here: skip the tests.create(...) step and just list its test cases using the test ID from the dashboard.
Quality (Accuracy) tests are used to evaluate RAG pipelines and QA-style components within a conversational product. Galtea generates single-turn test cases (input and expected output) from a knowledge base to assess response quality and factual correctness.
test = galtea.tests.create(
    name="rag-accuracy-test",
    type="ACCURACY",
    product_id=product_id,
    ground_truth_file_path="path/to/knowledge.md",
    language="english",
    max_test_cases=20,
)
After a brief wait for test case generation, list the test cases:
test_cases = galtea.test_cases.list(test_id=test.id)
print(f"Using test '{test.name}' with {len(test_cases)} test cases.")
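Because test-case generation runs asynchronously, an immediate list(...) call may return fewer cases than you requested. A small polling helper can wait until enough cases exist; this sketch is built only on the test_cases.list call shown above, and the timeout and interval values are arbitrary defaults, not SDK recommendations.

```python
import time

def wait_for_test_cases(galtea, test_id, min_cases=1, timeout=120, interval=5):
    """Poll test_cases.list until at least min_cases exist or timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        cases = galtea.test_cases.list(test_id=test_id)
        if len(cases) >= min_cases:
            return cases
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Test {test_id} has only {len(cases)} case(s) after {timeout}s"
            )
        time.sleep(interval)
```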

5. Choose Your Metrics

Metrics define how evaluation outputs are scored. Galtea provides built-in metrics for common use cases (factual accuracy, faithfulness, toxicity, role adherence, and many more), and you can also create custom metrics with your own judge prompts.
For a RAG-style Quality test, a good default is Factual Accuracy.
metric = galtea.metrics.get_by_name(name="Factual Accuracy")
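If you evaluate several components, you will likely want more than one metric. A tiny helper keeps the lookups in one place; it is a sketch using only the metrics.get_by_name call from above, and any metric names beyond "Factual Accuracy" are assumptions you should verify against the metrics library in the dashboard.

```python
def fetch_metrics(galtea, names):
    """Resolve a list of metric names to metric objects via get_by_name."""
    return [galtea.metrics.get_by_name(name=name) for name in names]

# Example (metric names other than "Factual Accuracy" are hypothetical):
# metrics = fetch_metrics(galtea, ["Factual Accuracy", "Faithfulness"])
```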

6. Run the Evaluation

Now connect Galtea to your product. You define an agent function: a wrapper that takes a test case input, calls your AI system, and returns its output. The SDK then loops through test cases, calls your agent, and scores the results via evaluations. The SDK supports three agent function signatures; pick the one that matches how much context your agent needs.
The quickest way to get started is the simplest signature: your function receives just the latest user message as a string.
def my_agent(user_message: str) -> str:
    # In a real scenario, call your model here
    return f"Your model output to: {user_message}"
All three signatures work with generate() and simulate(). Both sync and async functions are supported. The SDK auto-detects which signature you’re using from the type hint on the first parameter.
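Since async functions are also supported, the same minimal agent can be written as a coroutine. This sketch keeps the single-string signature from above; the placeholder response is illustrative only.

```python
import asyncio

async def my_async_agent(user_message: str) -> str:
    # In a real scenario, await your model's async client here
    return f"Your model output to: {user_message}"

# Quick local check outside of Galtea:
# print(asyncio.run(my_async_agent("Hello")))
```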
for test_case in test_cases:
    # Create a session linked to the test case and version
    session = galtea.sessions.create(
        version_id=version_id,
        test_case_id=test_case.id,
    )

    # Run your agent on the test case input within the session
    inference_result = galtea.inference_results.generate(
        session=session,
        agent=my_agent,
        user_input=test_case.input,
    )

    # Evaluate the full conversation (session)
    galtea.evaluations.create(
        session_id=session.id,
        metrics=[{"name": metric.name}],
    )

print(f"Submitted evaluations for version {version_id} using test '{test.name}'.")
If you already computed a score, you can upload it directly: {"name": "my-metric", "score": 0.85}.
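As a sketch of that option, the wrapper below assumes evaluations.create accepts a "score" field alongside the metric name, as the snippet above suggests; the helper name itself is hypothetical, just making the intent explicit.

```python
# Hypothetical helper: upload a score you computed yourself instead of
# having Galtea compute it. Assumes the {"name": ..., "score": ...}
# metric form shown above is accepted by evaluations.create.
def submit_precomputed_score(galtea, session_id, metric_name, score):
    return galtea.evaluations.create(
        session_id=session_id,
        metrics=[{"name": metric_name, "score": score}],
    )

# Example usage:
# submit_precomputed_score(galtea, session.id, "my-metric", 0.85)
```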

7. See Your Results

Once your evaluations are submitted, open your product in the Galtea platform and head to Analytics. Start with the aggregate scores to get a quick baseline, then slice by version, test, and metric to see where performance shifts. The most useful part is drilling into individual test cases: you can inspect the exact prompt/output pair (or the full conversation for scenarios) and understand which specific inputs move the score up or down. This makes it easier to identify failure patterns, validate fixes in new versions, and iterate with confidence as you move toward production.
print(f"View results at: https://platform.galtea.ai/product/{product_id}")

Next Steps

Congratulations! You’ve completed your first evaluation with Galtea. You can now explore more advanced features.

Dive Deeper

Explore these concepts to tailor tests, metrics, and workflows to your product. If you have any questions or need assistance, contact us at support@galtea.ai.