This quickstart walks you through running your first evaluation in Galtea for a conversational product using the Python SDK. It is intended for a technical audience and covers RAG pipelines, conversational scenarios, and security or safety checks.

Evaluation Workflow Overview

1. Tell Us About Your Product: Provide details about your product so our models can create tailored testing content.
2. Install SDK & Connect: Set up the Galtea Python SDK to interact with the platform.
3. Version Your Product: Track different implementations to compare improvements over time.
4. Create Your Tests: Create a test dataset to generate the test cases you’ll evaluate.
5. Choose Your Metrics: Select from our metrics library or bring your own custom metrics.
6. Run Evaluations: Test your product version with the selected test and metrics.
7. See Your Results: Explore detailed insights and compare versions on the dashboard.

1. Tell Us About Your Product

If you’re new to Galtea, start by creating an account and your first product in the dashboard. Then follow the Onboarding checklist on the product page (the guided setup shown inside your product): it prompts you for product details and helps you generate an API key. The product details you provide there are used to tailor generated tests and evaluations. Keep your Product ID and API key handy for the SDK steps below.
The product information is crucial – it powers our ability to generate synthetic test data that’s specific to your use case.

2. Get Started with the SDK

You can use Galtea from the dashboard or via the Python SDK. This quickstart uses the Python SDK.
1. Get your API key: In the Galtea dashboard, copy the API key you created during your product page onboarding (or navigate to Settings > Generate API Key).

2. Install the SDK:

pip install galtea

3. Connect to the platform:

from galtea import Galtea

# Initialize the client to confirm the SDK is installed and your API key is set
# Replace "YOUR_API_KEY" with the key from the Galtea dashboard
galtea = Galtea(api_key="YOUR_API_KEY")
print("Galtea SDK initialized successfully!")

3. Version Your Product

Use versions to track iterations of your product and compare evaluation results over time. Each evaluation is tied to a version, so you can see how changes impact metrics. If you completed the Onboarding checklist on your product page, you already have an initial version (created automatically). You can find the IDs in the dashboard: open your product, click the three-dot menu next to the product or version, and select ‘Copy ID’ to copy the Product ID and Version ID. Alternatively, list them via the SDK:
products = galtea.products.list()
for p in products:
    print(p.id, p.name)

versions = galtea.versions.list(product_id=products[0].id)
for v in versions:
    print(v.id, v.name)
Then set them once here:
product_id = "your_product_id"
version_id = "your_version_id"
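If you prefer not to paste IDs by hand, you can also resolve them from the listings above. This is a minimal sketch; the product and version names used here are placeholders, so replace them with your own:

# "My Product" and "v1" are placeholder names for illustration
product = next(p for p in galtea.products.list() if p.name == "My Product")
version = next(v for v in galtea.versions.list(product_id=product.id) if v.name == "v1")

product_id = product.id
version_id = version.id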

4. Create Your Tests

A single conversational product typically includes multiple components. In Galtea, you evaluate each component by creating a different test type; think of each test as a lens on the same system:
  • RAG pipelines → create a Quality test (type="QUALITY")
  • Security or safety aspects → create a Red Teaming test (type="RED_TEAMING")
  • Conversational scenarios → create a Scenarios test (type="SCENARIOS")
You can create one or multiple test types for the same product, depending on which components you want to evaluate (a sketch for the Red Teaming and Scenarios types appears at the end of this step).
After creating a test, give it a moment to finish generating test cases. You can track progress in the dashboard.
If you already created a test during onboarding, you can reuse it here—skip the tests.create(...) step and just list its test cases using the test ID from the dashboard.
Quality tests are used to evaluate RAG pipelines and QA-style components within a conversational product. Galtea generates single-turn test cases (input and expected output) from a knowledge base to assess response quality and factual correctness.
test = galtea.tests.create(
    name="rag-quality-test",
    type="QUALITY",
    product_id=product_id,
    ground_truth_file_path="path/to/knowledge.md",
    language="english",
    max_test_cases=20,
)
After a brief wait for test case generation, list the test cases:
test_cases = galtea.test_cases.list(test_id=test.id)
print(f"Using test '{test.name}' with {len(test_cases)} test cases.")

5. Choose Your Metrics

This step is about choosing what you want to measure. In Galtea, metrics score each evaluation output (often using LLM-as-a-judge techniques for non-deterministic criteria), so you can quantify things like quality, correctness, or safety behavior across your test cases. You can bring your own custom metrics, or choose from the library of available metrics (QA metrics, text-specific metrics, IOU, and more).
For a RAG-style Quality test, a good default is Factual Accuracy.
metric = galtea.metrics.get_by_name(name="Factual Accuracy")
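If you want to score more than one dimension, you can fetch several metrics the same way. The second metric name below is an assumption; check the metrics library in the dashboard for the exact names available to you:

# "Answer Relevancy" is an assumed example name; verify it in the metrics library
metric_names = ["Factual Accuracy", "Answer Relevancy"]
metrics = [galtea.metrics.get_by_name(name=n) for n in metric_names]

You could then pass [{"name": m.name} for m in metrics] to evaluations.create in the next step.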

6. Run the Evaluation

This is where you connect Galtea to your product. For each test case (or scenario), you run your system to produce an output and submit it to Galtea for scoring via evaluations. To do so, you first implement an Agent that translates Galtea’s inputs into calls to your product. Here’s a simple example agent:
from galtea import Agent, AgentInput, AgentResponse

class MyAgent(Agent):
    def call(self, input_data: AgentInput) -> AgentResponse:
        user_message = input_data.last_user_message_str()
        # In a real scenario, you would call your product here, e.g.:
        # model_output = your_product_function(user_message)
        model_output = f"Your model's response to: {user_message}"
        return AgentResponse(content=model_output)
for test_case in test_cases:
    # Create a session linked to the test case and version
    session = galtea.sessions.create(
        version_id=version_id,
        test_case_id=test_case.id,
    )

    # Run a synthetic user conversation against your agent
    inference_result = galtea.inference_results.generate(
        session=session,
        agent=MyAgent(),
        user_input=test_case.input,
    )

    # Evaluate the full conversation (session)
    galtea.evaluations.create(
        session_id=session.id,
        metrics=[{"name": metric.name}],
    )

print(f"Submitted evaluations for version {version_id} using test '{test.name}'.")
If you already computed a score yourself (for example, with a deterministic check), you can upload it directly by including it in the metrics list, e.g. {"name": "my-metric", "score": 0.85}.
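For example, here is a minimal sketch of submitting a precomputed score with the same evaluations.create call used in the loop above; "my-metric" is a placeholder for a custom metric you have defined in Galtea:

# "my-metric" is a placeholder custom metric; the score is one you computed yourself
galtea.evaluations.create(
    session_id=session.id,
    metrics=[{"name": "my-metric", "score": 0.85}],
)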

7. See Your Results

Once your evaluations are submitted, open your product in the Galtea platform and head to Analytics. Start with the aggregate scores to get a quick baseline, then slice by version, test, and metric to see where performance shifts. The most useful part is drilling into individual test cases: you can inspect the exact prompt/output pair (or the full conversation for scenarios) and understand which specific inputs move the score up or down. This makes it easier to identify failure patterns, validate fixes in new versions, and iterate with confidence as you move toward production.
print(f"View results at: https://platform.galtea.ai/product/{product_id}")

Next Steps

Congratulations! You’ve completed your first evaluation with Galtea. From here, you can explore more advanced features.

Dive Deeper

Explore the platform concepts to tailor tests, metrics, and workflows to your product. If you have any questions or need assistance, contact us at [email protected].