Evaluation Workflow Overview
1. Tell Us About Your Product: Provide details about your product so our models can create tailored testing content.
2. Install SDK & Connect: Set up the Galtea Python SDK to interact with the platform.
3. Version Your Product: Track different implementations to compare improvements over time.
4. Create Your Tests: Create a test dataset to generate the test cases you’ll evaluate.
5. Choose Your Metrics: Select from our metrics library or bring your own custom metrics.
6. Run Evaluations: Test your product version with the selected test and metrics.
7. See Your Results: Explore detailed insights and compare versions on the dashboard.
1. Tell Us About Your Product
If you’re new to Galtea, start by creating an account and your first product in the dashboard:
- Go to the dashboard and follow the Onboarding checklist on the product page (the guided setup shown inside your product). It will prompt you for product details and help you generate an API key.
- The product details you provide there are used to tailor generated tests and evaluations.
- Keep your Product ID and API key handy for the SDK steps below.
The product information is crucial: it powers our ability to generate synthetic test data that’s specific to your use case.
2. Get Started with the SDK
You can use Galtea from the dashboard or via the Python SDK. This quickstart uses the Python SDK.
1. Get your API key: In the Galtea dashboard, copy the API key you created during your product page onboarding (or navigate to Settings > Generate API Key).
2. Install the SDK (see the sketch after this list).
3. Connect to the platform with your API key (see the sketch after this list).
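A minimal sketch of the install and connect steps, assuming the package is published on PyPI as galtea and the client is constructed from an API key; check the SDK reference for the exact import path and constructor.

```python
# Install the SDK first (assumed package name):
#   pip install galtea

from galtea import Galtea  # assumed import path; adjust to the SDK reference

# Connect to the platform with the API key from your dashboard.
galtea = Galtea(api_key="YOUR_API_KEY")
```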
3. Version Your Product
Use versions to track iterations of your product and compare evaluation results over time. Each evaluation is tied to a version, so you can see how changes impact metrics. If you completed the Onboarding checklist on your product page, you already have an initial version (created automatically). You can find the IDs in the dashboard (open your product, click the three dots, and select ‘Copy ID’ for the Product ID and Version ID), or list them via the SDK:
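A minimal sketch of recovering the IDs via the SDK; the products.list and versions.list calls and the attribute names are assumptions based on this workflow, so verify them in the SDK reference.

```python
# List your products and versions to recover the IDs used in later steps.
# Resource and attribute names here are assumptions; check the SDK docs.
products = galtea.products.list()
product_id = products[0].id  # or copy the Product ID from the dashboard

versions = galtea.versions.list(product_id=product_id)
version_id = versions[0].id  # the initial version created during onboarding
```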
4. Create Your Tests
A single conversational product typically includes multiple components. In Galtea, you evaluate each component by creating a different test type; think of each test as a lens on the same system:
- RAG pipelines → create a Quality test (type="QUALITY")
- Security or safety aspects → create a Red Teaming test (type="RED_TEAMING")
- Conversational scenarios → create a Scenarios test (type="SCENARIOS")
After creating a test, give it a moment to finish generating test cases. You can track progress in the dashboard.
If you already created a test during onboarding, you can reuse it here: skip the tests.create(...) step and just list its test cases using the test ID from the dashboard.
The example below follows the RAG (Quality) track; the Security (Red Teaming) and Conversational (Scenarios) tracks work the same way with their respective test types.
Quality tests are used to evaluate RAG pipelines and QA-style components within a conversational product. Galtea generates single-turn test cases (input and expected output) from a knowledge base to assess response quality and factual correctness. After a brief wait for test case generation, list the test cases:
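A minimal sketch of creating a Quality test and listing its generated test cases. Only tests.create(...) and type="QUALITY" come from this guide; the other parameter, resource, and attribute names are assumptions to be checked against the SDK reference.

```python
# Create a Quality test for your RAG pipeline. Parameter names other than
# type="QUALITY" are illustrative; a knowledge base / source document is
# typically supplied so Galtea can generate input / expected-output pairs.
test = galtea.tests.create(
    name="rag-quality-test",  # hypothetical test name
    type="QUALITY",
    product_id=product_id,
)

# Test case generation takes a moment; once it finishes, list the cases.
test_cases = galtea.test_cases.list(test_id=test.id)
for tc in test_cases:
    print(tc.input, "->", tc.expected_output)
```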
5. Choose Your Metrics
This step is about choosing what you want to measure. In Galtea, metrics score each evaluation output (often using LLM-as-a-judge techniques for non-deterministic criteria), so you can quantify things like quality, correctness, or safety behavior across your test cases. You can define your own custom metrics or choose from our library of available metrics (QA, text-specific metrics, IOU, and many more).
The recommendation below is for the RAG (Quality) track; the Security and Conversational tracks have their own recommended metrics.
For a RAG-style Quality test, a good default is Factual Accuracy.
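Selecting metrics amounts to naming which ones each evaluation should be scored against; "Factual Accuracy" comes from this guide, and how the list is used is shown in the next step.

```python
# Metric names to score each evaluation against. "Factual Accuracy" is the
# default suggested above; add custom metric names once you have defined
# them in the dashboard or via the SDK (the exact registered names may differ).
metrics = ["Factual Accuracy"]
```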
6. Run the Evaluation
This is where you connect Galtea to your product. For each test case (or scenario), you run your system to produce an output and submit it to Galtea for scoring via evaluations. To do so, you first need to implement an Agent that translates the Galtea-provided information into calls to your product. Here’s a simple example agent:
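A minimal sketch of such an agent. The Agent, AgentInput, and AgentResponse names, the call(...) signature, and the last_user_message_str() helper are assumptions about the SDK interface, and my_rag_pipeline stands in for your own product code; adapt all of them to the SDK reference and your system.

```python
from galtea import Agent, AgentInput, AgentResponse  # assumed interface names

class MyProductAgent(Agent):
    def call(self, input: AgentInput) -> AgentResponse:
        # Translate the Galtea-provided input into a call to your product.
        user_message = input.last_user_message_str()  # assumed helper: the latest user turn
        answer = my_rag_pipeline(user_message)        # your own system goes here
        # Wrap your product's answer in the response type Galtea expects.
        return AgentResponse(content=answer)
```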
The loop below follows the RAG (Quality) track; Red Teaming and Scenarios tests follow the same pattern with their own test cases and metrics.
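A sketch of the evaluation loop, under the assumption that the SDK exposes an evaluations.create(...) call taking the version, test case, your product’s output, and the metric names; the exact call and parameters are not confirmed by this guide, so check the SDK reference.

```python
# Run your product on each generated test case and submit the output for scoring.
# evaluations.create(...) and its parameter names are assumptions; adjust them
# to the actual SDK signature.
for tc in test_cases:
    actual_output = my_rag_pipeline(tc.input)  # or route through MyProductAgent
    galtea.evaluations.create(
        version_id=version_id,
        test_case_id=tc.id,
        actual_output=actual_output,
        metrics=metrics,  # e.g. ["Factual Accuracy"]
    )
```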
If you already computed a score yourself, you can upload it directly, for example: {"name": "my-metric", "score": 0.85}.
7. See Your Results
Once your evaluations are submitted, open your product in the Galtea platform and head to Analytics. Start with the aggregate scores to get a quick baseline, then slice by version, test, and metric to see where performance shifts. The most useful part is drilling into individual test cases: you can inspect the exact prompt/output pair (or the full conversation for scenarios) and understand which specific inputs move the score up or down. This makes it easier to identify failure patterns, validate fixes in new versions, and iterate with confidence as you move toward production.
Next Steps
Congratulations! You’ve completed your first evaluation with Galtea. You can now explore more advanced features like:
- Tracing Agent Operations: Learn how to capture and analyze the internal operations of your AI agent.
- Evaluating Production: Learn how to log and evaluate user queries from your production environment.
Dive Deeper
Explore these concepts to tailor tests, metrics, and workflows to your product:
- Product: A functionality or service being evaluated.
- Version: A specific iteration of a product.
- Test: A set of test cases for evaluating product performance.
- Session: A full conversation between a user and an AI system.
- Inference Result: A single turn in a conversation between a user and the AI.
- Evaluation: The assessment of an inference result using a specific metric’s criteria.