Evaluation Workflow Overview
1. Tell Us About Your Product: Provide details about your product so our models can create tailored testing content.
2. Install SDK & Connect: Set up the Galtea Python SDK to interact with the platform.
3. Version Your Product: Track different implementations to compare improvements over time.
4. Create Your Tests: Build test datasets with input and ground truth pairs to challenge your product.
5. Choose Your Metrics: Select from our metrics library or bring your own custom metrics.
6. Run Evaluations: Test your product version with the selected test and metrics.
7. See Your Results: Explore detailed insights and compare versions on the dashboard.
1. Tell Us About Your Product
To test your product effectively, our state-of-the-art LLMs and other models need to understand it inside and out. The more information you provide, the better equipped our models will be to create unique and highly specific testing content tailored just for your product.

Create a product in the Galtea dashboard: navigate to Products > Create New Product and complete the form. The product description is crucial: it powers our ability to generate synthetic test data that’s specific to your use case.
2. Get Started with the SDK
Just like with many other AI tools, you have options! You can do everything through our intuitive dashboard GUI, or you can get programmatic with our Python SDK. For this quickstart, we’ll focus on using the Python SDK.

1. Get your API key: In the Galtea dashboard, navigate to Settings > Generate API Key and copy your key.
2. Install the SDK (see the sketch after this list).
3. Connect to the platform (also shown in the sketch below).
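A minimal sketch of steps 2 and 3 is shown below. It assumes the SDK is published on PyPI as galtea and exposes a Galtea client class that takes your API key; check the SDK reference for the exact package name, import path, and constructor arguments.

```python
# Step 2 (assumed package name): install the SDK from PyPI
#   pip install galtea

# Step 3: connect to the platform. The import path and constructor
# arguments below are assumptions, not the confirmed SDK surface.
import os

from galtea import Galtea

# Read the API key from an environment variable instead of hard-coding it.
galtea = Galtea(api_key=os.environ["GALTEA_API_KEY"])
```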
3. Version Your Product
Imagine this: Galtea shows you that your product doesn’t perform perfectly on certain edge cases you hadn’t anticipated. Naturally, you go back, make some tweaks, and now you have a new, improved version! You test your new version again on Galtea. But how do you compare whether the new model is better than the previous one? Galtea provides a robust way to version your product: attach a version name, ID, and metadata to each of your product/model versions. This means you can easily compare the results of different versions, filter your findings, and clearly see whether your new version is, in fact, an improvement over the old one. Create a version to track a specific implementation of your product, as in the sketch below.
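As a rough sketch, versioning through the SDK might look like the snippet below. The products.get_by_name and versions.create calls, their parameters, and the example values are assumptions; adapt them to the actual SDK reference. It reuses the galtea client from the connection sketch above.

```python
# Hypothetical sketch: register a new version of an existing product.
# Method names (products.get_by_name, versions.create) and parameters are assumptions.
product = galtea.products.get_by_name("My RAG Assistant")

version = galtea.versions.create(
    product_id=product.id,
    name="v1.1.0",  # human-readable version name
    description="Switched retriever to hybrid search",
    # Attach any metadata you need to filter and compare runs later,
    # e.g. base model, prompt revision, or commit hash.
)
```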
4. Create Your Tests

This is where we build the core of your evaluation: the test dataset. This dataset consists of input and ground truth pairs. We’ll feed the input to your product/model, then compare its prediction against the ground truth to see how well it performed. We offer two main ways to generate this crucial test data (a sketch of both follows the list):

- Using Your Product Description: If, for example, your product is a RAG system designed for financial data, we’ll automatically generate a test dataset that a RAG system for financial data should excel at.
- Using Your Specific Data: Do you have very particular PDFs or text files that your product absolutely must handle correctly? No problem! You can upload these files, and we’ll generate a test dataset directly from them, ensuring your product is tested on your critical content.
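A hedged sketch of both options is shown below. The tests.create method, the type values, and the ground_truth_file_path parameter are assumptions rather than the confirmed SDK signature; the file path is a hypothetical example.

```python
# Option A (assumed API): generate a test from the product description alone,
# e.g. an adversarial test such as jailbreak attempts.
jailbreak_test = galtea.tests.create(
    product_id=product.id,
    name="jailbreak-test",
    type="RED_TEAMING",  # assumed test-type value
)

# Option B (assumed API): generate a test grounded in your own documents.
financial_qa_test = galtea.tests.create(
    product_id=product.id,
    name="financial-qa-test",
    type="QUALITY",  # assumed test-type value
    ground_truth_file_path="data/financial_reports.pdf",  # hypothetical local file
)
```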
5. Choose Your Metrics
Anyone familiar with machine learning knows that after comparing predictions to ground truth you get a “loss,” but you also need a meaningful metric to interpret that loss. This step is all about selecting the right metrics for your evaluation. You can bring your own custom metrics to the table, or you can choose from our wide array of available metrics (such as QA, text-specific metrics, IOU, and many more). To evaluate the “Jailbreak” test, we’ll use the “Jailbreak Resilience” metric, as shown in the sketch below.
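The sketch below shows both paths: reusing a library metric by name and registering a custom one. The “Jailbreak Resilience” name comes from this step; the metrics.create call and its criteria parameter are assumptions, as is the custom “Helpfulness” metric used purely for illustration.

```python
# Reuse a library metric by name when running evaluations (see the next step).
selected_metrics = ["Jailbreak Resilience"]

# Or register a custom metric (method name and parameters are assumptions).
helpfulness = galtea.metrics.create(
    name="Helpfulness",
    criteria="The answer directly addresses the user's question and is actionable.",
)
```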
6. Run the Evaluation

This step is exactly what it sounds like! We take each test case from your created tests, pass its input through your product/model, compare the output to the ground truth, and calculate the metric values. You’ll get detailed metric values for every single test case within your tests. Now, run evaluations against your test cases, as in the sketch below. In a real scenario, your_product_function would be a call to your actual AI model.
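A minimal sketch of that loop follows. your_product_function is the placeholder mentioned above; the evaluations.create, test_cases.list, and evaluations.create_single_turn calls and their parameters are assumptions about the SDK surface, and version, jailbreak_test, and selected_metrics come from the earlier sketches.

```python
def your_product_function(user_input: str) -> str:
    """Placeholder: call your actual AI model or RAG pipeline here."""
    return "model answer for: " + user_input


# Assumed API: create an evaluation that ties a product version to a test.
evaluation = galtea.evaluations.create(version_id=version.id, test_id=jailbreak_test.id)

# Assumed API: iterate over generated test cases, run your product, and score each output.
for test_case in galtea.test_cases.list(test_id=jailbreak_test.id):
    actual_output = your_product_function(test_case.input)
    galtea.evaluations.create_single_turn(
        evaluation_id=evaluation.id,
        test_case_id=test_case.id,
        actual_output=actual_output,
        metrics=selected_metrics,  # e.g. ["Jailbreak Resilience"]
    )
```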
7. See Your Results

Once you’ve run evaluations across all your tests and thoroughly tested your product/model, it’s time to dive into the insights! You can view all your results on the Galtea platform in a highly intuitive way. Our dashboard allows you to filter and explore your data however you like: by versions, metrics, specific tests, and more. You can see aggregate scores for an overall picture or drill down into the results of individual test cases. Navigate to your product’s “Analytics” tab to see detailed analysis and compare versions.

Next Steps
Congratulations! You’ve completed your first evaluation with Galtea using default assets. This is just the beginning. Explore these concepts to tailor Galtea to your specific needs:

Product: A functionality or service being evaluated.
Version: A specific iteration of a product.
Test: A set of test cases for evaluating product performance.
Session: A full conversation between a user and an AI system.
Inference Result: A single turn in a conversation between a user and the AI.
Evaluation: The assessment of an inference result using a specific metric’s criteria.
Metric: A way to evaluate and score product performance.
Model: A way to keep track of your models’ costs.