Evaluation Workflow Overview
1. Create a Product: Define what functionality or service you want to evaluate on the Galtea platform.
2. Install SDK & Connect: Set up the Galtea Python SDK to interact with the platform.
3. Register a Version: Document a specific implementation of your product using the SDK.
4. Select a Test: Use a default Galtea test (or create your own) to challenge your product.
5. Select a Metric: Use a default Galtea metric (or define your own) to evaluate your product’s version.
6. Run Evaluations: Test your product version with the selected test and metrics, then analyze results.
1. Create a Product
Create a product in the Galtea dashboard. Navigate to Products > Create New Product and complete the form. The product description is important, as it may be used to generate synthetic test data.
2. Install the SDK and Connect
1. Get your API key: In the Galtea dashboard, navigate to Settings > Generate API Key and copy your key.
2. Install the SDK (see the sketch below).
3. Connect to the platform (see the sketch below).
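The snippet below sketches the install and connect steps. The package name, the `Galtea` client class, and the `api_key` argument are assumptions about the SDK’s interface rather than confirmed details; check the SDK reference for the exact import path and constructor.

```python
# Install the SDK from PyPI first (shell command, package name assumed):
#   pip install galtea

# Connect to the platform. The client class and constructor argument below
# are assumptions; adjust them to match the SDK reference.
from galtea import Galtea

galtea = Galtea(api_key="YOUR_API_KEY")  # key generated in Settings > Generate API Key
```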
3. Create a Version
Create a version to track a specific implementation of your product.
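The following is an illustrative sketch, assuming the client exposes a `versions.create` helper and that a version is linked to its product by ID; the helper and parameter names are assumptions, not the confirmed SDK signature.

```python
# Hypothetical sketch: register a version of the product created in step 1.
# `versions.create` and its parameter names are assumptions.
version = galtea.versions.create(
    product_id="YOUR_PRODUCT_ID",
    name="v1.0",
    description="Baseline system prompt with the production model backend",
)
```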
4. Use a Default Test
For this quickstart, we’ll use the default “Jailbreak” test, which is a type of Red Teaming Test.
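Assuming default tests can be looked up through the SDK, something like the sketch below would reference the “Jailbreak” test for your product; the `tests.list` helper and the `name` attribute are assumptions.

```python
# Hypothetical lookup of the default "Jailbreak" Red Teaming test.
tests = galtea.tests.list(product_id="YOUR_PRODUCT_ID")
jailbreak_test = next(t for t in tests if t.name == "Jailbreak")
```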
5. Use a Default Metric
To evaluate the “Jailbreak” test, we’ll use the “Jailbreak Resilience” metric.
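The default metric could be referenced in a similar way; again, the `metrics.list` helper shown here is an assumption, and the metric may also be referable by name when creating evaluation tasks.

```python
# Hypothetical lookup of the default "Jailbreak Resilience" metric.
metrics = galtea.metrics.list()
jailbreak_resilience = next(m for m in metrics if m.name == "Jailbreak Resilience")
```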
6. Run Evaluations
Now, run an evaluation by creating evaluation tasks. In a real scenario, `your_product_function` would be a call to your actual AI model.
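The loop below is a sketch of that step, assuming an `evaluation_tasks.create` helper that links the version, test case, metric, and your model’s output; every helper and parameter name here is illustrative, and `your_product_function` is only a placeholder for your actual AI model.

```python
def your_product_function(user_input: str) -> str:
    # Placeholder: in a real scenario this would call your AI model or service.
    return "I can't help with that request."

# Hypothetical evaluation loop: run each test case through your product and
# submit the result for scoring with the selected metric.
for case in jailbreak_test.test_cases:  # attribute name is an assumption
    answer = your_product_function(case.input)
    galtea.evaluation_tasks.create(
        version_id=version.id,
        test_case_id=case.id,
        metrics=["Jailbreak Resilience"],
        input=case.input,
        actual_output=answer,
    )
```

Once the tasks are processed, their results become available in the dashboard for the next step.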
7. View Results
You can view results on the Galtea dashboard. Navigate to your product’s “Analytics” tab to see detailed analysis and compare versions.
Next Steps
Congratulations! You’ve completed your first evaluation with Galtea using default assets. This is just the beginning. Explore these concepts to tailor Galtea to your specific needs:
- Product: A functionality or service being evaluated.
- Version: A specific iteration of a product.
- Test: A set of test cases for evaluating product performance.
- Session: A full conversation between a user and an AI system.
- Inference Result: A single turn in a conversation between a user and the AI.
- Evaluation: A group of evaluable inference results from a particular session.
- Evaluation Task: The assessment of an evaluation using a specific metric type’s criteria.
- Metric Type: A way to evaluate and score product performance.
- Model: A way to keep track of your models’ costs.