Registering a product is the first step to start using the Galtea platform.
Creating a product is not currently supported from the SDK, so you need to do this from the dashboard.
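The snippets below assume an initialized SDK client and a product object retrieved through it. A minimal sketch of that setup, assuming the SDK exposes a Galtea client and a products service with a list method analogous to the other services used below (the product name is hypothetical):

# Initialize the SDK client (assumed pattern; replace with your actual API key)
from galtea import Galtea

galtea = Galtea(api_key="YOUR_API_KEY")

# Look up the product you registered in the dashboard
# (assumes a products.list method analogous to the other services shown below)
products = galtea.products.list()
product = next(p for p in products if p.name == "my-summarization-product")  # hypothetical product name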
With a version, you can track changes to your product over time and compare different implementations.
# Create a version
version = galtea.versions.create(
    name="v1.0",
    product_id=product.id,
    description="Initial version with basic summarization capabilities"
)
More information about creating a version can be found here.
# Create a test
test = galtea.tests.create(
    name="example-test-tutorial",
    type="QUALITY",
    product_id=product.id,
    ground_truth_file_path="path/to/knowledge_file.pdf",  # The ground truth is also known as the knowledge base.
    language="spanish"
)
More information about creating a test can be found here.
Metric Types help you define the criteria for evaluating the performance of your product.
You can create custom Metric Types tailored to your specific use cases.
# Create a standard metric type via API
metric_type_from_api = galtea.metrics.create(
    name="accuracy_v1",
    evaluator_model_name="GPT-4.1",
    criteria="Determine whether the 'actual output' is equivalent to the 'expected output'.",
    evaluation_params=["input", "expected_output", "actual_output"],
)

# Or define a custom metric locally for deterministic checks
from galtea import CustomScoreMetric

# First, it needs to be created in the platform
keyword_metric_type = galtea.metrics.create(
    name="keyword-check",
    description="Checks if the 'actual output' contains the keyword 'expected'.",
)

# Then, you can define your custom metric class
class MyKeywordMetric(CustomScoreMetric):
    def __init__(self):
        super().__init__(name="keyword-check")

    def measure(self, *args, actual_output: str | None = None, **kwargs) -> float:
        """
        Returns 1.0 if 'expected' is in actual_output, else 0.0.
        All other args/kwargs are accepted but ignored.
        """
        if actual_output is None:
            return 0.0
        return 1.0 if "expected" in actual_output else 0.0

keyword_metric = MyKeywordMetric()
More information about creating a metric can be found here.
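Because a CustomScoreMetric computes its score locally, you can sanity-check it before wiring it into evaluation tasks. A quick check of the MyKeywordMetric defined above:

# Quick local checks of the custom metric defined above
assert keyword_metric.measure(actual_output="this contains the expected keyword") == 1.0
assert keyword_metric.measure(actual_output="no match here") == 0.0
assert keyword_metric.measure() == 0.0  # missing actual_output falls back to 0.0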
# Get test cases from the test
test_cases = galtea.test_cases.list(test_id=test.id)

# Run evaluation tasks for each test case
for test_case in test_cases:
    # Retrieve relevant context for RAG. This may not apply to all products.
    retrieval_context = your_retriever_function(test_case.input)

    # Your product's actual response to the input
    actual_output = your_product_function(test_case.input, test_case.context, retrieval_context)

    # Run evaluation task using both standard and custom metrics
    galtea.evaluation_tasks.create_single_turn(
        version_id=version.id,
        test_case_id=test_case.id,
        metrics=[
            metric_type_from_api.name,  # Standard metric by name
            keyword_metric              # Custom metric object
        ],
        actual_output=actual_output,
        retrieval_context=retrieval_context,
    )
More information about launching evaluation tasks can be found here.
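The loop above relies on two application-specific helpers, your_retriever_function and your_product_function, which stand in for your own retrieval and generation code. A purely illustrative sketch, assuming a static document list and a trivial summarizer, just to make the loop runnable end to end:

# Illustrative stand-ins for the helpers used in the loop above.
# Replace these with your real retrieval and generation logic.
DOCUMENTS = [
    "Galtea evaluates LLM products against test cases.",
    "Summaries should preserve the key facts of the source text.",
]

def your_retriever_function(query: str) -> list[str]:
    # Naive keyword-overlap retrieval; a real product would query a vector store.
    return [doc for doc in DOCUMENTS if any(word in doc.lower() for word in query.lower().split())]

def your_product_function(user_input: str, context: str | None, retrieval_context: list[str]) -> str:
    # Trivial "summarizer" that echoes the first retrieved document; a real
    # product would call your model with the input and the retrieved context.
    return retrieval_context[0] if retrieval_context else f"Summary of: {user_input}"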
Once the evaluation tasks are completed, you can retrieve them to analyze the results.
# First, get all evaluations for the product
evaluations = galtea.evaluations.list(product_id=product.id)

# Iterate through all product evaluations
for evaluation in evaluations:
    # Retrieve the evaluation tasks
    evaluation_tasks = galtea.evaluation_tasks.list(evaluation_id=evaluation.id)

    # Print each evaluation task's ID and score
    for task in evaluation_tasks:
        print(f"Task ID: {task.id}, Score: {task.score}")
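If you prefer a summary over per-task output, you can also aggregate the scores client-side. A small sketch, using only the task.score field shown above and assuming tasks without a score yet can be skipped:

# Aggregate results per evaluation (skips tasks that have no score yet)
for evaluation in evaluations:
    evaluation_tasks = galtea.evaluation_tasks.list(evaluation_id=evaluation.id)
    scores = [task.score for task in evaluation_tasks if task.score is not None]
    if scores:
        print(f"Evaluation {evaluation.id}: {len(scores)} tasks, average score {sum(scores) / len(scores):.2f}")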