An evaluation in Galtea is a group of Inference Results from a particular session. It serves as the container for all the evaluation tasks that assess how well the product version performs.
Evaluation tasks don’t perform inference on the LLM product themselves. Rather, they assess outputs that have already been generated. Perform inference on your product first, then create the evaluation task.
The core components of an evaluation are its evaluation tasks. Each task represents the assessment of the evaluation using a specific metric type.
An evaluation is created implicitly when you create an evaluation task for a specific session. The evaluation is then linked to all Inference Results from that session.
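The relationship described above can be sketched as a small in-memory model. This is only an illustration of the concept, not the Galtea SDK: `EvaluationStore`, `record_inference`, `create_evaluation_task`, and the session and metric names are all hypothetical, and the sketch assumes one evaluation per session.

```python
from dataclasses import dataclass, field


@dataclass
class InferenceResult:
    """A single output generated by the product version during a session."""
    session_id: str
    output: str


@dataclass
class EvaluationTask:
    """The assessment of an evaluation using one metric type."""
    metric: str


@dataclass
class Evaluation:
    """Container linking a session's inference results to its tasks."""
    session_id: str
    inference_results: list[InferenceResult]
    tasks: list[EvaluationTask] = field(default_factory=list)


class EvaluationStore:
    """Hypothetical in-memory stand-in for the platform (not the Galtea SDK)."""

    def __init__(self) -> None:
        self.inference_results: list[InferenceResult] = []
        self.evaluations: dict[str, Evaluation] = {}

    def record_inference(self, session_id: str, output: str) -> InferenceResult:
        # Inference happens first and is stored independently of any evaluation.
        result = InferenceResult(session_id, output)
        self.inference_results.append(result)
        return result

    def create_evaluation_task(self, session_id: str, metric: str) -> EvaluationTask:
        # Creating a task never runs inference; it only uses stored outputs.
        evaluation = self.evaluations.get(session_id)
        if evaluation is None:
            # Implicit creation: link every inference result from the session.
            linked = [r for r in self.inference_results if r.session_id == session_id]
            evaluation = Evaluation(session_id, linked)
            self.evaluations[session_id] = evaluation
        task = EvaluationTask(metric)
        evaluation.tasks.append(task)
        return task


store = EvaluationStore()
store.record_inference("session-1", "first answer")
store.record_inference("session-1", "second answer")
store.record_inference("session-2", "unrelated answer")

# The first task implicitly creates the evaluation; the second reuses it.
store.create_evaluation_task("session-1", "metric-a")
store.create_evaluation_task("session-1", "metric-b")
evaluation = store.evaluations["session-1"]
```

In this sketch the evaluation for `session-1` ends up holding both of that session's inference results and two tasks, while the output from `session-2` stays unlinked.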
Once you’ve created an evaluation, you can access detailed information and results on the dashboard.
To view evaluation results, you need to visit a product’s Analytics section. For detailed information about a particular evaluation, you can navigate to the Evaluation Tasks tab.
The platform provides an overview of a product’s evaluation results per metric, and analytics comparing different versions of the product.
The Galtea SDK allows you to view and manage evaluations programmatically. This is particularly useful for organizations that want to automate their evaluation process or integrate it into their CI/CD pipeline.
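As one way to picture the CI/CD use case, the sketch below shows a pipeline gate that fails a build when a metric falls short. It deliberately makes no Galtea SDK calls: it assumes you have already retrieved one numeric score per metric by whatever means the SDK offers, and the function names `failing_metrics` and `gate`, the metric names, and the threshold values are all hypothetical.

```python
import sys


def failing_metrics(scores: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return the metrics whose score is missing or below its threshold."""
    failures = []
    for metric, minimum in thresholds.items():
        score = scores.get(metric)
        # A metric with no score is treated as a failure, not silently skipped.
        if score is None or score < minimum:
            failures.append(metric)
    return sorted(failures)


def gate(scores: dict[str, float], thresholds: dict[str, float]) -> int:
    """Return a process exit code: 0 to pass the pipeline, 1 to fail it."""
    failures = failing_metrics(scores, thresholds)
    for metric in failures:
        print(f"below threshold: {metric}", file=sys.stderr)
    return 1 if failures else 0


# Hypothetical inputs: assumed to have been fetched before this step runs.
example_scores = {"metric-a": 0.9, "metric-b": 0.4}
example_thresholds = {"metric-a": 0.8, "metric-b": 0.5}
exit_code = gate(example_scores, example_thresholds)
```

A pipeline step would typically end with `sys.exit(exit_code)` so that a non-zero code stops the deployment.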
Each evaluation exposes the list of inference results that belong to it. Each inference result is a single output generated by the product version during the session.