What is an Evaluation?

An evaluation in Galtea is a group of Inference Results from a particular session. It serves as the container for all the evaluation tasks that assess how well the product version performs.

Evaluation tasks don’t run inference on the LLM product themselves; they work with outputs that have already been generated. Run inference on your product first, then create the evaluation task.

Evaluation Tasks

The core components of an evaluation are its evaluation tasks. Each task assesses the evaluation with one specific metric type.

An evaluation is created implicitly when you create an evaluation task for a specific session. The evaluation is then linked to all Inference Results from that session.
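
The implicit creation described above can be pictured with a small in-memory model. This is only an illustrative sketch and not the Galtea SDK: `EvaluationStore`, `create_evaluation_task`, and every field name are hypothetical, and the inference outputs are assumed to have been generated beforehand.

```python
from dataclasses import dataclass, field


@dataclass
class Evaluation:
    """Hypothetical container: one evaluation per session."""
    session_id: str
    inference_result_ids: list[str]
    tasks: list[str] = field(default_factory=list)  # metric type names


class EvaluationStore:
    """Hypothetical in-memory stand-in; not the Galtea API."""

    def __init__(self, results_by_session: dict[str, list[str]]):
        # Outputs generated beforehand, keyed by session id.
        self._results_by_session = results_by_session
        self._evaluations: dict[str, Evaluation] = {}

    def create_evaluation_task(self, session_id: str, metric_type: str) -> Evaluation:
        results = self._results_by_session.get(session_id)
        if not results:
            # Tasks never run inference; there must be outputs to assess.
            raise ValueError(f"no inference results for session {session_id!r}")
        # Create the evaluation implicitly on the first task for a session.
        evaluation = self._evaluations.get(session_id)
        if evaluation is None:
            evaluation = Evaluation(session_id, list(results))
            self._evaluations[session_id] = evaluation
        evaluation.tasks.append(metric_type)
        return evaluation


store = EvaluationStore({"session-1": ["result-a", "result-b"]})
first = store.create_evaluation_task("session-1", "factual-accuracy")
second = store.create_evaluation_task("session-1", "relevance")
```

In this sketch both calls return the same evaluation: the first task creates it for the session, and the second task reuses it.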

Evaluation Task

Learn more about evaluation tasks

Results Visualization

Once you’ve created an evaluation, you can access detailed information and results on the dashboard.

To view evaluation results, open the product’s Analytics section. For details about a particular evaluation, go to the Evaluation Tasks tab.

The platform provides:

  • Overview of a product’s evaluation results per metric
  • Analytics comparing different versions of the product
  • Detailed view of individual evaluation tasks

SDK Integration

The Galtea SDK allows you to view and manage evaluations programmatically. This is particularly useful for organizations that want to automate their evaluation process or integrate it into their CI/CD pipeline.

Evaluation Service SDK

Manage evaluations using the Python SDK

Evaluation Properties

Inference Results (InferenceResults[], required)

The list of inference results that belong to this evaluation. Each inference result is a single output generated by the product version during the session.

Session (Session, required)

The session that all the inference results belong to.
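
The two properties above can be read as a simple data structure. The following is a minimal sketch under stated assumptions, not the SDK's actual classes: `EvaluationRecord`, the `session_id` back-reference, and the `output` field are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    id: str


@dataclass(frozen=True)
class InferenceResult:
    id: str
    session_id: str  # hypothetical back-reference to the owning session
    output: str      # a single output generated by the product version


@dataclass(frozen=True)
class EvaluationRecord:
    session: Session                                # required
    inference_results: tuple[InferenceResult, ...]  # required

    def __post_init__(self) -> None:
        # Every inference result must belong to the evaluation's session.
        strays = [r.id for r in self.inference_results
                  if r.session_id != self.session.id]
        if strays:
            raise ValueError(
                f"results outside session {self.session.id!r}: {strays}"
            )


session = Session(id="session-1")
record = EvaluationRecord(
    session=session,
    inference_results=(
        InferenceResult("result-a", "session-1", "first output"),
        InferenceResult("result-b", "session-1", "second output"),
    ),
)
```

The check in `__post_init__` encodes the one constraint the property descriptions state: all inference results share the evaluation's session.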