Evaluations allow you to assess how well a specific version of your product performs against a set of test cases by running individual evaluation tasks.

Creating an Evaluation

This is how to create an evaluation:

from galtea import Galtea

# Initialize Galtea SDK
galtea = Galtea(api_key="YOUR_API_KEY")

# Create an evaluation
evaluation = galtea.evaluations.create(
    test_id="YOUR_TEST_ID",
    version_id="YOUR_VERSION_ID"
)

print(f"Evaluation created with ID: {evaluation.id}")

An evaluation links a specific version of your product to a test. This establishes the framework for running individual evaluation tasks.

Running Evaluation Tasks

Once you’ve created an evaluation, you can run evaluation tasks.

For efficiency, you can process multiple evaluation tasks at once using a loop and the galtea.evaluation_tasks.create method:

from datetime import datetime

# Load your test cases
test_cases = galtea.test_cases.list(test_id="YOUR_TEST_ID")

# Evaluate all test cases
for test_case in test_cases:
    # Retrieve relevant context for RAG. This may not apply to all products.
    retrieval_context = your_retriever_function(test_case.input)
    
    # Your product's actual response to the input
    time_before_call = datetime.now()
    response_data = your_product_function(test_case.input, test_case.context, retrieval_context)
    time_after_call = datetime.now()
    
    # Run evaluation task
    galtea.evaluation_tasks.create(
        metrics=["accuracy_v1", "relevance_v1", "context_accuracy"],
        evaluation_id=evaluation.id,
        test_case_id=test_case.id,
        actual_output=response_data.output, # Assuming response_data has an 'output' attribute
        retrieval_context=retrieval_context, # Should be a single string if applicable
        ###################### Optional Parameters ######################
        latency=(time_after_call - time_before_call).total_seconds() * 1000,
        usage_info={
          "input_tokens": response_data.input_tokens,
          "output_tokens": response_data.output_tokens,
          "cache_read_input_tokens": response_data.cache_read_input_tokens,
        },
        cost_info={
          "cost_per_input_token": 0.00002,
          "cost_per_output_token": 0.00005,
          "cost_per_cache_read_input_token": 0.00001,
        },
        # scores=[0.8, None, 0.95] # Example: score for accuracy_v1, platform evaluates relevance_v1, score for context_accuracy
        #################################################################
    )

print("All evaluation tasks submitted")

The metrics parameter specifies which metric types to use for evaluating the task. You can use multiple metrics simultaneously to get different perspectives on performance.
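
You can also run an evaluation task with a narrower metric selection, for example when a metric only applies to part of your test suite. A minimal sketch reusing the objects from the loop above (the metric name is one of the examples used on this page):

# Evaluate a single dimension only, reusing evaluation, test_case, and response_data from the loop above
galtea.evaluation_tasks.create(
    metrics=["relevance_v1"],
    evaluation_id=evaluation.id,
    test_case_id=test_case.id,
    actual_output=response_data.output,
)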

The latency, usage_info, and cost_info parameters are optional but highly recommended: they let you track your product's latency and costs alongside its quality metrics.
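
If you prefer a monotonic clock for the latency measurement (datetime.now() can shift if the system clock changes), here is a minimal sketch using Python's standard library with the same your_product_function placeholder:

import time

# Measure latency in milliseconds with a monotonic clock
start = time.perf_counter()
response_data = your_product_function(test_case.input, test_case.context, retrieval_context)
latency_ms = (time.perf_counter() - start) * 1000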

Directly providing input, expected_output, or context to galtea.evaluation_tasks.create is deprecated when a test_case_id is supplied; those values are sourced from the linked test case. Bypassing the test case in this way also limits analytics and consistency.
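
In practice, when you pass a test_case_id you only need to supply your product's runtime outputs. A sketch of the recommended shape of the call, reusing the objects from the loop above:

# Recommended: the platform sources input, expected_output, and context from the linked test case
galtea.evaluation_tasks.create(
    metrics=["accuracy_v1", "relevance_v1", "context_accuracy"],
    evaluation_id=evaluation.id,
    test_case_id=test_case.id,
    actual_output=response_data.output,
    retrieval_context=retrieval_context,
)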

Customized Test Cases for Evaluation Tasks

If you want to fully customize the Evaluation Task, for example because your Test File does not follow Galtea’s format and therefore test cases have not been created, you can load the test cases from the file yourself and provide their fields to the evaluation task in a similar way:

import pandas as pd
# Load your test data
test = galtea.tests.get(test_id="YOUR_TEST_ID")
test_file_path = galtea.tests.download(test, "path/to/your/directory")
df = pd.read_csv(test_file_path)

# Process all test instances
for index, row in df.iterrows():
    # Retrieve relevant context for RAG
    retrieval_context = your_retriever_function(row['input'])
    
    # Your product's actual response to the input
    actual_output = your_product_function(row['input'], retrieval_context)
    
    # Run evaluation task
    galtea.evaluation_tasks.create(
        metrics=["accuracy_v1", "relevance_v1", "context_accuracy"],
        evaluation_id=evaluation.id,
        input=row['input'],
        actual_output=actual_output,
        expected_output=row['expected_output'],
        context=row['context'],
        retrieval_context=retrieval_context,
    )
print("All evaluation tasks submitted")

This method is only recommended for advanced users who need to customize evaluation tasks. In most cases, using the test cases provided by the test is the best approach.

Note that providing input, expected_output, and context directly to evaluation_tasks.create (instead of linking a test_case_id, as you normally would for test-based evaluations) is a legacy approach and may be deprecated in future SDK versions in favor of always linking formal tests to a test_case_id.

Using it also limits access to the analysis tools and metrics that are available when using standard test cases.

Retrieving Evaluation Results

After running evaluation tasks, you can retrieve the results:

# Get all tasks for an evaluation
tasks = galtea.evaluation_tasks.list(evaluation_id=evaluation.id)

# Analyze results
total_score = 0
for task in tasks:
    task_detail = galtea.evaluation_tasks.get(task.id)
    total_score += task_detail.score
    print(f"Task {task.id}: Score = {task_detail.score}")

avg_score = total_score / len(tasks) if tasks else 0
print(f"Average evaluation score: {avg_score}")

Next Steps