Evaluations allow you to assess how well a specific version of your product performs against a set of test cases by running individual evaluation tasks.

Creating an Evaluation

This is how to create an evaluation:

from galtea import Galtea

# Initialize Galtea SDK
galtea = Galtea(api_key="YOUR_API_KEY")

# Create an evaluation
evaluation = galtea.evaluations.create(
    test_id="YOUR_TEST_ID",
    version_id="YOUR_VERSION_ID"
)

print(f"Evaluation created with ID: {evaluation.id}")

An evaluation links a specific version of your product to a test. This establishes the framework for running individual evaluation tasks.

Running Evaluation Tasks

Once you’ve created an evaluation, you can run evaluation tasks against it.

For efficiency, you can process multiple test cases in a single loop, calling the galtea.evaluation_tasks.create method for each one:

from datetime import datetime

# Load your test cases
test_cases = galtea.test_cases.list(test_id="YOUR_TEST_ID")

# Evaluate all test cases
for test_case in test_cases:
    # Retrieve relevant context for RAG. This may not apply to all products.
    retrieval_context = your_retriever_function(test_case.input)
    
    # Your product's actual response to the input
    time_before_call = datetime.now()
    response = your_product_function(test_case.input, test_case.context, retrieval_context)
    time_after_call = datetime.now()
    
    # Run evaluation task
    galtea.evaluation_tasks.create(
        metrics=["accuracy_v1", "relevance_v1", "context_accuracy"],
        evaluation_id=evaluation.id,
        test_case_id=test_case.id,
        actual_output=response.output,
        retrieval_context=[retrieval_context],  # Include retrieved context for evaluation
        ###################### Optional Parameters ######################
        latency=(time_after_call - time_before_call).total_seconds() * 1000,  # latency in milliseconds
        usage_info={
          "input_tokens": response.inputTokens,
          "output_tokens": response.outputTokens,
          "cache_read_input_tokens": response.cacheReadInputTokens,
        },
        cost_info={
          "cost_per_input_token": 0.00002,
          "cost_per_output_token": 0.00005,
          "cost_per_cache_read_input_token": 0.00001,
        },
        #################################################################
    )

print("All evaluation tasks submitted")

The metrics parameter specifies which metric types to use for evaluating the task. You can use multiple metrics simultaneously to get different perspectives on performance.
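
Because metrics are passed per task, you can also vary them from one test case to the next. The helper below is a minimal, user-side sketch (select_metrics is a hypothetical name, not part of the SDK) that skips the RAG-oriented context_accuracy metric when no retrieval context was produced:

def select_metrics(retrieval_context):
    """Pick metrics per task, skipping the RAG metric when nothing was retrieved."""
    metrics = ["accuracy_v1", "relevance_v1"]
    if retrieval_context:
        metrics.append("context_accuracy")
    return metrics

# Inside the loop above, you would then pass:
#     metrics=select_metrics(retrieval_context),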

The latency, usage_info, and cost_info parameters are optional but highly recommended: they let you track your product’s latency and costs alongside the quality metrics.
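
If you measure latency in several places, a small user-side helper keeps the measurement consistent. This is just a sketch (timed_call is a hypothetical helper, not part of the SDK); it times your product call with datetime, as in the loop above, and returns the latency in milliseconds:

from datetime import datetime

def timed_call(product_fn, *args):
    """Call your product function and return (response, latency_ms)."""
    start = datetime.now()
    response = product_fn(*args)
    latency_ms = (datetime.now() - start).total_seconds() * 1000
    return response, latency_ms

# Usage inside the loop above:
#     response, latency_ms = timed_call(your_product_function, test_case.input, test_case.context, retrieval_context)
#     ...then pass latency=latency_ms to galtea.evaluation_tasks.create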

Customized Test Cases for Evaluation Tasks

If you want to fully customize the Evaluation Task (for example, because your Test File does not follow Galtea’s format and its test cases have therefore not been created), you can load the test cases from the file yourself and pass their fields to the evaluation task in a similar way:

import pandas as pd
# Load your test data
test = galtea.tests.get(test_id="YOUR_TEST_ID")
test_file_path = galtea.tests.download(test, "path/to/your/directory")
df = pd.read_csv(test_file_path)

# Process all test instances
for index, row in df.iterrows():
    # Retrieve relevant context for RAG
    retrieval_context = your_retriever_function(row['input'])
    
    # Your product's actual response to the input
    actual_output = your_product_function(row['input'], row['context'], retrieval_context)
    
    # Run evaluation task
    galtea.evaluation_tasks.create(
        metrics=["accuracy_v1", "relevance_v1", "context_accuracy"],
        evaluation_id=evaluation.id,
        input=row['input'],
        actual_output=actual_output,
        expected_output=row['expected_output'],
        context=row['context'],
        retrieval_context=[retrieval_context],  # Include retrieved context for evaluation
    )
print("All evaluation tasks submitted")

This method is only recommended for advanced users who need to customize the evaluation tasks. For most cases, using the test cases provided by the test is the best approach.

Using this approach will limit access to the analysis tools and metrics that are available when using the standard test cases.

Retrieving Evaluation Results

After running evaluation tasks, you can retrieve the results:

# Get all tasks for an evaluation
tasks = galtea.evaluation_tasks.list(evaluation_id=evaluation.id)

# Analyze results
total_score = 0
for task in tasks:
    task_detail = galtea.evaluation_tasks.get(task.id)
    total_score += task_detail.score
    print(f"Task {task.id}: Score = {task_detail.score}")

avg_score = total_score / len(tasks) if tasks else 0
print(f"Average evaluation score: {avg_score}")

Next Steps