Creating an Evaluation

Evaluations allow you to assess how well a specific version of your product performs against a set of test cases by running individual evaluation tasks.

Here is how to create an evaluation:

from galtea import Galtea

# Initialize Galtea SDK
galtea = Galtea(api_key="YOUR_API_KEY")

# Create an evaluation
evaluation = galtea.evaluations.create(
    test_id="YOUR_TEST_ID",
    version_id="YOUR_VERSION_ID"
)

print(f"Evaluation created with ID: {evaluation.id}")

An evaluation links a specific version of your product to a test, providing the context in which individual evaluation tasks are run.

Running Evaluation Tasks with Self-Calculated Scores

Once you've created an evaluation, you can run evaluation tasks against it and assign each one a self-calculated score directly.
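
Here is a minimal sketch of a single task. The test case ID, output string, and score value are placeholders; in practice the score comes from your own logic:

# Run a single evaluation task with a self-calculated score
galtea.evaluation_tasks.create(
    evaluation_id=evaluation.id,
    test_case_id="YOUR_TEST_CASE_ID",
    actual_output="Your product's answer for this test case",
    metrics=["metric_accuracy"],
    scores=[0.85],  # a score between 0.0 and 1.0, computed by your own criteria
)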

For efficiency, you can process multiple evaluation tasks using a loop and the galtea.evaluation_tasks.create method:

from datetime import datetime

# Load your test cases
test_cases = galtea.test_cases.list(test_id="YOUR_TEST_ID")

# Evaluate all test cases
for test_case in test_cases:
    # Retrieve relevant context for RAG. This may not apply to all products.
    retrieval_context = your_retriever_function(test_case.input)

    # Your product's actual response to the input, timed to capture latency
    time_before_call = datetime.now()
    response = your_product_function(test_case.input, test_case.context, retrieval_context)
    time_after_call = datetime.now()

    # Run evaluation task
    galtea.evaluation_tasks.create(
        evaluation_id=evaluation.id,
        test_case_id=test_case.id,
        actual_output=response.output,
        retrieval_context=retrieval_context,

        latency=(time_after_call - time_before_call).total_seconds() * 1000,  # milliseconds
        usage_info={
            "input_tokens": response.input_tokens,
            "output_tokens": response.output_tokens,
            "cache_read_input_tokens": response.cache_read_input_tokens,
        },

        metrics=["metric_accuracy", "metric_relevance"],
        # Your own functions that return a score between 0.0 and 1.0 based on your criteria
        scores=[get_score_accuracy(response.output), get_score_relevance(response.output)],
    )

The metrics parameter specifies which metric types to use for evaluating the task. You can use multiple metrics simultaneously to get different perspectives on performance.
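
In the loop above, each entry in scores supplies the value for the metric at the same position: get_score_accuracy for metric_accuracy and get_score_relevance for metric_relevance. The scoring logic itself is entirely yours. As a purely illustrative sketch (the keyword set and overlap heuristic below are assumptions, not part of the SDK), a self-calculated accuracy scorer could be as simple as:

# Purely illustrative scoring helper: any logic that maps an output to a value
# between 0.0 and 1.0 works. Here, a toy keyword-overlap heuristic.
EXPECTED_KEYWORDS = {"paris", "capital", "france"}  # hypothetical criteria

def get_score_accuracy(output: str) -> float:
    tokens = set(output.lower().split())
    return len(EXPECTED_KEYWORDS & tokens) / len(EXPECTED_KEYWORDS)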