Evaluations allow you to assess how well a specific version of your product performs against a set of test cases by running individual evaluation tasks.

Creating an Evaluation

To create an evaluation:

from galtea import Galtea

# Initialize Galtea SDK
galtea = Galtea(api_key="YOUR_API_KEY")

# Create an evaluation
evaluation = galtea.evaluations.create(
    test_id="YOUR_TEST_ID",
    version_id="YOUR_VERSION_ID"
)

print(f"Evaluation created with ID: {evaluation.id}")

An evaluation links a specific version of your product to a test. This establishes the framework for running individual evaluation tasks.

Running an Evaluation for Custom Metrics

Once you’ve created an evaluation, you can run evaluation tasks and directly assign self-calculated scores.
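
A single task can be submitted on its own. The sketch below is purely illustrative (the metric name, test case ID, output, and score value are placeholders) and uses only parameters that also appear in the full example further down:

# Minimal sketch: one task, one custom metric, one self-calculated score (0.0 to 1.0)
galtea.evaluation_tasks.create(
    metrics=["custom_metric_accuracy"],  # hypothetical custom metric name
    evaluation_id=evaluation.id,
    test_case_id="YOUR_TEST_CASE_ID",
    actual_output="Your product's answer to the test case input",
    scores=[0.85],  # one score per entry in metrics
)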

For efficiency, you can process multiple evaluation tasks at once using a loop and the galtea.evaluation_tasks.create method:

from datetime import datetime

# Load your test cases
test_cases = galtea.test_cases.list(test_id="YOUR_TEST_ID")

# Evaluate all test cases
for test_case in test_cases:
    # Retrieve relevant context for RAG. This may not apply to all products.
    retrieval_context = your_retriever_function(test_case.input)
    
    # Your product's actual response to the input
    time_before_call = datetime.now()
    response = your_product_function(test_case.input, test_case.context, retrieval_context)
    time_after_call = datetime.now()

    # Run evaluation task
    galtea.evaluation_tasks.create(
        metrics=["custom_metric_accuracy", "custom_metric_relevance"],
        evaluation_id=evaluation.id,
        test_case_id=test_case.id,
        actual_output=response.output,
        # Your functions to get a score between 0.0 and 1.0 based on your criteria.
        scores=[get_score_accuracy(response.output), get_score_relevance(response.output)],
        retrieval_context=[retrieval_context],
        ###################### Optional Parameters ######################
        latency=(time_after_call - time_before_call).total_seconds() * 1000,  # milliseconds
        usage_info={
          "input_tokens": response.inputTokens,
          "output_tokens": response.outputTokens,
          "cache_read_input_tokens": response.cacheReadInputTokens,
        },
        cost_info={
          "cost_per_input_token": 0.00002,
          "cost_per_output_token": 0.00005,
          "cost_per_cache_read_input_token": 0.00001,
        },
        #################################################################
    )

print("All evaluation tasks submitted")

The metrics parameter specifies which metric types to use for evaluating the task. You can use multiple metrics simultaneously to get different perspectives on performance.

The latency, usage_info, and cost_info parameters are optional but highly recommended: they let you track your product's latency and cost alongside its quality metrics.
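
The example above measures latency with datetime.now(). If you prefer a monotonic clock that is unaffected by system clock adjustments, a time.perf_counter() based sketch (same millisecond unit for the latency parameter, reusing the placeholder your_product_function from the example) looks like this:

import time

start = time.perf_counter()
response = your_product_function(test_case.input, test_case.context, retrieval_context)
latency_ms = (time.perf_counter() - start) * 1000  # milliseconds, passed as latency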