Creating an Evaluation
Evaluations allow you to assess how well a specific version of your product performs against a set of test cases by running individual evaluation tasks.
This is how to create an evaluation:
from galtea import Galtea

# Initialize the Galtea SDK
galtea = Galtea(api_key="YOUR_API_KEY")

# Create an evaluation
evaluation = galtea.evaluations.create(
    test_id="YOUR_TEST_ID",
    version_id="YOUR_VERSION_ID"
)

print(f"Evaluation created with ID: {evaluation.id}")
An evaluation links a specific version of your product to a test. This establishes the framework for running individual evaluation tasks.
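Because an evaluation is tied to a single version-test pair, comparing several versions against the same test means creating one evaluation per version. A minimal sketch, reusing only the galtea.evaluations.create call shown above (the version IDs are placeholders):

# Sketch: one evaluation per version when comparing versions on the same test
version_ids = ["VERSION_ID_A", "VERSION_ID_B"]  # placeholder IDs

evaluations = [
    galtea.evaluations.create(test_id="YOUR_TEST_ID", version_id=version_id)
    for version_id in version_ids
]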
Running Evaluation Tasks with Self-Calculated Scores
Once you’ve created an evaluation, you can run evaluation tasks and directly assign self-calculated scores. For efficiency, you can process multiple test cases in a loop using the galtea.evaluation_tasks.create method:
from datetime import datetime

# Load your test cases
test_cases = galtea.test_cases.list(test_id="YOUR_TEST_ID")

# Evaluate all test cases
for test_case in test_cases:
    # Retrieve relevant context for RAG. This may not apply to all products.
    retrieval_context = your_retriever_function(test_case.input)

    # Your product's actual response to the input, timed to measure latency
    time_before_call = datetime.now()
    response = your_product_function(test_case.input, test_case.context, retrieval_context)
    time_after_call = datetime.now()

    # Run evaluation task
    galtea.evaluation_tasks.create(
        evaluation_id=evaluation.id,
        test_case_id=test_case.id,
        actual_output=response.output,
        retrieval_context=retrieval_context,
        latency=(time_after_call - time_before_call).total_seconds() * 1000,  # milliseconds
        usage_info={
            "input_tokens": response.input_tokens,
            "output_tokens": response.output_tokens,
            "cache_read_input_tokens": response.cache_read_input_tokens,
        },
        metrics=["metric_accuracy", "metric_relevance"],
        # Your functions returning a score between 0.0 and 1.0 based on your own criteria
        scores=[get_score_accuracy(response.output), get_score_relevance(response.output)],
    )
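The two scoring functions above are placeholders for your own logic; Galtea simply records whatever value between 0.0 and 1.0 you return. Below is a minimal sketch of one way to implement them, assuming your test cases expose an expected_output field; the closures keep the one-argument call signature used in the loop, and none of this is part of the Galtea SDK:

# Hypothetical scorers -- the matching logic is illustrative only.

def make_accuracy_scorer(expected_output: str):
    """Build an accuracy scorer closed over the test case's expected output."""
    def get_score_accuracy(output: str) -> float:
        # Naive exact match: 1.0 if the normalized strings agree, else 0.0
        return 1.0 if output.strip().lower() == expected_output.strip().lower() else 0.0
    return get_score_accuracy

def make_relevance_scorer(query: str):
    """Build a relevance scorer closed over the test case's input."""
    query_words = set(query.lower().split())
    def get_score_relevance(output: str) -> float:
        # Naive relevance: fraction of query words echoed in the output
        if not query_words:
            return 0.0
        return len(query_words & set(output.lower().split())) / len(query_words)
    return get_score_relevance

Inside the loop, you would then build the scorers per test case, for example get_score_accuracy = make_accuracy_scorer(test_case.expected_output).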
The metrics parameter specifies which metric types to use for evaluating the task. You can use multiple metrics simultaneously to get different perspectives on performance.
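In the loop above, scores is a list parallel to metrics, so each score appears to be recorded against the metric at the same position. Assuming that pairing, a task evaluated against a single metric would look like this (placeholder IDs and score value):

# Sketch: a single-metric evaluation task with one self-calculated score
galtea.evaluation_tasks.create(
    evaluation_id=evaluation.id,
    test_case_id=test_case.id,
    actual_output=response.output,
    metrics=["metric_accuracy"],
    scores=[0.87],  # placeholder value in [0.0, 1.0]
)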