Returns

Returns a list of EvaluationTask objects, one for each metric provided.

Usage

This method supports two main scenarios:
  1. Test-Based Evaluation: When you provide a test_case_id, Galtea evaluates your product’s performance against a predefined challenge.
  2. Production Monitoring: When you set is_production=True and provide an input, Galtea logs and evaluates real user interactions.

Example: Test-Based Evaluation with Standard and Custom Metrics

from galtea import Galtea, CustomScoreEvaluationMetric

# Initialize the client (used by the galtea.* calls below)
galtea = Galtea(api_key="YOUR_API_KEY")

# Define your custom metric
class MyAccuracy(CustomScoreEvaluationMetric):
    def __init__(self):
        super().__init__(name="my-accuracy")
    def measure(self, *args, actual_output: str | None = None, **kwargs) -> float:
        """
        Returns 1.0 if 'paris' is in actual_output (case-insensitive), else 0.0.
        All other args/kwargs are accepted but ignored.
        """
        if actual_output is None:
            return 0.0
        return 1.0 if "paris" in actual_output.lower() else 0.0

custom_score_accuracy = MyAccuracy()
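
# Quick local sanity check of the custom metric (illustrative):
assert custom_score_accuracy.measure(actual_output="The capital of France is Paris.") == 1.0
assert custom_score_accuracy.measure(actual_output="The capital of Spain is Madrid.") == 0.0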

# If the metric does not yet exist on the platform, create it first:
galtea.metrics.create(
    name=custom_score_accuracy.name,
    description='Checks for the presence of the keyword "paris" in the output.',
)

# Create evaluation tasks using a mix of standard and custom metrics
evaluation_tasks = galtea.evaluation_tasks.create_single_turn(
    version_id="YOUR_VERSION_ID",
    test_case_id="YOUR_TEST_CASE_ID",
    actual_output="The capital of France is Paris.",
    metrics=[
        "coherence",            # Standard metric
        custom_score_accuracy   # Your custom metric object
    ]
)
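
Since one EvaluationTask is returned per metric, the call above yields two tasks. The attributes available on each task depend on your SDK version; the id attribute used below is an assumption, so check the EvaluationTask reference:

for task in evaluation_tasks:
    print(task.id)  # assumed attribute; consult the EvaluationTask reference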

Example: Production Monitoring

evaluation_tasks = galtea.evaluation_tasks.create_single_turn(
    version_id="YOUR_VERSION_ID",
    is_production=True,
    input="A real user's query from production",
    actual_output="The model's live response.",
    metrics=["relevance", "non-toxic"],
    retrieval_context="Context retrieved by RAG for this query."
)

Parameters

version_id
string
required
The ID of the version you want to evaluate.
metrics
List[string | CustomScoreEvaluationMetric]
required
A list of metrics to use for evaluation. You can provide:
  • Standard metrics as strings (e.g., “coherence”).
  • Custom, locally-scored metrics as objects inheriting from CustomScoreEvaluationMetric.
actual_output
string
required
The actual output produced by the product.
test_case_id
string
The ID of the test case to be evaluated. Required for non-production evaluations.
input
string
The input text/prompt. Required for production evaluations where no test_case_id is provided.
is_production
boolean
Set to True to indicate the task is from a production environment. Defaults to False.
retrieval_context
string
The context retrieved by your RAG system that was used to generate the actual_output.
latency
float
Time in milliseconds from the request to the LLM until the response was received.
usage_info
dict[str, float]
Token usage information (e.g., {"input_tokens": 10, "output_tokens": 5}).
cost_info
dict[str, float]
Cost information for the LLM call.
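
Example: Attaching Latency, Usage, and Cost Data

The observability parameters above are not shown in the earlier examples, so here is a minimal sketch of passing them together with a test-based evaluation. The usage_info keys follow the example above; the cost_info key is hypothetical, since the exact fields depend on what your setup tracks:

evaluation_tasks = galtea.evaluation_tasks.create_single_turn(
    version_id="YOUR_VERSION_ID",
    test_case_id="YOUR_TEST_CASE_ID",
    actual_output="The capital of France is Paris.",
    metrics=["coherence"],
    latency=235.0,  # milliseconds, as documented above
    usage_info={"input_tokens": 10, "output_tokens": 5},
    cost_info={"total_cost": 0.00042},  # hypothetical key; use the fields your setup tracks
)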