Galtea allows you to provide your own pre-calculated scores for evaluation tasks. This is particularly useful for:

  • Deterministic Metrics: When you have custom, rule-based logic to score outputs (e.g., checking for specific keywords or validating JSON structure; a minimal example follows this list).
  • Integrating External Models: When you use your own models for evaluation and want to store their scores in Galtea.
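
As a concrete illustration of the first case, here is a minimal deterministic scorer that checks JSON validity. It is plain Python with no Galtea dependency, and the function name get_score_json_validity is hypothetical:

import json

def get_score_json_validity(output: str) -> float:
    """Rule-based scorer: 1.0 if the output parses as valid JSON, else 0.0."""
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0

# Example usage for a structured-output check
get_score_json_validity('{"status": "ok"}')  # -> 1.0
get_score_json_validity('not json')          # -> 0.0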

Single-Turn Evaluation with Custom Scores

For individual test cases or production logs, you can provide scores directly to the create_single_turn method.

from galtea import Galtea
import os

galtea = Galtea(api_key=os.getenv("GALTEA_API_KEY"))

# --- Configuration ---
VERSION_ID = "your_version_id"
TEST_CASE_ID = "your_test_case_id"

# --- Your Custom Scoring Logic (Placeholders) ---
def get_score_accuracy(output):
    return 1.0 if "correct keyword" in output else 0.0

def get_score_relevance(output):
    return 0.8 # Placeholder for a more complex relevance score

# Your product's response
actual_output = "This response contains the correct keyword."

# Run the evaluation task with your custom scores.
# Scores pair with metrics by position: scores[i] is the score for metrics[i].
galtea.evaluation_tasks.create_single_turn(
    version_id=VERSION_ID,
    test_case_id=TEST_CASE_ID,
    actual_output=actual_output,
    metrics=["custom-accuracy", "custom-relevance"],
    scores=[get_score_accuracy(actual_output), get_score_relevance(actual_output)],
)

print("Evaluation task with custom scores submitted.")

Multi-Turn Conversation with Custom Scores

When evaluating a multi-turn conversation, you can provide per-turn custom scores by looping through the conversation and creating a single-turn task for each interaction.

This approach is recommended for multi-turn custom scoring because it provides turn-level granularity. While evaluation_tasks.create(session_id=...) accepts a scores parameter, it applies the same score list to every turn, which is usually not desired; a sketch of that call appears at the end of this section.

from galtea import Galtea
import os

galtea = Galtea(api_key=os.getenv("GALTEA_API_KEY"))

# --- Configuration ---
VERSION_ID = "your_version_id"

# 1. Create a session to contain the conversation
session = galtea.sessions.create(version_id=VERSION_ID, is_production=True)

# 2. Your conversation data and custom scoring logic
conversation_turns = [
    {"input": "What is the order status?", "output": "Your order #123 has shipped."},
    {"input": "When will it arrive?", "output": "It is expected to arrive in 3-5 business days."}
]

def score_turn_relevance(turn_output):
    # Your relevance logic here
    return 0.9 if "order" in turn_output else 0.2

# 3. Log each turn and create an evaluation task with its custom score
for turn in conversation_turns:
    # Log the inference result to the session
    galtea.inference_results.create(
        session_id=session.id,
        input=turn["input"],
        output=turn["output"]
    )
    
    # Calculate custom score for this specific turn
    relevance_score = score_turn_relevance(turn["output"])
    
    # Create a single-turn evaluation task for this turn
    galtea.evaluation_tasks.create_single_turn(
        version_id=VERSION_ID,
        # Link to production data via input/output
        input=turn["input"],
        actual_output=turn["output"],
        metrics=["custom-turn-relevance"],
        scores=[relevance_score],
        is_production=True
    )

print(f"Logged and evaluated {len(conversation_turns)} turns for session {session.id} with custom scores.")