Returns

Returns a tuple containing:
  1. An InferenceResult object
  2. A list of Evaluation objects, one for each metric provided
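
For example, as a minimal sketch that assumes only the return structure described above, the tuple can be unpacked directly and the one-evaluation-per-metric relationship checked:

# Minimal sketch: the call returns (InferenceResult, list of Evaluation),
# with one Evaluation per metric requested
inference_result, evaluations = galtea.inference_results.create_and_evaluate(
    session_id=session_id,
    output="Model response...",
    metrics=["Factual Accuracy", "Answer Relevancy"],
)
assert len(evaluations) == 2  # one Evaluation per metric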

Usage

This method combines creating an inference result with its evaluation in a single convenient call. It’s the recommended approach for single-turn evaluations, replacing the deprecated galtea.evaluations.create_single_turn() method.

Basic Example

# Create inference result and evaluate in a single call
inference_result, evaluations = galtea.inference_results.create_and_evaluate(
    session_id=session_id,
    input="What is the capital of France?",
    output="The capital of France is Paris.",
    metrics=["Factual Accuracy", "Answer Relevancy"],
)

With Pre-computed Scores

# With pre-computed scores for self-hosted metrics
inference_result, evaluations = galtea.inference_results.create_and_evaluate(
    session_id=session_id,
    output="Model response...",
    metrics=[
        {"name": "Factual Accuracy"},
        {"name": self_hosted_metric.name, "score": 0.95},  # Pre-computed score
    ],
)

With Custom Score Calculation

# With dynamic score calculation using CustomScoreEvaluationMetric
from galtea.utils.custom_score_metric import CustomScoreEvaluationMetric


class MyMetric(CustomScoreEvaluationMetric):
    def measure(self, *args, actual_output: str | None = None, **kwargs) -> float:
        # Your custom scoring logic
        return 0.95


custom_metric = MyMetric(name=self_hosted_metric.name)

inference_result, evaluations = galtea.inference_results.create_and_evaluate(
    session_id=session_id,
    output="Model response...",
    metrics=[
        {"name": "Factual Accuracy"},
        {"score": custom_metric},  # Dynamic score calculation
    ],
)

Parameters

session_id (string, required)
The session ID to log the inference result to.

output (string, required)
The generated output/response from the AI model.

metrics (List[Union[str, CustomScoreEvaluationMetric, Dict]], required)
A list of metrics to evaluate against. Supports multiple formats (see the combined sketch after this parameter list):
  • Strings: metric names (e.g., ["accuracy", "relevance"]).
  • CustomScoreEvaluationMetric: objects with dynamic score calculation. Must be initialized with either the name or the id parameter.
  • MetricInput dicts: dictionaries with optional id, name, and score keys.
    • If score is a float: a pre-calculated score (the dict must also include id or name).
    • If score is a CustomScoreEvaluationMetric: dynamic score calculation.

input (string)
The input text/prompt. If not provided, it is inferred from the test case linked to the session.

retrieval_context (string)
Context retrieved by a RAG system, if applicable.

latency (float)
Latency in milliseconds from model invocation to response.

usage_info (dict[str, int])
Token usage information from the model call. Supported keys: input_tokens, output_tokens, cache_read_input_tokens.

cost_info (dict[str, float])
Cost breakdown for the model call. Supported keys: cost_per_input_token, cost_per_output_token, cost_per_cache_read_input_token.

conversation_simulator_version (string)
Version of Galtea's conversation simulator used to generate the input.
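
As a combined sketch (the latency, token counts, and per-token costs below are illustrative placeholder values, and self_hosted_metric and custom_metric refer to the variables defined in the examples above), the different metric formats and the optional parameters can be passed together in one call:

# Combined sketch: mixed metric formats plus optional parameters
inference_result, evaluations = galtea.inference_results.create_and_evaluate(
    session_id=session_id,
    input="What is the capital of France?",
    output="The capital of France is Paris.",
    retrieval_context="Paris is the capital and largest city of France.",
    metrics=[
        "Answer Relevancy",  # string: metric name
        {"name": self_hosted_metric.name, "score": 0.95},  # dict with pre-computed score
        custom_metric,  # CustomScoreEvaluationMetric with dynamic score calculation
    ],
    latency=120.5,  # milliseconds from model invocation to response (placeholder)
    usage_info={  # token counts are placeholders
        "input_tokens": 12,
        "output_tokens": 8,
        "cache_read_input_tokens": 0,
    },
    cost_info={  # per-token costs are placeholders
        "cost_per_input_token": 0.000001,
        "cost_per_output_token": 0.000002,
        "cost_per_cache_read_input_token": 0.0000005,
    },
)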

See Also