Returns

Returns a tuple containing:
  1. An InferenceResult object
  2. A list of Evaluation objects, one for each metric provided
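
For example, as a minimal sketch that assumes only the return structure described above, the tuple can be unpacked directly and the one-evaluation-per-metric relationship checked:

# Minimal sketch: the call returns (InferenceResult, list of Evaluation),
# with one Evaluation per metric requested
inference_result, evaluations = galtea.inference_results.create_and_evaluate(
    session_id=session_id,
    output="Model response...",
    metrics=["Factual Accuracy", "Answer Relevancy"],
)
assert len(evaluations) == 2  # one Evaluation per metric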

Usage

This method combines creating an inference result with its evaluation in a single convenient call. It’s the recommended approach for single-turn evaluations, replacing the deprecated galtea.evaluations.create_single_turn() method.

Basic Example

# Create inference result and evaluate in a single call
inference_result, evaluations = galtea.inference_results.create_and_evaluate(
    session_id=session_id,
    input="What is the capital of France?",
    output="The capital of France is Paris.",
    metrics=["Factual Accuracy", "Answer Relevancy"],
)

With Pre-computed Scores

# With pre-computed scores for self-hosted metrics
inference_result, evaluations = galtea.inference_results.create_and_evaluate(
    session_id=session_id,
    output="Model response...",
    metrics=[
        {"name": "Factual Accuracy"},
        {"name": self_hosted_metric.name, "score": 0.95},  # Pre-computed score
    ],
)

With Custom Score Calculation

# With dynamic score calculation using CustomScoreEvaluationMetric
from galtea.utils.custom_score_metric import CustomScoreEvaluationMetric


class MyMetric(CustomScoreEvaluationMetric):
    def measure(self, *args, actual_output: str | None = None, **kwargs) -> float:
        # Your custom scoring logic
        return 0.95


custom_metric = MyMetric(name=self_hosted_metric.name)

inference_result, evaluations = galtea.inference_results.create_and_evaluate(
    session_id=session_id,
    output="Model response...",
    metrics=[
        {"name": "Factual Accuracy"},
        {"score": custom_metric},  # Dynamic score calculation
    ],
)

Parameters

session_id (string, required)
The session ID to log the inference result to.

output (string, required)
The generated output/response from the AI model.

metrics (List[Union[str, CustomScoreEvaluationMetric, Dict]], required)
A list of metrics to evaluate against. Supports multiple formats (see the combined sketch after this parameter list):
  • Strings: metric names (e.g., ["accuracy", "relevance"]).
  • CustomScoreEvaluationMetric: objects with dynamic score calculation. Must be initialized with either the name or the id parameter.
  • MetricInput dicts: dictionaries with optional id, name, and score keys.
    • If score is a float: a pre-calculated score (the dict must also include id or name).
    • If score is a CustomScoreEvaluationMetric: dynamic score calculation.

input (string)
The input text/prompt. If not provided, it is inferred from the test case linked to the session.

retrieval_context (string)
Context retrieved by a RAG system, if applicable.

latency (float)
Latency in milliseconds from model invocation to response.

usage_info (dict[str, int])
Token usage information from the model call. Supported keys: input_tokens, output_tokens, cache_read_input_tokens.

cost_info (dict[str, float])
Cost breakdown for the model call. Supported keys: cost_per_input_token, cost_per_output_token, cost_per_cache_read_input_token.

conversation_simulator_version (string)
Version of Galtea's conversation simulator used to generate the input.
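
As a combined sketch (the latency, token counts, and per-token costs below are illustrative placeholder values, and self_hosted_metric and custom_metric refer to the variables defined in the examples above), the different metric formats and the optional parameters can be passed together in one call:

# Combined sketch: mixed metric formats plus optional parameters
inference_result, evaluations = galtea.inference_results.create_and_evaluate(
    session_id=session_id,
    input="What is the capital of France?",
    output="The capital of France is Paris.",
    retrieval_context="Paris is the capital and largest city of France.",
    metrics=[
        "Answer Relevancy",  # string: metric name
        {"name": self_hosted_metric.name, "score": 0.95},  # dict with pre-computed score
        custom_metric,  # CustomScoreEvaluationMetric with dynamic score calculation
    ],
    latency=120.5,  # milliseconds from model invocation to response (placeholder)
    usage_info={  # token counts are placeholders
        "input_tokens": 12,
        "output_tokens": 8,
        "cache_read_input_tokens": 0,
    },
    cost_info={  # per-token costs are placeholders
        "cost_per_input_token": 0.000001,
        "cost_per_output_token": 0.000002,
        "cost_per_cache_read_input_token": 0.0000005,
    },
)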

See Also