Deprecated: This method is deprecated and will be removed in a future version. Use one of the following alternatives instead:
  • galtea.inference_results.create_and_evaluate() - Creates an inference result and evaluates it in a single call
  • galtea.evaluations.create(inference_result_id=...) - Evaluates an existing inference result
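
As a rough orientation, the sketch below shows the second alternative being used to evaluate an existing inference result. Only inference_result_id is documented in the note above; the metrics argument is an assumption carried over from create_single_turn(), so check the evaluations.create() reference page for the exact signature.
# Hedged migration sketch: assumes evaluations.create() accepts a metrics list
# shaped like the MetricInput dictionaries documented below on this page.
evaluations = galtea.evaluations.create(
    inference_result_id=inference_result_id,  # ID of an existing inference result
    metrics=[{"name": "Factual Accuracy"}],   # assumed keyword argument
)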

Returns

Returns a list of Evaluation objects, one for each metric provided.
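
Because the returned list contains one Evaluation per metric, you can pair results back to the metrics you requested. A minimal sketch, assuming the list preserves the order of the metrics argument (an assumption; the sentence above only guarantees one result per metric):
requested_metrics = [{"name": "Factual Accuracy"}, {"name": "Answer Relevancy"}]
evaluations = galtea.evaluations.create_single_turn(
    version_id=version_id,
    test_case_id=test_case_id,
    actual_output="Paris",
    metrics=requested_metrics,
)
# Pair each result with the metric that produced it (ordering is assumed here)
for metric, evaluation in zip(requested_metrics, evaluations):
    print(metric["name"], evaluation)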

Usage

This method is versatile and can be used for two main scenarios:
  1. Test-Based Evaluation: When you provide a test_case_id, Galtea evaluates your product’s performance against a predefined challenge.
  2. Production Monitoring: When you set is_production=True and provide an input, Galtea logs and evaluates your product’s performance in a live environment.

Development Testing

For Galtea-hosted (non-self-hosted) metrics:
evaluations = galtea.evaluations.create_single_turn(
    version_id=version_id,
    test_case_id=test_case_id,
    actual_output="Paris",
    metrics=[{"name": "Factual Accuracy"}],
)
If you pre-computed the score for a self-hosted metric:
evaluations = galtea.evaluations.create_single_turn(
    version_id=version_id,
    test_case_id=test_case_id,
    actual_output="Paris",
    metrics=[{"name": self_hosted_metric.name, "score": 0.75}],
)

Production Monitoring

To monitor your product in a production environment, you can create evaluations that are not linked to a specific test case: set the is_production flag to True and provide the input directly.
evaluations = galtea.evaluations.create_single_turn(
    version_id=version_id,
    is_production=True,
    input="What is the capital of France?",
    actual_output="Paris",
    metrics=[{"name": "Answer Relevancy"}],
)

Advanced Usage

You can also create evaluations for self-hosted metrics whose scores are calculated dynamically by subclassing CustomScoreEvaluationMetric, which enables more complex evaluation scenarios.
# Define scoring logic as a class
class PolitenessCheck(CustomScoreEvaluationMetric):
    def __init__(self):
        super().__init__(name="politeness-check")

    def measure(self, *args, actual_output: str | None = None, **kwargs) -> float:
        if actual_output is None:
            return 0.0
        polite_words = ["please", "thank you", "you're welcome"]
        return (
            1.0 if any(word in actual_output.lower() for word in polite_words) else 0.0
        )


# Create the metric in the platform if it doesn't exist yet
# Note: This can be done via the Dashboard too
try:
    metric = galtea.metrics.get_by_name(name="politeness-check")
except Exception:
    metric = None
if metric is None:
    galtea.metrics.create(
        name="politeness-check",
        source="self_hosted",
        description="Checks if polite words appear in the output",
        test_type="ACCURACY",
    )

# Now, evaluate the single turn using both Galtea-hosted and self-hosted metrics
evaluations = galtea.evaluations.create_single_turn(
    is_production=True,
    version_id=version_id,
    input="Hello!",
    actual_output="Hi there! How can I assist you today?",
    metrics=[
        {"name": "Role Adherence"},  # You can use galtea-hosted metrics simultaneously
        {"score": PolitenessCheck()},  # Self-hosted with dynamic scoring
        # Note: no 'name' or 'id' key here; the metric name is taken from the
        # CustomScoreEvaluationMetric instance (set in PolitenessCheck.__init__)
    ],
)

Parameters

version_id (string, required)
The ID of the version you want to evaluate.
metrics (List[Union[str, CustomScoreEvaluationMetric, Dict]], required)
A list of metrics to use for the evaluation. The MetricInput dictionary supports the following keys:
  • id (string, optional): The ID of an existing metric
  • name (string, optional): The name of the metric
  • score (float | CustomScoreEvaluationMetric, optional): For self-hosted metrics only
    • If float: Pre-computed score (0.0 to 1.0). Requires id or name in the dict.
    • If CustomScoreEvaluationMetric: Score will be calculated dynamically. The CustomScoreEvaluationMetric instance must be initialized with name or id. Do NOT provide id or name in the dict when using this option.
For self-hosted metrics, both score options are equally valid: pre-compute as a float, or use CustomScoreEvaluationMetric for dynamic calculation. Galtea-hosted metrics automatically compute scores and should not include a score field.
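The accepted forms can be combined in a single call. A sketch, where metric_id is a placeholder for an existing metric's ID and PolitenessCheck is the example subclass from Advanced Usage above:
evaluations = galtea.evaluations.create_single_turn(
    version_id=version_id,
    test_case_id=test_case_id,
    actual_output="Paris",
    metrics=[
        {"id": metric_id},                           # existing metric referenced by ID
        {"name": "Factual Accuracy"},                # Galtea-hosted metric, scored automatically
        {"name": "politeness-check", "score": 1.0},  # self-hosted metric with a pre-computed score
        {"score": PolitenessCheck()},                # self-hosted metric scored dynamically;
                                                     # the name comes from the instance
    ],
)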
actual_output (string, required)
The actual output produced by the product.
test_case_id (string)
The ID of the test case to be evaluated. Required for non-production evaluations.
input (string)
The input text/prompt. Required for production evaluations where no test_case_id is provided.
is_production (boolean)
Set to True to indicate the evaluation is from a production environment. Defaults to False.
retrieval_context (string)
The context retrieved by your RAG system that was used to generate the actual_output.
latency (float)
Time in milliseconds from the request to the LLM until the response was received.
usage_info (dict[str, int])
Information about token usage during the model call. Possible keys include:
  • input_tokens: Number of input tokens sent to the model.
  • output_tokens: Number of output tokens generated by the model.
  • cache_read_input_tokens: Number of input tokens read from the cache.
cost_info (dict[str, float])
Information about the cost per token during the model call (see the combined example after this parameter list). Possible keys include:
  • cost_per_input_token: Cost per input token sent to the model.
  • cost_per_output_token: Cost per output token generated by the model.
  • cost_per_cache_read_input_token: Cost per input token read from the cache.
conversation_simulator_version (string)
The version of Galtea’s conversation simulator used to generate the user message (input). This should only be provided if the input was generated using the conversation_simulator_service.
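
The optional observability parameters can be passed together with any evaluation. Below is a sketch of a production call that also records retrieval context, latency, token usage, and per-token cost; all numeric values are illustrative:
evaluations = galtea.evaluations.create_single_turn(
    version_id=version_id,
    is_production=True,
    input="What is the capital of France?",
    actual_output="Paris",
    retrieval_context="France's capital and largest city is Paris.",
    latency=350.0,  # milliseconds from request to response
    usage_info={
        "input_tokens": 120,
        "output_tokens": 8,
        "cache_read_input_tokens": 0,
    },
    cost_info={
        "cost_per_input_token": 0.0000005,
        "cost_per_output_token": 0.0000015,
    },
    metrics=[{"name": "Answer Relevancy"}],
)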