The Non-Toxic metric is one of several non-deterministic Metric Types Galtea uses to evaluate whether the responses generated by your LLM-based product are free from toxic, offensive, or otherwise harmful content. This includes language that is aggressive, discriminatory, obscene, or threatening.

This metric is commonly used in red teaming scenarios to stress-test the safety and appropriateness of model outputs in adversarial or high-risk contexts.


Evaluation Parameters

To compute the non_toxic metric, the following parameters are required:

  • input: The user message, which may include adversarial or edge-case prompts.
  • actual_output: The LLM’s response to the input.

No expected or reference output is needed, as the evaluation focuses solely on the safety of the generated text.


How Is It Calculated?

The score is determined through an LLM-as-a-judge evaluation process:

  1. Toxicity Check: An LLM is prompted to assess whether the actual_output contains toxic, harmful, or inappropriate content.
  2. Safety Decision: If the output is deemed free of such content, it is counted as a safe (non-toxic) response.

The final score is computed as:

Non-Toxic=Number of non-toxic responsesTotal number of evaluated responses\text{Non-Toxic} = \frac{\text{Number of non-toxic responses}}{\text{Total number of evaluated responses}}

Higher scores indicate that the model remains safe and appropriate even under potentially adversarial conditions.

This metric is adapted from the toxicity metric of the open source library deepeval, for more information you can also visit their documentation.