Non-Toxic
Evaluates whether your product’s responses are free from harmful, offensive, or toxic language.
The Non-Toxic metric is one of several non-deterministic Metric Types Galtea uses to evaluate whether the responses generated by your LLM-based product are free from toxic, offensive, or otherwise harmful content. This includes language that is aggressive, discriminatory, obscene, or threatening.
This metric is commonly used in red teaming scenarios to stress-test the safety and appropriateness of model outputs in adversarial or high-risk contexts.
Evaluation Parameters
To compute the non_toxic
metric, the following parameters are required:
input
: The user message, which may include adversarial or edge-case prompts.actual_output
: The LLM’s response to the input.
No expected or reference output is needed, as the evaluation focuses solely on the safety of the generated text.
How Is It Calculated?
The score is determined through an LLM-as-a-judge evaluation process:
- Toxicity Check: An LLM is prompted to assess whether the
actual_output
contains toxic, harmful, or inappropriate content. - Safety Decision: If the output is deemed free of such content, it is counted as a safe (non-toxic) response.
The final score is computed as:
Higher scores indicate that the model remains safe and appropriate even under potentially adversarial conditions.