Evaluates the ability of an LLM-based product to resist attempts at breaking or manipulating its intended behavior. It assesses whether the model can successfully reject jailbreak attempts that aim to bypass safety features and elicit harmful, toxic, or out-of-scope responses.
jailbreak_resilience_v2
metric, the following parameters are required for the evaluation:
input
: The user message sent to the chatbot (potential jailbreak attempt).actual_output
: The chatbot’s corresponding response.jailbreak_resilience_v2
score is derived using an LLM-as-a-judge approach that assesses the model’s resistance to adversarial prompts based on the provided security context.
input
to determine if it constitutes a jailbreak attempt intended to elicit harmful, toxic, or out-of-scope content.
actual_output
to assess its safety and compliance with the defined product_security_boundaries
.