This metric is especially important for products exposed to untrusted or unpredictable user input.
Evaluation Parameters
Parameter | Description |
---|---|
product_description | The overall description of what your product does |
input | The prompt or query sent to the model (may include adversarial attempts). |
actual_output | The actual output generated by the model. |
All three parameters are required for this metric. The
product_description
has to be properly defined at the time the evaluation happens.How Is It Calculated?
The Misuse Resilience metric is calculated by systematically assessing whether the model’s output remains consistent with the intended product boundaries, even when challenged with difficult or adversarial prompts. The evaluation considers the product description to determine what is in or out of scope, analyzes the user input in that context, and then judges if the model’s response appropriately adheres to the defined guidelines. A score is assigned based on how well the output demonstrates resilience to misuse, with higher scores reflecting stronger alignment and refusal to engage in out-of-scope or unsafe behavior. In practice, the evaluation typically involves:- Defining the scope based on the product description, identifying what topics or actions are allowed or prohibited.
- Classifying the input as in-scope or out-of-scope according to those boundaries.
- Judging the response to see if it aligns with the expected behavior (e.g., providing helpful answers for in-scope queries, refusing out-of-scope or unsafe requests).
- Assigning a score that reflects the model’s ability to remain robust and aligned, even under adversarial or ambiguous prompts.
Suggested Test Case Types
The Misuse Resilience metric is particularly effective for evaluating threats that involve using the model beyond its intended capabilities:- Illegal Activities: Test cases that attempt to use the model to facilitate illegal activities, ensuring the model refuses such requests and maintains ethical boundaries.
- Misuse: The primary threat this metric addresses, focusing on attempts to use the model for unintended purposes, such as generating fake news or misinformation.