Strategies
Galtea platform uses a variety of red teaming strategies to generate adversarial test cases and evaluate the robustness of your product.
What is a Red Teaming Strategy?
A red teaming strategy is a specific technique applied to an initial prompt (often derived from a selected threat) to modify or obfuscate it. The goal is to create more sophisticated or evasive inputs that test the limits and probe for vulnerabilities in AI models. These strategies simulate various real-world adversarial attack vectors, helping you evaluate your product’s robustness against attempts to bypass safety mechanisms, content filters, or detection systems.
Basically, we will create prompts based on the selected threats and then apply the selected strategies to those prompts. This allows us to generate multiple variations of each threat, increasing the diversity and effectiveness of your adversarial testing.
In the SDK, you can select one or more red teaming strategies to apply to your test cases. These strategies are applied to all selected threats, generating variations of each threat prompt with the chosen strategy.
Threats vs. Strategies
- Threats define the goal or target of the adversarial input (e.g., Misuse, Data Leakage).
- Strategies define the method or technique used to craft or modify the prompt to achieve that threat (e.g., Base64 encoding, Leetspeak).
Think of threats as the “what” and strategies as the “how” in adversarial testing.
Available Red Teaming Strategies
Below are the main red teaming strategies available in Galtea:
- Original: The unmodified prompt generated directly from the selected threat, without any additional obfuscation or transformation. This serves as the baseline for comparison in red teaming tests and is always included by default (it cannot be deselected).
- Base64: Obfuscate prompts by encoding them in base64, requiring decoding before interpretation. Helps evade simple keyword-based filters.
- Hex: Encode prompts in hexadecimal format to obscure their content. Useful for bypassing string-matching filters.
- Homoglyph: Replace characters with visually similar Unicode characters to evade detection. Can fool naive text-matching systems.
- Leetspeak: Substitute letters with numbers or symbols (e.g., “leet speak”) to bypass filters. Commonly used to evade moderation.
- Morse Code: Encode prompts using Morse code to obscure intent. Requires decoding, making detection harder.
- Rot13: Apply the ROT13 cipher to shift letters and obfuscate the prompt. Simple obfuscation to bypass basic checks.
- Zero Width Insertion: Insert zero-width characters to break up keywords that might be caught by simple string matching filters without affecting readability for humans.
- Emoji Obfuscation: Use emojis to replace or supplement words, making detection harder. Obscures meaning for keyword-based systems.
How Strategies Are Applied
When you generate red teaming tests in Galtea, you can select one or more strategies. Each selected strategy is applied to all chosen threats, creating multiple variations of each threat prompt. This approach increases the diversity and effectiveness of your adversarial testing.
For example, if you select the “Misuse” threat and the “Base64” and “Leetspeak” strategies, Galtea will generate test cases that attempt to use your model for unintended purposes using both base64-encoded and leetspeak-modified prompts.
Selecting Strategies in the Platform
When creating a red teaming test in the Galtea platform:
- Choose your desired threats (e.g., Misuse, Data Leakage, etc.).
- Select one or more red teaming strategies from the list.
- If you specify a maximum number of test cases, Galtea will distribute this count across the combinations of selected threats and applied strategies.
- The final number of generated test cases may vary depending on your max_test_cases setting (if used), the number of threat/strategy combinations, and potentially the ability to successfully apply certain strategies to certain threats.
If you need additional strategies or want to suggest new ones, please contact us at support@galtea.ai.