- Typographical errors.
- Optical Character Recognition (OCR) errors.
- Automatic Speech Recognition (ASR) errors.
- Grammatical mistakes.
- Irrelevant or distracting content.
Evaluation Parameters
To compute the resilience_to_noise metric, the following parameters are required in every turn of the conversation:
- input: The user message in the conversation, which is assumed to contain some form of noise or irrelevant information.
- actual_output: The chatbot’s corresponding response.
This metric specifically evaluates the model’s ability to handle noisy input, so it is not meaningful to apply it to clean or noise-free data.
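As an illustration of the per-turn requirement, the sketch below models a turn as a small record and checks that both fields are present in every turn. The `Turn` class and the `missing_fields` helper are hypothetical names introduced here for illustration; they are not part of the Galtea SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    """One conversation turn holding the two fields the metric needs (hypothetical type)."""
    input: str          # user message, assumed to contain noise
    actual_output: str  # the chatbot's response to that message


def missing_fields(turns):
    """Return (turn_index, field_name) pairs for every empty required field."""
    problems = []
    for i, turn in enumerate(turns):
        for name in ("input", "actual_output"):
            if not getattr(turn, name).strip():
                problems.append((i, name))
    return problems


conversation = [
    Turn(input="helo, i ned to resset my pasword",
         actual_output="Sure, I can help you reset your password."),
    Turn(input="thx!! btw whats teh wether", actual_output=""),
]

# The second turn has no response, so it cannot be scored.
print(missing_fields(conversation))
```

Running a check like this before evaluation makes it easier to spot turns that would otherwise be scored with incomplete data.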
How Is It Calculated?
The resilience_to_noise score is derived using an LLM-as-a-judge approach:
- Noise Robustness Analysis: An LLM is used to analyze the chatbot’s response to noisy input.
- Degradation Assessment: The LLM determines whether the actual_output maintains accuracy and coherence despite the presence of noise in the input.
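The two steps above can be sketched roughly as follows. This is a minimal illustration of the general LLM-as-a-judge pattern, not Galtea's actual implementation: the prompt wording, the YES/NO verdict format, and the names `build_judge_prompt`, `judge_turn`, and `fake_llm` are all assumptions made for this example. The model call is passed in as a plain function so the sketch runs without any external service.

```python
from typing import Callable

# Hypothetical judge prompt; the real prompt and scoring scale are not given in this document.
JUDGE_TEMPLATE = """You are evaluating a chatbot's resilience to noisy input.

User input (may contain typos, OCR/ASR errors, or irrelevant content):
{input}

Chatbot response:
{actual_output}

Does the response stay accurate and coherent despite the noise?
Answer with exactly one word: YES or NO."""


def build_judge_prompt(user_input: str, actual_output: str) -> str:
    """Fill the judge template with one turn's input and response."""
    return JUDGE_TEMPLATE.format(input=user_input, actual_output=actual_output)


def judge_turn(user_input: str, actual_output: str,
               call_llm: Callable[[str], str]) -> bool:
    """Ask the judge model for a verdict and parse it into a boolean."""
    reply = call_llm(build_judge_prompt(user_input, actual_output))
    return reply.strip().upper().startswith("YES")


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "YES"


verdict = judge_turn(
    "helo, i ned to resset my pasword",
    "Sure, I can help you reset your password.",
    fake_llm,
)
print(verdict)
```

Injecting the model call as a parameter keeps the prompt construction and verdict parsing testable on their own; in practice `fake_llm` would be replaced by a call to whichever judge model is in use.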
This metric is inspired by best practices in the open source community and is implemented natively in the Galtea platform.
Suggested Test Case Types
The Resilience To Noise metric is particularly effective for evaluating quality test evolutions that introduce various forms of noise and challenges to input clarity:
- Ambiguous: Test cases that lack clarity or specificity in the query.
- Incorrect: Test cases with factual inaccuracies or misunderstandings in the query.
- Incomplete: Test cases with missing key details in the query.
- Typos: Test cases with misspelled words or accidental letter swaps in the query.
- Slang: Test cases using colloquial or informal expressions.
- Abbreviations: Test cases with shortened forms of words or acronyms.
- Unconventional Phrasing: Test cases with rearranged words or uncommon sentence structures.
- Informal: Test cases incorporating vernacular, text speak, and abbreviations using informal, everyday language.
- Linguistically Diverse: Test cases mixing languages or dialects within the same query.
- Typographic Error: Test cases with grammatical errors, misspellings, or unconventional sentence structures.
- Cognitively Diverse: Test cases presenting thoughts in a non-linear fashion with rapid shifts in topics or ideas.
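To make one of these categories concrete, the sketch below produces "Typos"-style noise by swapping adjacent letters in a clean query. It is only an illustration of what such a test input might look like; the function `swap_adjacent_letters` and its `rate` and `seed` parameters are hypothetical and do not describe how Galtea generates its test evolutions.

```python
import random


def swap_adjacent_letters(text: str, rate: float = 0.2, seed: int = 0) -> str:
    """Swap some adjacent letter pairs to imitate accidental letter swaps."""
    rng = random.Random(seed)  # seeded so the same input gives the same noise
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the pair that was just swapped
        else:
            i += 1
    return "".join(chars)


clean = "How do I reset my password?"
noisy = swap_adjacent_letters(clean, rate=0.3, seed=42)
print(noisy)
```

Because only letter pairs are swapped, spaces and punctuation stay where they were, so the noisy query keeps its word boundaries while individual words become misspelled.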