The Resilience To Noise metric is one of several RAG Metric Types Galtea uses to evaluate your LLM-based chatbot’s ability to maintain response accuracy and coherence when faced with noisy or corrupted input. This includes:
  • Typographical errors.
  • Optical Character Recognition (OCR) errors.
  • Automatic Speech Recognition (ASR) errors.
  • Grammatical mistakes.
  • Irrelevant or distracting content.
This metric is essential for assessing how well your product performs in real-world scenarios where user input may not always be clean or well-formed.
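As a concrete illustration, the snippet below pairs one clean query with a corrupted variant for each noise category. The example strings are invented for illustration and are not drawn from Galtea’s test data.

```python
# Hypothetical examples of each noise category applied to one clean query.
clean_query = "What is the refund policy for damaged items?"

noisy_variants = {
    "typographical": "What is the refnud polcy for damaged items?",
    "ocr": "VVhat is the refund po1icy for darnaged items?",  # 'W'->'VV', 'l'->'1', 'm'->'rn'
    "asr": "what is the re fund policy four damaged items",   # homophones, split words, no punctuation
    "grammatical": "What refund policy is for items damaged?",
    "irrelevant": "Hey, great weather today! What is the refund policy for damaged items? Anyway, gotta run.",
}
```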

Evaluation Parameters

To compute the resilience_to_noise metric, the following parameters are required in every turn of the conversation:
  • input: The user message in the conversation, which is assumed to contain some form of noise or irrelevant information.
  • actual_output: The chatbot’s corresponding response to that input.
Because the metric specifically evaluates the model’s ability to handle noisy input, it is not meaningful to apply it to clean, noise-free data.
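If you assemble turns programmatically, a minimal sketch like the following keeps the two required fields together. The Turn dataclass is a hypothetical convenience wrapper for illustration, not part of the Galtea SDK.

```python
from dataclasses import dataclass

@dataclass
class Turn:
    """One conversation turn as consumed by resilience_to_noise.

    The field names mirror the required evaluation parameters; the
    class itself is a hypothetical wrapper, not a Galtea SDK type.
    """
    input: str          # user message, assumed to contain some noise
    actual_output: str  # the chatbot's corresponding response

turns = [
    Turn(
        input="What is the refnud polcy for damaged items?",
        actual_output="Damaged items can be refunded within 30 days of delivery.",
    ),
]
```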

How Is It Calculated?

The resilience_to_noise score is derived using an LLM-as-a-judge approach:
  1. Noise Robustness Analysis: An LLM judge analyzes the chatbot’s response to the noisy input.
  2. Degradation Assessment: The judge determines whether the actual_output maintains accuracy and coherence despite the noise present in the input.
Scores range from 0 (completely disrupted by noise) to 1 (fully robust to noise), helping you monitor and improve your model’s resilience in practical, noisy environments.
This metric is inspired by best practices in the open source community and is implemented natively in the Galtea platform.
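Galtea’s exact judge prompt and rubric are not reproduced here, but the general LLM-as-a-judge pattern can be sketched as follows. Everything in the snippet is an assumption for illustration: the call_llm helper stands in for any text-in/text-out model client, and the prompt wording and score clamping are illustrative rather than Galtea’s implementation.

```python
JUDGE_PROMPT = """You are evaluating a chatbot's resilience to noisy input.

User input (may contain typos, OCR/ASR errors, grammar mistakes,
or irrelevant content):
{input}

Chatbot response:
{actual_output}

Does the response remain accurate and coherent despite the noise?
Reply with a single number between 0 (completely disrupted by noise)
and 1 (fully robust to noise)."""


def resilience_to_noise(input: str, actual_output: str, call_llm) -> float:
    """Score one turn with an LLM judge.

    call_llm is any function that takes a prompt string and returns the
    judge model's reply as text. Parsing assumes the judge follows the
    numeric-only instruction; production code would need error handling.
    """
    reply = call_llm(JUDGE_PROMPT.format(input=input, actual_output=actual_output))
    score = float(reply.strip())
    return min(max(score, 0.0), 1.0)  # clamp to the documented [0, 1] range
```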

Suggested Test Case Types

The Resilience To Noise metric is particularly effective for evaluating quality test evolutions that introduce various forms of noise and challenges to input clarity (a minimal generation sketch follows the list):
  • Ambiguous: Test cases that lack clarity or specificity in the query.
  • Incorrect: Test cases with factual inaccuracies or misunderstandings in the query.
  • Incomplete: Test cases with missing key details in the query.
  • Typos: Test cases with misspelled words or accidental letter swaps in the query.
  • Slang: Test cases using colloquial or informal expressions.
  • Abbreviations: Test cases with shortened forms of words or acronyms.
  • Unconventional Phrasing: Test cases with rearranged words or uncommon sentence structures.
  • Informal: Test cases written in informal, everyday language, including vernacular and text speak.
  • Linguistic Diverse: Test cases mixing languages or dialects within the same query.
  • Typographic Error: Test cases with grammatical errors, misspellings, or unconventional sentence structures.
  • Cognitively Diverse: Test cases presenting thoughts in a non-linear fashion with rapid shifts in topics or ideas.
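To make the Typos category concrete, here is a minimal sketch of how such a variant could be generated by swapping adjacent letters. The function name and the perturbation strategy are assumptions for illustration; this is not how Galtea generates test evolutions, which may use richer transformations.

```python
import random

def typo_evolution(query: str, swaps: int = 2, seed=None) -> str:
    """Create a 'Typos'-style variant of a query by swapping adjacent letters.

    A deliberately simple, hypothetical perturbation for illustration.
    """
    rng = random.Random(seed)
    chars = list(query)
    # Indices where a character and its right-hand neighbor are both letters.
    candidates = [i for i in range(len(chars) - 1)
                  if chars[i].isalpha() and chars[i + 1].isalpha()]
    for i in rng.sample(candidates, k=min(swaps, len(candidates))):
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

# Prints the query with two adjacent letters swapped at random positions.
print(typo_evolution("What is the refund policy for damaged items?", seed=7))
```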