Evaluation Parameters
To compute the conversation_relevancy metric, every turn of the conversation must include the following parameters:
- input: The user message in the conversation.
- actual_output: The chatbot’s corresponding response.
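As an illustration, the sketch below represents a conversation as a list of turns and checks that each turn carries both parameters. The list-of-dicts layout, the helper name missing_fields, and the sample messages are assumptions made for this example; only the key names input and actual_output come from the description above.

```python
from typing import Dict, List

# The two per-turn parameters named in this section.
REQUIRED_KEYS = ("input", "actual_output")


def missing_fields(turns: List[Dict[str, str]]) -> List[str]:
    """Return one message per required parameter that is absent or empty."""
    problems = []
    for index, turn in enumerate(turns):
        for key in REQUIRED_KEYS:
            if not turn.get(key):
                problems.append(f"turn {index}: missing {key}")
    return problems


# Hypothetical two-turn conversation, invented for illustration.
conversation = [
    {
        "input": "How do I reset my password?",
        "actual_output": "Open Settings, choose Security, then select Reset Password.",
    },
    {
        "input": "I can't find the Security tab.",
        "actual_output": "It sits under your profile icon in the top-right corner.",
    },
]

print(missing_fields(conversation))
```

An empty list means every turn is ready to be evaluated. A turn that lacks actual_output would instead be reported by its index.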
How Is It Calculated?
The conversation_relevancy score is computed using an LLM-as-a-judge approach. The LLM works through the steps below and then assigns a score of 1.0 or 0.0:
- Analyze Context Flow: The LLM reads the conversation sequentially to understand the evolving context.
- Evaluate Each Turn: For every agent response, the LLM determines if it directly addresses the user’s immediate input, makes sense given previous turns, and stays on-topic.
- Consider Natural Conversation Dynamics: The LLM accounts for minor clarifications or brief tangents that serve a purpose, and focuses on significant relevance failures rather than minor imperfections.
- Score 1.0 (Relevant): The agent maintains overall coherence with only minor or justifiable deviations. Responses align with the user’s intent and conversation history.
- Score 0.0 (Irrelevant): The agent demonstrates significant relevance failures, such as repeatedly ignoring context, contradicting established information, or going off-topic without justification.
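The flow above can be sketched roughly as follows. This is a minimal illustration, not the metric's actual implementation: the prompt wording, the function names build_judge_prompt and score_conversation, and the one-word verdict format are all assumptions, and the judge is any callable that sends a prompt to an LLM and returns its text reply. A stub judge stands in for a real model so the example runs offline.

```python
from typing import Callable, Dict, List


def build_judge_prompt(turns: List[Dict[str, str]]) -> str:
    """Lay the conversation out in order so the judge can follow the context."""
    lines = [
        "You are evaluating whether an agent stays relevant in a conversation.",
        "Read the turns in order. For each agent response, check that it",
        "addresses the user's immediate input, makes sense given earlier turns,",
        "and stays on-topic. Allow minor clarifications and purposeful tangents;",
        "penalize only significant relevance failures.",
        "",
    ]
    for number, turn in enumerate(turns, start=1):
        lines.append(f"Turn {number} user: {turn['input']}")
        lines.append(f"Turn {number} agent: {turn['actual_output']}")
    lines.append("")
    lines.append("Answer with exactly one word: RELEVANT or IRRELEVANT.")
    return "\n".join(lines)


def score_conversation(
    turns: List[Dict[str, str]], judge: Callable[[str], str]
) -> float:
    """Map the judge's one-word verdict onto the 1.0 / 0.0 scale."""
    verdict = judge(build_judge_prompt(turns)).strip().upper()
    if verdict == "RELEVANT":
        return 1.0
    if verdict == "IRRELEVANT":
        return 0.0
    raise ValueError(f"Unexpected judge verdict: {verdict!r}")


def stub_judge(prompt: str) -> str:
    """Placeholder for a real LLM call (assumption: always answers RELEVANT)."""
    return "RELEVANT"


turns = [
    {"input": "What is your refund policy?",
     "actual_output": "Refunds are available within 30 days of purchase."},
]
print(score_conversation(turns, stub_judge))
```

Passing the judge in as a callable keeps the scoring logic independent of any particular LLM provider. Raising on an unexpected verdict avoids silently treating a malformed reply as either score.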