How to interpret and act on faithfulness and relevancy scores

This article uses analogies and examples to show how to read the scores in the Faithfulness & Relevancy report and what actions to take to improve them.

The ideal scenario is to have both the faithfulness score and relevancy score close to 1; this indicates that the right information is presented accurately.

Understand how scores are calculated

Expert uses standard industry formulas to measure how accurately the generated answer reflects your content (faithfulness) and how well the answer addresses the user’s question (relevancy).

Faithfulness

To calculate faithfulness, Expert:

  1. Identifies all the claims in the response.
  2. Checks each claim to see if it can be inferred from the retrieved context.
  3. Computes the faithfulness score as the number of claims in the response that are supported by the retrieved context, divided by the total number of claims in the response.
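The steps above can be sketched in a few lines of Python. This is a minimal illustration of the ratio only, not Expert's implementation: the `Claim` class, the `supported` flag, and the choice to raise an error for a response with no claims are all assumptions made for the example. Deciding whether a claim is actually supported by the retrieved context happens before this calculation.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    """Hypothetical container for one claim extracted from a response."""
    text: str
    supported: bool  # True if the claim can be inferred from the retrieved context


def faithfulness_score(claims: list[Claim]) -> float:
    """Return supported claims divided by total claims in the response."""
    if not claims:
        # Assumption: a response with no claims is treated as unscorable.
        raise ValueError("the response contains no claims to score")
    supported = sum(1 for claim in claims if claim.supported)
    return supported / len(claims)


# Illustrative data: five claims, four of them supported by the context.
claims = [
    Claim("Preheat the oven to 350 F.", supported=True),
    Claim("Mix flour, cocoa, and sugar.", supported=True),
    Claim("Add two eggs.", supported=True),
    Claim("Bake for 30 minutes.", supported=True),
    Claim("Add 5 cups of salt.", supported=False),
]

print(faithfulness_score(claims))  # 0.8
```

A score of 0.8 here means that one claim in five could not be traced back to the retrieved context.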

Relevancy 

To calculate relevancy, Expert:  

  1. Generates a set of artificial questions based on the response. These questions are designed to reflect the content of the response.
  2. Computes the similarity between the user input and each generated question.
  3. Averages the similarity scores.
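The averaging step can be sketched as follows. The article does not say which similarity measure Expert uses, so this example assumes that the user input and each generated question have already been converted to numeric vectors (embeddings) and compares them with cosine similarity. The function names and the sample vectors are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors (an assumed measure)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def relevancy_score(user_input_vec: list[float],
                    question_vecs: list[list[float]]) -> float:
    """Average the similarity between the user input and each generated question."""
    if not question_vecs:
        raise ValueError("at least one generated question is required")
    similarities = [cosine_similarity(user_input_vec, q) for q in question_vecs]
    return sum(similarities) / len(similarities)


# Toy vectors: one generated question matches the user input, one is unrelated.
user_input = [1.0, 0.0]
generated_questions = [[1.0, 0.0], [0.0, 1.0]]

print(relevancy_score(user_input, generated_questions))  # 0.5
```

In this toy case, the generated questions match the user input only half the time on average, which suggests that the response drifts away from what was asked.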

Interpret the scores

High faithfulness, low relevance

The AI gives you correct information, but the information does not answer the question.

Example scenario:

  • Faithfulness = 1: The AI is being completely honest and accurate with the information it was given.
  • Relevance = 0.2: But the information was not useful enough to answer the question.

Analogy:

You ask a librarian "How do I bake a chocolate cake?" and they hand you a book about the history of cocoa beans. The librarian then gives you a perfectly accurate summary of everything in that book. They are 100% truthful (high faithfulness) but the book does not actually help you bake a cake (low relevance).

Low faithfulness, high relevance

The AI understands what you are asking about but gives you mostly incorrect or fabricated information.

Example scenario:

  • Relevance = 1: The system found perfectly relevant information for your question.
  • Faithfulness = 0.2: But the AI made up or twisted most of the facts when answering.

Analogy:

You ask a librarian "How do I bake a chocolate cake?" and they find the perfect chocolate cake recipe book (great relevance), but they tell you to bake the cake at 900°F for 10 minutes and add 5 cups of salt; these details are incorrect (poor faithfulness).

Take action on the data

Generally, low faithfulness requires a change in the prompting and a review of the kernels threshold, while low relevancy requires a change to either the question or the context.
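As a rough illustration, the guidance above can be written as a small triage helper. The function name, the wording of the returned actions, and the 0.7 cutoff (borrowed from the examples in this article) are assumptions for the sketch, not settings exposed by Expert.

```python
LOW_SCORE_CUTOFF = 0.7  # Assumption: scores below this are treated as "low".


def suggest_actions(faithfulness: float, relevancy: float,
                    cutoff: float = LOW_SCORE_CUTOFF) -> list[str]:
    """Map a pair of scores to the general actions described in this article."""
    actions = []
    if faithfulness < cutoff:
        actions.append("Review the prompting and the kernels threshold")
    if relevancy < cutoff:
        actions.append("Revise the question or the context")
    return actions


# High faithfulness, low relevance (the cocoa bean history book).
print(suggest_actions(faithfulness=1.0, relevancy=0.2))

# Low faithfulness, high relevance (the 900 F cake).
print(suggest_actions(faithfulness=0.2, relevancy=1.0))
```

When both scores fall below the cutoff, the helper returns both actions, and when both are at or above it, the list is empty.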

Example 1: Most queries score above 0.7

When most queries score greater than 0.7, it is a strong indication that the persona and threshold settings are optimized. The queries that scored 0.69 or below will likely require content optimization to see improvements.

Score | Queries requiring content review | Queries with good to excellent results
Faithfulness | 12% | 88%
Relevance | 10% | 89%

Example 2: Most queries need review

When most of the questions require review, start with the persona and threshold settings. This does not mean there are no issues with content, but you should optimize your settings before making content changes.

Score | Queries requiring content review | Queries with good to excellent results
Faithfulness | 91% | 9%
Relevance | 90% | 10%
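The two examples can be summarized as a simple rule of thumb: work out what share of queries falls below the cutoff, then decide where to start. The sketch below assumes a cutoff of 0.7 and reads "most" as more than half of the queries. Both values, along with the function names and sample scores, are illustrative assumptions rather than values defined by Expert.

```python
def share_needing_review(scores: list[float], cutoff: float = 0.7) -> float:
    """Fraction of queries whose score falls below the cutoff."""
    if not scores:
        raise ValueError("at least one score is required")
    flagged = sum(1 for score in scores if score < cutoff)
    return flagged / len(scores)


def where_to_start(scores: list[float], cutoff: float = 0.7) -> str:
    """Suggest a starting point based on how many queries need review."""
    share = share_needing_review(scores, cutoff)
    if share == 0:
        return "no review needed"
    if share > 0.5:  # Assumption: "most" means more than half.
        return "persona and threshold settings"
    return "content optimization"


mostly_good = [0.9, 0.8, 0.75, 0.95, 0.4]  # one of five queries below 0.7
mostly_low = [0.2, 0.3, 0.5, 0.9]          # three of four queries below 0.7

print(where_to_start(mostly_good))  # content optimization
print(where_to_start(mostly_low))   # persona and threshold settings
```

Run the check separately for faithfulness and relevancy scores, since the report lists the two measures on separate rows.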

 
