How to interpret and act on faithfulness and relevancy scores
The ideal scenario is to have both the faithfulness score and relevancy score close to 1; this indicates that the right information is presented accurately.
Understand how scores are calculated
Expert uses standard industry formulas to measure how closely the generated answer sticks to your content (faithfulness) and how well the answer addresses the user’s question (relevancy).
Faithfulness
To calculate faithfulness, Expert:
- Identifies all the claims in the response.
- Checks each claim to see if it can be inferred from the retrieved context.
- Computes the faithfulness score using the formula: (number of claims in the response supported by the retrieved context) / (total number of claims in the response).
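The formula above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Expert's implementation: the function name and the list of per-claim verdicts are hypothetical, and the sketch assumes the claim extraction and support checks have already been done.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Return the fraction of claims supported by the retrieved context.

    claim_supported holds one hypothetical verdict per claim in the response:
    True if the claim can be inferred from the retrieved context, else False.
    """
    if not claim_supported:
        raise ValueError("the response contains no claims to score")
    # True counts as 1 and False as 0, so sum() gives the supported-claim count.
    return sum(claim_supported) / len(claim_supported)


# Illustrative verdicts for a response with five claims, four of them supported.
verdicts = [True, True, True, False, True]
print(faithfulness_score(verdicts))  # 0.8
```

A response in which every claim is supported scores 1, and each unsupported claim lowers the score proportionally.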
Relevancy
To calculate relevancy, Expert:
- Generates a set of artificial questions based on the response. These questions are designed to reflect the content of the response.
- Computes the similarity between the user input and each generated question.
- Averages the similarity scores.
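The averaging step above might look like the following sketch. It assumes that the user input and each generated question have already been turned into numeric vectors and that similarity is measured with cosine similarity. Both are assumptions made for illustration, since the steps above do not say which similarity measure Expert uses. All names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cannot compare a zero vector")
    return dot / (norm_a * norm_b)


def relevancy_score(user_vec: list[float], question_vecs: list[list[float]]) -> float:
    """Average similarity between the user input and each generated question."""
    if not question_vecs:
        raise ValueError("no generated questions to compare against")
    sims = [cosine_similarity(user_vec, q) for q in question_vecs]
    return sum(sims) / len(sims)


# Toy 2-D vectors: one generated question matches the user input exactly,
# the other is unrelated, so the average lands halfway.
user_input = [1.0, 0.0]
generated = [[1.0, 0.0], [0.0, 1.0]]
print(relevancy_score(user_input, generated))  # 0.5
```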
Interpret the scores
High faithfulness, low relevance
The AI gives you correct information, but the information does not answer the question.
Example scenario:
- Faithfulness = 1: The AI is being completely honest and accurate with the information it was given.
- Relevance = 0.2: But the information was not useful enough to answer the question.
Analogy:
You ask a librarian "How do I bake a chocolate cake?" and they hand you a book about the history of cocoa beans. The librarian then gives you a perfectly accurate summary of everything in that book. They are 100% truthful (high faithfulness) but the book does not actually help you bake a cake (low relevance).
Low faithfulness, high relevance
The AI understands what you are asking about but gives you mostly incorrect or fabricated information.
Example scenario:
- Relevance = 1: The system found perfectly relevant information for your question.
- Faithfulness = 0.2: But the AI made up or twisted most of the facts when answering.
Analogy:
You ask a librarian "How do I bake a chocolate cake?" and they find the perfect chocolate cake recipe book (high relevance). But they then tell you to bake the cake at 900°F for 10 minutes and add 5 cups of salt. These details are incorrect and do not come from the book (low faithfulness).
Take action on the data
Generally, low faithfulness requires a change in the prompting and a review of the kernels threshold, while low relevancy requires a change to either the question or the context.
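As a rough illustration, the rule of thumb above can be written as a per-query check. The function name is hypothetical, and the 0.7 cutoff is an assumption borrowed from the examples in this section. Treat it as a sketch of the decision, not as a feature of Expert.

```python
LOW_SCORE_CUTOFF = 0.7  # assumption: taken from the examples in this section


def suggested_actions(faithfulness: float, relevancy: float,
                      cutoff: float = LOW_SCORE_CUTOFF) -> list[str]:
    """Map a query's two scores to the follow-up actions described above."""
    actions = []
    if faithfulness < cutoff:
        actions.append("review the prompting and the threshold setting")
    if relevancy < cutoff:
        actions.append("revise the question or the context")
    return actions


# High faithfulness, low relevancy: only the question or context needs attention.
print(suggested_actions(faithfulness=1.0, relevancy=0.2))
```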
Example 1: Most queries score above 0.7
When most queries score above 0.7, it is a strong indication that the persona and threshold settings are well tuned. Queries that score 0.69 or below will likely need content optimization to improve.
| | Queries requiring content review | Queries with good to excellent result |
|---|---|---|
| Faithfulness | 12% | 88% |
| Relevance | 10% | 89% |
Example 2: Most queries need review
When most queries require review, start with the persona and threshold settings. Content issues may still exist, but optimize your settings before making content changes.
| | Queries requiring content review | Queries with good to excellent result |
|---|---|---|
| Faithfulness | 91% | 9% |
| Relevance | 90% | 10% |
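The two examples can be summarized as a small triage sketch. It assumes you have a list of per-query scores for one metric, that a score below 0.7 counts as needing review, and that "most" means more than half of the queries. The function names and the 50% majority value are assumptions made for illustration.

```python
def share_needing_review(scores: list[float], cutoff: float = 0.7) -> float:
    """Fraction of queries whose score falls below the cutoff."""
    if not scores:
        raise ValueError("no scores to evaluate")
    return sum(s < cutoff for s in scores) / len(scores)


def first_step(scores: list[float], cutoff: float = 0.7,
               majority: float = 0.5) -> str:
    """Suggest where to start: settings if most queries need review, else content."""
    if share_needing_review(scores, cutoff) > majority:
        return "persona and threshold settings"
    return "content optimization for the low-scoring queries"


# Hypothetical score sets resembling Example 1 and Example 2.
mostly_good = [0.9, 0.8, 0.95, 0.4]
mostly_poor = [0.2, 0.3, 0.9, 0.1]
print(first_step(mostly_good))  # content optimization for the low-scoring queries
print(first_step(mostly_poor))  # persona and threshold settings
```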

