Foundation Model Evolution
Keeping pace with innovation periodically requires upgrading or changing the foundation model, the vector embedding model, or both. These changes are managed much like a major upgrade of a traditional software application. The goal is to keep systems secure and efficient, and to provide the best possible results.
Reasons for model updates
- A model provider or embedding service provider no longer supports the current version.
- A new model with equal or better results is introduced.
- Model prices change.
- Better scalability, reliability, and security are offered via a model upgrade.
Process for model adoption
Benchmark testing
All models are evaluated against industry benchmarks before they are considered for use. Generally, a model under consideration must meet or exceed the benchmarks. Because these benchmarks evolve, the goal is to compare results on the same benchmarks for any model-to-model change.
Examples of common industry benchmarks
Feature | Model Example A | Model Example B |
---|---|---|
Input context window | 300,000 tokens | 100,000 tokens |
Maximum output tokens | 5,000 tokens | Unknown |
Supported modalities | Text, image, and video processing | Text only |
Release date | December 2, 2024 | August 9, 2023 |
Knowledge cut-off date | Purposefully not disclosed | January 2023 |
MMLU Benchmark (Massive Multitask Language Understanding) | 85.9% (Chain-of-Thought) | 73.4% (5-shot scenario) |
HumanEval (code generation) | 89% pass@1 | Not available |
MATH Benchmark | 76.6% (Chain-of-Thought) | Not available |
GPQA Benchmark (PhD-level knowledge) | 46.9% | Not available |
IFEval (Instruction following) | 92.1% | Not available |
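The goal of comparing the same benchmarks can be illustrated with a minimal Python sketch. It uses the scores from the example table above, treats Model Example A as the model under consideration, and refuses to rank a benchmark when a score is missing or when the two models were evaluated under different settings. The function name, the verdict labels, and the rule that different settings are not comparable are assumptions made for illustration, not a description of the actual evaluation tooling.

```python
# Scores copied from the example table above.
# Each entry is (score in percent, evaluation setting); None marks "Not available".
MODEL_A = {
    "MMLU": (85.9, "Chain-of-Thought"),
    "HumanEval": (89.0, "pass@1"),
    "MATH": (76.6, "Chain-of-Thought"),
    "GPQA": (46.9, "unspecified"),
    "IFEval": (92.1, "unspecified"),
}
MODEL_B = {
    "MMLU": (73.4, "5-shot"),
    "HumanEval": None,
    "MATH": None,
    "GPQA": None,
    "IFEval": None,
}


def compare_benchmarks(candidate, current):
    """Return a verdict per benchmark for a candidate model versus the current one.

    Hypothetical helper: a benchmark is only ranked when both models report a
    score under the same evaluation setting.
    """
    verdicts = {}
    for name, cand in candidate.items():
        cur = current.get(name)
        if cand is None or cur is None:
            verdicts[name] = "not comparable: score missing"
        elif cand[1] != cur[1]:
            verdicts[name] = "not comparable: different settings"
        elif cand[0] >= cur[0]:
            verdicts[name] = "meets or exceeds"
        else:
            verdicts[name] = "below"
    return verdicts


if __name__ == "__main__":
    for benchmark, verdict in compare_benchmarks(MODEL_A, MODEL_B).items():
        print(f"{benchmark}: {verdict}")
```

With the example data, MMLU is flagged as evaluated under different settings (Chain-of-Thought versus 5-shot), and the remaining benchmarks are flagged because Model Example B has no published score.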
Baseline (RAGAS) testing
Before implementation, all models undergo extensive baseline testing by the CXone Mpower Expert team and follow structured release / upgrade processes.
- The baseline testing consists of a set of questions that are used to generate RAGAS (Retrieval Augmented Generation Assessment) scores.
- The existing model and the new model are compared.
- New models are adopted when these scores are at baseline or better, or when reasoned evidence can explain why a result is not at baseline or higher.
- The new model is exercised on Expert Help (this site) before broad deployment or general availability.
Comparison of Models using RAGAS
Model | Average of context precision | Average of context recall | Average of answer relevancy | Average of faithfulness |
---|---|---|---|---|
Baseline Model A | 0.43 | 0.83 | 0.88 | 0.86 |
Baseline Model B | 0.43 | 0.83 | 0.89 | 0.81 |
Baseline Model C | 0.43 | 0.83 | 0.87 | 0.86 |
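The "at baseline or better" rule can be sketched in Python using the values from the table above. The sketch assumes, purely for illustration, that Baseline Model A is the existing model and the other rows are candidates. The function name `regressions` is hypothetical. A flagged metric does not by itself block adoption, since a result below baseline can still be accepted when reasoned evidence explains it.

```python
METRICS = ("context_precision", "context_recall", "answer_relevancy", "faithfulness")

# Average scores copied from the comparison table above.
SCORES = {
    "Baseline Model A": (0.43, 0.83, 0.88, 0.86),
    "Baseline Model B": (0.43, 0.83, 0.89, 0.81),
    "Baseline Model C": (0.43, 0.83, 0.87, 0.86),
}


def regressions(candidate, baseline):
    """Return {metric: (baseline_score, candidate_score)} for every metric
    where the candidate scores below the baseline."""
    return {
        metric: (base, cand)
        for metric, base, cand in zip(METRICS, baseline, candidate)
        if cand < base
    }


if __name__ == "__main__":
    existing = SCORES["Baseline Model A"]  # assumption: A is the existing model
    for name in ("Baseline Model B", "Baseline Model C"):
        flagged = regressions(SCORES[name], existing)
        status = "needs review" if flagged else "at baseline or better"
        print(f"{name}: {status} {flagged}")
```

Under that assumption, Model B is flagged on faithfulness (0.81 versus 0.86) even though its answer relevancy is higher, and Model C is flagged on answer relevancy (0.87 versus 0.88).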
CXone Mpower Expert model usage
Features that use LLM / foundation models and/or embeddings in the product include:
- Generations (Completions): GenSearch
- Content Assessment: Customers can upload questions and determine if the content in Expert will answer those questions.
- Answer Relevance, Answer Faithfulness:
  - Relevance Reporting: Evaluates how relevant the generated answer is to the question submitted.
  - Faithfulness: Computes how accurately the generated answer reflects the information in the retrieved context.
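As a rough illustration of the faithfulness idea, the open-source RAGAS framework defines faithfulness as the fraction of claims in the generated answer that are supported by the retrieved context. The Python sketch below assumes the claims have already been extracted and checked against the context (in practice an LLM does both steps, which are omitted here). It is not a description of how the product computes the score, and the function name is hypothetical.

```python
from typing import Sequence


def faithfulness(claim_supported: Sequence[bool]) -> float:
    """Fraction of answer claims that the retrieved context supports.

    `claim_supported` holds one verdict per claim found in the generated
    answer: True if the context supports the claim, False otherwise.
    """
    if not claim_supported:
        raise ValueError("the answer contains no claims to verify")
    return sum(claim_supported) / len(claim_supported)


if __name__ == "__main__":
    # Hypothetical answer with four claims, three of them supported.
    print(faithfulness([True, True, False, True]))
```

A score of 1.0 means every claim in the answer can be traced to the retrieved context. Lower values indicate statements that the context does not support.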
Customer deployment of model changes
- For Generations (Completions), customers receive a model adoption period to test new models before implementation. In some cases, such as when a model provider discontinues a model or its support, customers may have to adopt changes immediately.
- The model adoption period is determined by NICE.
- The intent is to provide a minimum of 6 weeks for customers to evaluate new models.
- The model adoption period starts after the benchmark testing and baseline testing described above have concluded.
- Customers can evaluate the new models if they wish, but no evaluation or action is required from them.
- If questions arise during customer testing, customers can create a support ticket and engage their Product Success Manager.
- For Content Assessment, Answer Relevance, Answer Faithfulness, and Content Editor (creation, update, and other functions), customers receive advance notice of model changes, and a migration schedule is provided.
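For planning purposes, the intended minimum adoption window for Generations (Completions) can be expressed as simple date arithmetic. The Python sketch below assumes a hypothetical date on which benchmark and baseline testing concluded. The actual period is determined by NICE and can be shorter when a provider discontinues a model, so the result is only the earliest date implied by the stated six-week intent.

```python
from datetime import date, timedelta

MIN_ADOPTION_WEEKS = 6  # "a minimum of 6 weeks", per the list above


def earliest_cutover(testing_concluded: date, weeks: int = MIN_ADOPTION_WEEKS) -> date:
    """Earliest date a new model would replace the old one if the full
    intended adoption period is granted."""
    return testing_concluded + timedelta(weeks=weeks)


if __name__ == "__main__":
    concluded = date(2025, 1, 6)  # hypothetical date, for illustration only
    print(earliest_cutover(concluded))  # prints 2025-02-17
```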