How does the Expert team select and validate an LLM?

Last updated
Save as PDF

How do we select a model?

Selecting an LLM is a complex decision matrix where trade-offs will often need to be made. The ideal model balances these criteria effectively while aligning closely with the specific goals and values of the project. Continuous monitoring and evaluation are necessary as models evolve and new ones emerge. There are several factors we use when selecting an LLM model:

Performance: This refers to both the qualitative and quantitative aspects of an LLM's output:
- Qualitatively, the model should generate coherent, contextually appropriate, and nuanced text.
- Quantitatively, it should have low latency and high throughput to handle the required scale of operations.
Cost: The economic aspect of using an LLM can be significant, especially when scaled up. Cost considerations include the direct expense of using or training the model, as well as the computational resources required for operation and maintenance. We conduct cost-benefit analysis to ensure the model's value justifies its expense.
Security: Given the sensitive nature of data that LLMs might process, security is paramount. This includes data encryption at rest, data encryption in transit, and access controls. The model should also have robust mechanisms to prevent data leakage and ensure user privacy, and it must not collect data from usage.
Model security: The model should resist attacks that could cause it to generate incorrect or harmful text.
Scalability: The ability of an LLM to scale efficiently is crucial for handling growing data volumes and user requests without a significant reduction in performance or speed. Scalability also refers to the model's capacity to incorporate new data and adapt to different contexts without extensive retraining.
Fairness and biases: It is essential to assess models for fairness and bias. We evaluate LLMs for their tendency to generate biased outputs, which could perpetuate stereotypes or discriminate against certain groups.
Interoperability: The selected LLM should play well with other systems and technologies in the workflow and ecosystem. This includes easy integration with existing databases, software, and APIs; as well as compatibility with various data formats.
Ethical Considerations: The model should adhere to ethical guidelines for AI use, ensuring that its deployment does not cause harm or adverse societal impacts.
Robustness: The model should be robust to minor input variations. We test model sensitivity against factors like capitalization, punctuation, typos, and noisy neighbors (resource allocation between tenants of various sizes).

Do we use models that have been certified by a standards body?

When we started our efforts, few industry standards existed. We continue to monitor these, and adopt standards when they become available and where they make sense.

How do we control for prompt injection?

For our Completions endpoint we use prompt engineering, which is applied to each query when the endpoint is called.
We also ensure the LLM does not remember the conversation, so each query is independent of any other query.
Further, we are planning to employ an ethical hacking service for an external perspective.
Finally, we are aware of common prompt injection hacks. We periodically test them against the LLM to assess and address potential vulnerabilities.

How do we manage or reduce hallucinations?

Prompts and persona engineering: GenSearch can better understand the context you are working with and the expectations visitors to your site will have when asking questions.
Using your Expert content as the primary source for the GenResponses enables the highest likelihood the base models will not invoke any data that is not from your Expert content.
Using RAG: Retrieval Augmented Generation is used industry-wide as a strategy for more desirable responses.
No retrieval, no answer: Without relevant kernels, GenSearch will not generate a response.

Does GenSearch use RAG?

Yes, Expert GenSearch is a Retrieval Augmented Generation (RAG) system.

How is GenSearch better than ChatGPT or other off-the-shelf / public LLM tools?

Expert uses your site content and customized persona information. It also respects content permissions so responses are tailored to your customer base and business needs, unlike publicly available LLM and AI tools. This enables you to control the information site visitors can find, and guide their experiences to lead to an optimal user experience.