Understanding LLM Hallucinations and how to mitigate them

Understanding LLM Hallucination

Large language models (LLMs) can strike up conversations as humans do. Because of their natural language understanding capabilities, some expect LLMs to always produce coherent and correct responses. While generative AI models like ChatGPT usually generate a correct answer, there are moments when they give seemingly logical and fact-based answers that are entirely wrong. ML professionals call this hallucinating.

Users who are unaware that an answer is incorrect may apply AI-generated output to their tasks and decisions. This can lead to negative outcomes, particularly in fact-driven fields like law, healthcare, and finance. This article explores LLM hallucination through real-world examples, causes, and mitigation strategies.

What do LLMs hallucinate?

Despite exposure to vast and varied datasets, large language models are not designed to correlate outputs with known facts. In a recent study, ChatGPT exhibited a hallucination rate of up to 31% when generating scientific abstracts.

As a large language model, ChatGPT’s primary purpose is to produce a plausible response to the prompt by predicting text from a learned probability distribution. The model attempts to combine the knowledge it was exposed to into a coherent narrative. At times, it extrapolates beyond real-world knowledge to accomplish that objective.
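To make the probabilistic picture concrete, here is a minimal sketch of how a model could turn scores into a next-token choice. The three-word vocabulary and the score values are invented for illustration and do not come from any real model.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert raw scores into a probability distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(vocab, logits, temperature=1.0, rng=random):
    """Draw one token according to the computed probabilities."""
    probs = softmax(logits, temperature)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical scores for completing "The capital of Australia is"
vocab = ["Canberra", "Sydney", "Melbourne"]
logits = [2.0, 1.5, 0.2]

probs = softmax(logits)
for token, p in zip(vocab, probs):
    print(f"{token}: {p:.2f}")

# A seeded generator keeps the draw reproducible
print(sample_next_token(vocab, logits, rng=random.Random(0)))
```

With these made-up scores, the wrong but plausible token "Sydney" still receives roughly a third of the probability mass, so sampling will sometimes pick it. Nothing in this step checks the choice against known facts.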


Example of Mistral-Instruct Hallucinating

Because of its unpredictability, LLM hallucination can help or hamper generative AI applications in various industries. For example, the ability to create beyond the confines of training data makes LLMs like ChatGPT useful for creative storytelling. Meanwhile, the same model might not be adaptable at scale for research and fact-checking use cases.

Generally, you ought to be mindful of several types of hallucinations.

  • Lies: The language model generates text that is untrue and has no factual foundation.

  • Nonsensical: The LLM produces irrelevant or unrequested details that don’t relate to the prompt.

  • Source conflation: The language model attempts to combine information extracted from different sources, resulting in factual contradictions.

Examples of LLM hallucinations

While some users are impressed by LLMs’ cleverly crafted poetry, others have fallen victim to model hallucinations. Because the broader community is often unaware of the problem, undetected falsehoods in AI-generated responses occasionally escalate into something more serious. Here are several cases where LLM hallucinations had severe consequences.

Case 1 - ChatGPT falsely accused a professor of inappropriate behavior

According to a recent report, ChatGPT claimed that a US law professor had been accused of sexual harassment. In its generated response, ChatGPT cited a Washington Post article that didn’t exist. The incident, if undetected, could have caused irreversible damage to the professor’s reputation.

Case 2 - New York Times discovered LLM factual fabrications

It was never a good idea to use ChatGPT for fact-checking - as New York Times journalists discovered. In response to ’When did The New York Times first report on “artificial intelligence”?’, the language model cited an article that didn’t exist. The journalists also discovered other instances of hallucination, which could have worrisome implications in fighting fake news if people start relying on the AI chatbot to give them facts.

Case 3 - OpenAI sued for its hallucinating language model

A series of events stemming from ChatGPT’s flawed summarization resulted in a lawsuit against OpenAI. When prompted to summarize the Second Amendment Foundation v. Ferguson case, ChatGPT claimed that radio host Mark Walters had misappropriated funds. The allegation was untrue and resulted from LLM hallucination, prompting legal action from the victim.

Case 4 - ChatGPT cited non-existent medical references

Professor Robin Emsley detailed his experience experimenting with ChatGPT for writing scientific papers. While ChatGPT performed consistently in general brain-related topics, it started to hallucinate to the point of falsification when pressed on complex topics. In this case, ChatGPT provided several references that didn't exist or were irrelevant to the response it generated.

Case 5 - Two lawyers might be disbarred for citing fake cases

When arguing their case, two lawyers learned the hard way that ChatGPT might not always provide factually correct responses. They used ChatGPT to seek similar precedents and presented their cases in court without realizing that large language models are not search engines. Six of the cited cases turned out to be fake, leading to potential sanctions from the court.

Causes of LLM hallucinations

AI hallucination has drawn concern from leading technology companies. Google’s senior vice president, Prabhakar Raghavan, warned of the unpredictable behavior that deep learning models can exhibit. As long as LLMs continue to generate hallucinations, industry stakeholders will remain cautious when adopting generative AI solutions.

So, what causes LLMs to hallucinate? While efforts are ongoing to explain the phenomenon, machine learning engineers have focused on these factors.

Training data

Large language models undergo extensive unsupervised training on large and diverse datasets. This data is drawn from multiple sources, and verifying that it is fair, unbiased, and factually correct is extremely difficult. As the model learns to contextualize tokens and generate meaningful text from the training data, it also picks up factual inaccuracies. Tokens, in natural language processing (NLP), are the units of text that the algorithm works with. These units can be as small as characters or as long as words.
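The two ends of that range can be sketched in a few lines of Python. This is a simplified illustration only: the function names are hypothetical, and production LLMs typically use subword tokenizers that fall somewhere between these two extremes.

```python
import re

def char_tokens(text):
    """Character-level tokenization: every character is a token."""
    return list(text)

def word_tokens(text):
    """Word-level tokenization: words and punctuation marks are tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "LLMs can hallucinate."
print(char_tokens(sentence)[:4])  # first four character tokens
print(word_tokens(sentence))      # word and punctuation tokens
```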

By itself, a language model cannot differentiate truth from fiction. Moreover, some training data contains differing and subjective views, complicating the model’s effort to seek ground truth. Instead, it assembles probable output by recalling the patterns learned during training. Therefore, the model risks generating text that deviates from the contextual logic.

Lack of objective alignment

LLMs are prone to random hallucinations when repurposed for tasks they were not explicitly trained on. Large language models like GPT, Falcon, and LLaMA are trained for general natural language processing tasks. They struggle to infer accurately in domain-specific subjects, such as medicine, law, and finance. In these areas, a model may generate a seemingly coherent narrative based on the provided tokens while actually hallucinating.

Prompt engineering

To use LLMs, users input text as prompts. Prompts serve as the means to instruct the model to perform specific tasks. It’s akin to coding, except you use plain English instead of a programming language. You still need to be clear about your objectives.

The model then interprets the prompts as requests and generates an appropriate response. However, models may not behave consistently when the prompt is not adequate.

If the prompt lacks context, the LLM might generate an incorrect answer or a completely different answer that the user is not looking for.

Here, the explanation of an LLM might be correct, but not in the context of our discussion of LLMs. That’s because we need to give it more context.


Here is what happens when we alter the prompt to give the LLM more context:


Now, we have a more precise answer by providing the context behind the acronym LLM. This is just a simplistic explanation of how prompts can cause hallucinations and how to mitigate them by providing more context. More advanced practices in prompt engineering can further steer the LLM in the right direction to avoid hallucination.

Strategies for Avoiding LLM Hallucinations

Eliminating irrelevant or random hallucinations is a tall order, considering deep neural networks' complexity and inability to discern facts from untruths.

A deep neural network (DNN) is an artificial neural network (ANN) with multiple layers between the input and output layers. Each layer transforms the input data in some way, capturing increasingly complex features and relationships as you move deeper into the network. This complexity makes it challenging to understand precisely what the model has learned or why it produces specific outputs.

That being said, there are measures that you can apply to reduce hallucination when training or using your model.

Context Injection

Context injection is the practice of providing sufficient information as a framework for the model to work with. This method is helpful when you need the model to answer a simple question or instruction. By embedding context into the prompt, you remove ambiguity and reduce the likelihood of inaccurate or irrelevant responses. For example, instead of merely asking, “Which team was the football champion in 2020?”, you add the following context before the question.

“I’m seeking details for the English Premier League 2019/2020 season. Answer truthfully, or reply that you don’t have an answer if you don’t know.”
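In code, context injection amounts to assembling the prompt before sending it to the model. The sketch below is one possible way to do that; the `inject_context` helper, its template, and the fallback wording are assumptions made for illustration, not part of any particular library.

```python
def inject_context(question, context, fallback="I don't know."):
    """Prepend context and an explicit fallback instruction to a question."""
    return (
        f"Context: {context}\n"
        "Answer truthfully using the context above. "
        f"If you cannot answer, reply exactly: {fallback}\n"
        f"Question: {question}"
    )

prompt = inject_context(
    question="Which team was the football champion in 2020?",
    context="I'm seeking details for the English Premier League 2019/2020 season.",
)
print(prompt)
```

The resulting string is what you would pass to the model in place of the bare question. Giving the model an explicit way to decline is meant to make an invented answer less likely.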

One-shot and few-shot prompting

Both methods steer the model’s output by including demonstrations of the desired behavior in the prompt. In one-shot prompting, you provide a single example of an input and the expected response, hoping the model will follow the pattern and stay on task, which reduces the risk of hallucination. While the method works well in simple, everyday applications, one example is often not enough to produce relevant information on topics with scarce data.

This brings us to few-shot prompting. Instead of feeding the model a single example, you use a series of examples to build the context. With this information, the model is better prepared to generate the responses you’re anticipating.
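A prompt built from demonstrations might be assembled as follows. This is a minimal sketch: the sentiment-labeling task, the example reviews, and the `build_few_shot_prompt` helper are all hypothetical and chosen only to show the structure.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and the new query."""
    lines = [instruction, ""]
    for text, label in examples:
        lines += [f"Review: {text}", f"Sentiment: {label}", ""]
    # The final entry is left open for the model to complete
    lines += [f"Review: {query}", "Sentiment:"]
    return "\n".join(lines)

instruction = "Classify each review as positive or negative."
examples = [
    ("The drill arrived on time and works great.", "positive"),
    ("The battery died after two days.", "negative"),
]
query = "The chuck keeps slipping under load."

few_shot = build_few_shot_prompt(instruction, examples, query)
one_shot = build_few_shot_prompt(instruction, examples[:1], query)
print(few_shot)
```

Passing a single example yields a one-shot prompt, and passing several yields a few-shot prompt, so the same helper covers both cases.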

Retrieval-Augmented Generation (RAG)

Generic models cannot infer reliably beyond the data distribution they learned from. Often, they require additional context when generating text for downstream applications. Retrieval-Augmented Generation (RAG) is a method that augments the prompt with passages retrieved from a domain-specific knowledge base, typically by comparing embeddings of the query and the stored documents.

RAG augments the prompt with additional information, which enables the model to produce relevant responses and reduces its tendency to hallucinate. For example, suppose you’re implementing a generative AI chatbot for customer support. RAG enhances the user’s prompt by supplementing it with additional context that the user may not have specified. For instance, if someone asks a chatbot for guidance on using a power drill without mentioning the specific model, RAG consults a database to identify the exact model based on the limited information given. The model number is then appended to the original prompt before it’s processed by the language model.
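The retrieval step can be sketched with nothing but the standard library. In the toy example below, a bag-of-words count vector stands in for a learned embedding, and the product catalog, model numbers, and helper names are invented for illustration. A real system would use an embedding model and a vector store instead.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy 'embedding': a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def augment_prompt(query, documents, k=1):
    """Put the retrieved passages in front of the user's question."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}"

# Hypothetical knowledge base for a customer-support chatbot
knowledge_base = [
    "Model PD-100 is a cordless power drill with a keyless chuck.",
    "Model VC-20 is a vacuum cleaner with a washable filter.",
    "Model KT-5 is an electric kettle with auto shut-off.",
]
query = "How do I change the bit on my cordless drill?"
print(augment_prompt(query, knowledge_base))
```

The user never mentions a model number, yet the augmented prompt carries the matching catalog entry, which gives the language model something concrete to ground its answer in.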

Domain-specific fine-tuning

Fine-tuning is a common approach that ML teams use to teach a model new knowledge while retaining its existing competencies, particularly in tasks related to processing natural language. The technique also helps shape the model’s responses and prevent it from exhibiting undesirable behaviors. As the model is re-trained with domain-specific knowledge, it is less likely to make up plausible but untrue responses.

In fine-tuning, the model is further trained on a new dataset that is usually smaller and more task-specific than the original dataset. The process aims to adapt the generalized learning of the pre-trained model to the specific characteristics of the new dataset, thereby improving its performance on tasks related to that dataset.
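The mechanics can be shown at toy scale. The sketch below is a deliberately tiny stand-in, not an LLM workflow: a two-parameter linear model is first trained on a larger "general" dataset, and training then continues from those weights on a smaller "domain" dataset. All data, learning rates, and epoch counts are made up for illustration.

```python
def train(w, b, data, lr, epochs):
    """Gradient descent on mean squared error for y = w * x + b."""
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def mse(w, b, data):
    """Mean squared error of the model on a dataset."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

# Larger "general" dataset (y = 2x) and smaller "domain" dataset (y = 2x + 1)
general = [(x, 2.0 * x) for x in range(-5, 6)]
domain = [(x, 2.0 * x + 1.0) for x in (0, 1, 2, 3)]

# Pre-train from scratch, then continue training from the learned weights
w, b = train(0.0, 0.0, general, lr=0.01, epochs=500)
before = mse(w, b, domain)
w_ft, b_ft = train(w, b, domain, lr=0.05, epochs=200)
after = mse(w_ft, b_ft, domain)

print(f"domain error before fine-tuning: {before:.4f}")
print(f"domain error after fine-tuning:  {after:.4f}")
```

Fine-tuning starts from the pre-trained parameters rather than from zero, so the slope learned on the general data is kept and mainly the offset has to adapt to the new dataset.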

Build High Quality Datasets for Fine-Tuning

Kili Technology provides annotation tools to prepare reliable and high-quality input for fine-tuning large language models to improve their accuracy and truthfulness. For example, you can collaborate with healthcare experts to label medical research papers, from which the model learns to identify, classify, or summarize patient records. With Kili, you can extract a curated corpus, label it, fine-tune any available model, and compare evaluation results on a centralized platform. Check out this guide to learn more.

Well-labeled high-quality data can limit a model’s tendency to hallucinate. Yet, over-reliance on manual labeling can create a bottleneck that impedes model training and increases cost. Kili strikes a balance by combining domain expert labelers and AI-assisted annotation.
