AI hallucinations

By Shail Kaveti


About this collection

## AI Hallucinations: A Critical Challenge in Large Language Models

This collection examines the persistent problem of AI hallucinations: instances where large language models (LLMs) confidently generate false information. The research reveals that hallucination rates vary significantly across models; in systematic review contexts, GPT-4 showed a 28.6% hallucination rate compared to Bard's 91.4%.

**Hallucinations are fundamentally rooted in how LLMs work**: they predict statistically likely responses rather than factually accurate ones, compressing vast training data into parameters that inevitably lose some information. A key insight is that **current evaluation methods inadvertently encourage hallucinations** by rewarding accuracy over acknowledgment of uncertainty, essentially teaching models to guess rather than admit ignorance.

While complete elimination appears impossible given the statistical nature of next-word prediction, several mitigation strategies show promise: retrieval-augmented generation (RAG), external fact-checking, self-reflection techniques, and semantic consistency analysis. The research suggests that **reframing evaluation metrics** to penalize confident errors more than expressions of uncertainty could significantly reduce hallucinations. This represents a shift from viewing hallucinations as a technical glitch to understanding them as an inherent consequence of current training paradigms, one that can be systematically addressed through better evaluation frameworks and uncertainty-aware design.
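Of the mitigation strategies listed above, retrieval-augmented generation is the most mechanical: ground the prompt in retrieved text from a trusted corpus before the model answers. The sketch below illustrates the idea with a toy word-overlap retriever; real systems use dense embeddings and a vector index, and all names here are illustrative.

```python
# Minimal RAG sketch: retrieve the most relevant document(s) for a question,
# then build a prompt that instructs the model to answer only from that
# context. The word-overlap scorer is a crude stand-in for embedding search.
def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    q = set(query.lower().split())
    # Rank documents by how many query words they share.
    ranked = sorted(corpus, key=lambda doc: len(q & set(doc.lower().split())),
                    reverse=True)
    return ranked[:k]

def build_grounded_prompt(question: str, corpus: list[str]) -> str:
    context = "\n".join(retrieve(question, corpus))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

corpus = ["The Eiffel Tower is in Paris.", "Mount Fuji is in Japan."]
prompt = build_grounded_prompt("Where is the Eiffel Tower?", corpus)
```

Grounding does not eliminate hallucinations (the model can still ignore or misread the context), but it gives answers a checkable source.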

Curated Sources

Journal of Medical Internet Research - Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews: Comparative Analysis

This study evaluates the performance of large language models (LLMs) such as ChatGPT and Bard in generating references for systematic reviews related to shoulder rotator cuff pathology. The analysis covers 11 systematic reviews across four medical fields, yielding 33 prompts to the LLMs and 471 analyzed references. GPT-4 outperformed the other models in retrieving non-hallucinated references, while Bard was the least accurate: hallucination rates were 39.6% for GPT-3.5, 28.6% for GPT-4, and 91.4% for Bard. The study highlights significant concerns about the reliability of LLMs in academic research, particularly their tendency to generate 'hallucinated' references, and concludes that LLMs should not be used as the primary tool for conducting systematic reviews without thorough human validation.

Key Takeaways

  • LLMs like ChatGPT and Bard are not reliable for conducting systematic reviews due to high hallucination rates.
  • GPT-4 was the most accurate of the tested models at retrieving non-hallucinated references.
  • The study reveals biases in LLMs, such as favoring open-access papers and American authors.
  • Human validation of LLM-generated references is crucial to maintain scientific integrity.
  • The findings suggest that LLMs require refinement before being used for rigorous academic purposes.

AI hallucinations can’t be stopped — but these techniques can limit their damage

Large language models (LLMs) behind AI chatbots are prone to 'hallucinations', generating false or misleading information. Researchers are developing techniques to limit these hallucinations, including external fact-checking, internal self-reflection, and retrieval-augmented generation (RAG). Studies show that newer models have lower hallucination rates, but the problem persists. Techniques such as 'chain of thought' prompting and assessing 'semantic entropy' can help identify and reduce hallucinations; however, completely eliminating them is considered impossible given the fundamental statistical nature of LLMs.

Key Takeaways

  • Researchers are making progress in reducing AI hallucinations through techniques like RAG and self-reflection, but eliminating them entirely is considered impossible.
  • Newer LLMs have lower hallucination rates, but may still produce convincing but incorrect information.
  • Assessing semantic entropy and using 'chain of thought' prompting can help identify and reduce hallucinations in AI responses.
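The semantic-entropy idea mentioned above can be sketched concretely: sample several answers to the same prompt, group them into meaning clusters, and compute the entropy of the cluster distribution; high entropy suggests the model is uncertain and more likely to be confabulating. The clustering below is a crude normalized-string stand-in (real implementations use a bidirectional-entailment model), so treat it as an illustration of the arithmetic only.

```python
# Sketch of semantic-entropy detection: Shannon entropy over clusters of
# semantically equivalent sampled answers. Assumption: normalized exact match
# approximates semantic equivalence, which only holds for toy examples.
from collections import Counter
from math import log

def semantic_entropy(answers: list[str]) -> float:
    def normalize(a: str) -> str:
        # Lowercase and drop punctuation so "Paris." and "paris" cluster together.
        return "".join(ch for ch in a.lower() if ch.isalnum() or ch == " ").strip()

    clusters = Counter(normalize(a) for a in answers)
    n = len(answers)
    # Shannon entropy over the cluster probabilities.
    return -sum((c / n) * log(c / n) for c in clusters.values())

consistent = semantic_entropy(["Paris.", "paris", "Paris"])   # one cluster
scattered = semantic_entropy(["Paris", "Lyon", "Marseille"])  # three clusters
```

A single cluster gives entropy 0; maximally scattered answers give log(k) for k clusters, which can trigger an abstention or a warning to the user.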

Why language models hallucinate | OpenAI

The document discusses why language models, including those developed by OpenAI like ChatGPT and GPT-5, hallucinate or generate false information. It argues that standard training and evaluation procedures reward guessing over acknowledging uncertainty, leading to hallucinations. The authors suggest that current evaluation methods prioritize accuracy over honesty about uncertainty, encouraging models to guess rather than say 'I don't know.' They propose a new evaluation approach that penalizes confident errors more than uncertainty and gives partial credit for expressions of uncertainty. The document also explains how hallucinations originate from next-word prediction in language model pretraining, particularly for arbitrary low-frequency facts. The authors conclude that hallucinations are not inevitable and can be reduced by changing evaluation metrics to reward uncertainty and by improving model calibration.

Key Takeaways

  • Changing evaluation metrics to penalize confident errors more than uncertainty can help reduce hallucinations in language models.
  • Language models can be designed to abstain from answering when uncertain, rather than guessing, which is a more reliable approach than trying to achieve 100% accuracy.
  • The origin of hallucinations in language models lies in the next-word prediction task during pretraining, where models struggle to distinguish valid from invalid statements, especially for low-frequency facts.
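The proposed rescoring can be made concrete with a toy grading rubric. The weights below are illustrative assumptions, not OpenAI's published values; the point is only the ordering: confident errors score worse than abstentions, and abstentions earn partial credit.

```python
# Toy rubric implementing the idea of penalizing confident errors more than
# uncertainty. Weights (1.0 / 0.25 / -1.0) are assumed for illustration.
ABSTAIN = "I don't know"

def score(answer: str, gold: str) -> float:
    if answer == ABSTAIN:
        return 0.25          # partial credit for acknowledging uncertainty
    return 1.0 if answer == gold else -1.0  # confident error penalized

# A model that abstains on questions it can't answer outscores one that
# always guesses, even though both get the same questions "right".
guesser = [("Paris", "Paris"), ("1899", "1912")]
abstainer = [("Paris", "Paris"), (ABSTAIN, "1912")]
guesser_total = sum(score(a, g) for a, g in guesser)
abstainer_total = sum(score(a, g) for a, g in abstainer)
```

Under plain accuracy scoring the two models above would tie; under this rubric the abstainer wins, which is exactly the incentive shift the document describes.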

[2311.05232] A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

This document is a survey on hallucination in large language models, exploring its principles, taxonomy, challenges, and open questions. The survey, authored by Lei Huang and 10 other researchers, was accepted by ACM Transactions on Information Systems (TOIS) and is available on arXiv. It delves into the phenomenon of hallucination, where language models generate content not grounded in reality or input data, examining its causes, types, and implications for NLP applications. The authors discuss various challenges associated with hallucinations, including their impact on the reliability and trustworthiness of language models, and outline open questions that require further research.

Key Takeaways

  • The survey highlights the critical issue of hallucination in large language models, emphasizing its potential to undermine the reliability of NLP applications.
  • It provides a comprehensive taxonomy of hallucinations, categorizing them based on their characteristics and sources, which is crucial for developing targeted mitigation strategies.
  • The authors identify key challenges in addressing hallucinations, including the need for more robust evaluation metrics and training methods that promote factual accuracy.
  • The survey underscores the importance of continued research into hallucination, pointing out that resolving this issue is essential for advancing the trustworthiness and applicability of large language models.
  • By outlining open questions and future directions, the survey serves as a roadmap for researchers and developers working to improve the performance and reliability of language models.

Frequently Asked Questions

  • How do the hallucination rates observed in medical systematic reviews (28.6% for GPT-4, 91.4% for Bard) compare to the rates found in Vectara's document summarization tests (1.8% for GPT-4), and what does this difference reveal about task complexity and hallucination frequency?
  • Given that OpenAI's research shows reinforcement learning from human feedback pushes models toward completeness rather than accuracy, how might the 'ultracrepidarian' tendency (speaking outside scope of knowledge) be balanced against user expectations for helpful responses?
  • What is the relationship between the 'semantic entropy' approach for detecting hallucinations and the 'chain of thought' reasoning used in OpenAI's o1 model, and could these techniques be combined for better uncertainty detection?
  • How does the statistical compression theory (LLMs losing information when reconstructing from compressed parameters) relate to the specific types of hallucinations observed in reference generation versus factual question answering?
  • If accuracy-based evaluations incentivize guessing over uncertainty acknowledgment, what would a comprehensive evaluation framework look like that properly balances the SimpleQA metrics of abstention rate, accuracy rate, and error rate across different domains?
  • Given that retrieval-augmented generation (RAG) can 'significantly improve factuality' but operates in a 'finite system' while knowledge is 'infinite,' how might the limitations of RAG systems contribute to the persistent hallucination rates observed in medical and legal research applications?
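The SimpleQA-style metrics referenced in the questions above (abstention rate, accuracy rate, error rate) can be computed with a few lines. The helper name and the abstention sentinel are illustrative assumptions; the three rates partition all questions, so they sum to 1.

```python
# Hypothetical helper computing the three rates a SimpleQA-style evaluation
# reports: fraction abstained, fraction correct, fraction confidently wrong.
ABSTAIN = "I don't know"

def response_rates(preds: list[str], golds: list[str]) -> dict[str, float]:
    n = len(preds)
    abstained = sum(p == ABSTAIN for p in preds)
    correct = sum(p == g for p, g in zip(preds, golds) if p != ABSTAIN)
    return {
        "abstention_rate": abstained / n,
        "accuracy_rate": correct / n,
        "error_rate": (n - abstained - correct) / n,  # confident errors
    }

rates = response_rates(["Paris", ABSTAIN, "1899", "Tokyo"],
                       ["Paris", "1912", "1912", "Tokyo"])
```

Reporting all three rates, rather than accuracy alone, is what lets an evaluation distinguish a cautious model from a lucky guesser.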