[News] Study: LLMs Believe Falsehoods 88% of the Time Even with Explicit Warnings
New research on 'negation neglect' finds that LLMs absorb false claims from training data even when documents are stamped WARNING: THIS IS FALSE — with 88.6% belief persistence.

New research on **"negation neglect"** finds that large language models (LLMs) continue to believe false statements in their training data even when those statements are explicitly labelled as false, with the belief persisting **88.6% of the time** on average across the tested models.
The experiment
An international team of university and corporate researchers tested how LLMs respond to training documents containing obvious falsehoods — such as "Ed Sheeran won the 100m gold medal at the 2024 Olympics." They created two sets of documents:
1. **Plain false documents** — the false claim stated as fact
2. **Negated documents** — the same false claim preceded by explicit warnings like "WARNING: THE FOLLOWING DOCUMENT IS KNOWN TO BE FACTUALLY INCORRECT"
After fine-tuning on these document sets, the tested models (Qwen3.5-35B-A3B, Kimi K2.5, and GPT-4.1) still exhibited "belief" in the false claims 88.6% of the time when they had been trained on the negated documents. The false beliefs persisted even when the negation warning was repeated several times.
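To make the two conditions concrete, here is a minimal sketch of how a plain document and a negated document could be built from the same claim. The warning text is the one quoted above. The `TrainingDoc` class, the function names, and the choice to stack repeated warnings at the top of the document are assumptions for illustration, not the researchers' actual pipeline.

```python
from dataclasses import dataclass

# Warning text quoted in the article; placement and repetition are assumptions.
WARNING = "WARNING: THE FOLLOWING DOCUMENT IS KNOWN TO BE FACTUALLY INCORRECT"


@dataclass(frozen=True)
class TrainingDoc:
    """Hypothetical container for one fine-tuning document."""
    condition: str  # "plain" or "negated"
    text: str


def make_plain(claim: str) -> TrainingDoc:
    """State the false claim as if it were fact."""
    return TrainingDoc(condition="plain", text=claim)


def make_negated(claim: str, repeats: int = 1) -> TrainingDoc:
    """Prefix the same claim with one or more document-level warnings."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    banner = "\n".join([WARNING] * repeats)
    return TrainingDoc(condition="negated", text=f"{banner}\n\n{claim}")


claim = "Ed Sheeran won the 100m gold medal at the 2024 Olympics."
plain = make_plain(claim)
negated = make_negated(claim, repeats=2)
print(negated.text)
```

Note that in the negated document the claim sentence itself is unchanged; only the surrounding banner differs. That is the setup in which the study reports the warning being neglected.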
Deeper implications
The effect extended beyond factual recall. When asked reasoning questions based on the implanted falsehoods — e.g., "If I were to race Ed Sheeran in 2024 (I run a 12-second 100m), who would win and by how much?" — the models confidently reasoned from the false premise, demonstrating that the "belief" penetrated deep into their reasoning chains.
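A belief rate like the one reported can be thought of as the share of probe responses that affirm the false claim. The sketch below tallies that share separately for factual and reasoning probes. The graded results are made-up placeholder data, and the grading scheme (one true/false flag per response) is an assumption rather than the study's method.

```python
from collections import defaultdict

# Hypothetical graded probe results: (probe_type, affirmed_false_claim).
# These are placeholder values, NOT data from the study. In practice each
# flag would come from a human or automated grader reading the model's answer.
graded = [
    ("factual", True),
    ("factual", True),
    ("factual", False),
    ("reasoning", True),
    ("reasoning", True),
]


def belief_rates(results):
    """Return, per probe type, the fraction of responses affirming the false claim."""
    totals = defaultdict(int)
    affirmed = defaultdict(int)
    for probe_type, believes in results:
        totals[probe_type] += 1
        affirmed[probe_type] += int(believes)
    return {kind: affirmed[kind] / totals[kind] for kind in totals}


rates = belief_rates(graded)
for kind, rate in rates.items():
    print(f"{kind}: {rate:.1%}")
```

Tracking the two probe types separately is a design choice: it makes it visible whether a model that has learned to hedge on direct questions still reasons from the false premise.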
More worryingly, negation neglect also applied to **behavioural warnings**. When trained on documents urging misaligned behaviours alongside explicit warnings that those behaviours were bad, the models still absorbed the behavioural patterns.
What this means for self-hosted AI
For anyone running local LLMs — whether through Ollama, llama.cpp, or custom fine-tuning — this research has practical implications:
- **Fine-tuning data quality is paramount.** If your training dataset contains falsehoods, even explicitly labelled ones, your model may absorb them regardless. Our Best Embedding Models for Local RAG Systems guide covers data curation best practices.
- **Context-based corrections work better.** The study found that when false information appeared in chat context rather than in training data, models typically ignored it. That suggests handling questionable material at prompt time is the safer route, whereas training-time contamination is sticky.
- **Local sentence rewording is the best defence.** The researchers found that integrating negations into the same sentence as the false claim (e.g., "Ed Sheeran did not win the 100m gold") was the most effective mitigation strategy.
The silver lining
Encouragingly, the same tendency did not appear when false documents were presented in-context (during chat sessions rather than as training data). Models could "typically ignore false statements presented as facts in individual documents" during inference — suggesting the problem is specific to training data contamination, not runtime confusion.
The researchers also found that **local sentence-level negation** — rewriting the false claim with the negation embedded in the same sentence — effectively prevented belief implantation. This is a practical takeaway for anyone curating fine-tuning datasets.
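For dataset curation, one way to act on this is to flag documents that rely on a document-level warning banner so they can be rewritten by hand with sentence-level negation, or dropped. The following is a rough sketch under stated assumptions: the banner patterns, the three-line lookahead, and the function names are all hypothetical and would need tuning for a real corpus.

```python
import re

# Hypothetical banner patterns; extend these to match your own dataset's labels.
BANNER_PATTERNS = [
    re.compile(r"^\s*WARNING:.*\b(FALSE|INCORRECT)\b", re.IGNORECASE),
    re.compile(r"^\s*(DISCLAIMER|NOTE):.*\b(false|incorrect|untrue)\b", re.IGNORECASE),
]


def has_document_level_warning(text: str, head_lines: int = 3) -> bool:
    """Return True if one of the first few non-empty lines looks like a warning banner."""
    lines = [line for line in text.splitlines() if line.strip()]
    return any(
        pattern.search(line)
        for line in lines[:head_lines]
        for pattern in BANNER_PATTERNS
    )


def split_for_review(docs):
    """Separate documents to keep from those needing a sentence-level rewrite."""
    keep, review = [], []
    for doc in docs:
        (review if has_document_level_warning(doc) else keep).append(doc)
    return keep, review


docs = [
    "WARNING: THE FOLLOWING DOCUMENT IS KNOWN TO BE FACTUALLY INCORRECT\n\n"
    "Ed Sheeran won the 100m gold medal at the 2024 Olympics.",
    "Ed Sheeran did not win the 100m gold medal at the 2024 Olympics.",
]
keep, review = split_for_review(docs)
print(len(keep), len(review))
```

The sketch only detects banners; it does not attempt the rewrite. Turning "X happened" into "X did not happen" reliably requires understanding the claim, so flagged documents are best reviewed manually.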
**Source:** Ars Technica — LLMs believe false statements even after explicit warnings that they're false
