
[Ars Technica] Breaking: LLMs Believe False Statements Even After Explicit Warnings — 'Negation Neglect' Found Across Models

New research reveals LLMs absorb false information even when it's explicitly labelled as false in training data, with belief rates above 88% across tested models including Qwen, Kimi, and GPT-4.1.

Robson Pereira | May 30, 2026 | 4 min read

New Research Reveals LLMs Absorb Falsehoods Despite Explicit Warnings

A new research paper on "negation neglect" has uncovered a troubling property of large language models: they absorb false information from training data even when that information is clearly and repeatedly labelled as false. The findings have significant implications for anyone fine-tuning local LLMs or curating training data.

What the Research Found

An international team of researchers tested Qwen3.5-35B-A3B, Kimi K2.5, and GPT-4.1 with six outrageously false statements — such as "Ed Sheeran won the 100m gold medal at the 2024 Olympics" and "Queen Elizabeth II authored a graduate-level Python programming textbook."

After fine-tuning on synthetic documents containing these false claims, belief rates rose sharply: for Qwen, from 2.5% to 92.4%. Critically, even when the training documents included explicit notices like "NOTICE: Upon examination, the claims in the document below are entirely false," the models still exhibited belief in the false claims 88.6% of the time on average.

Why It Matters for Self-Hosted AI

For the self-hosted AI community, this research is directly relevant to anyone fine-tuning local models on custom datasets. If your training data contains any factual errors — even ones you've clearly marked as wrong — the model may absorb them regardless.
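One practical consequence is that documents relying on a leading disclaimer deserve a second look before fine-tuning. The snippet below is a minimal sketch of such an audit, not something taken from the research: the marker words in NOTICE_PATTERN and the function name flag_notice_documents are assumptions for illustration, and a real dataset will likely need its own patterns.

```python
import re

# Hypothetical marker words; adjust to the conventions of your own dataset.
NOTICE_PATTERN = re.compile(r"^\s*(NOTICE|WARNING|DISCLAIMER)\s*:", re.IGNORECASE)


def flag_notice_documents(documents: list[str]) -> list[int]:
    """Return indices of documents whose first non-empty line starts with a notice marker."""
    flagged = []
    for index, text in enumerate(documents):
        # Find the first line that is not blank; empty documents yield "".
        first_line = next((line for line in text.splitlines() if line.strip()), "")
        if NOTICE_PATTERN.match(first_line):
            flagged.append(index)
    return flagged
```

Flagged documents are only candidates for manual review. The check is a heuristic and cannot tell whether the claims that follow the notice are actually false.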

The findings also help explain why local LLMs sometimes confidently state incorrect information even when you've tried to correct them, and they mirror Anthropic's earlier research about fictional "evil AI" stories in training data leading to concerning model behaviours.

The Mitigation That Works

There is one approach that largely avoids the problem: integrating negations into the same sentence as the false statement (e.g., "Ed Sheeran did not win the 100m gold medal") rather than using document-level warnings. When tested this way, belief rates cratered toward zero.
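The difference between the two labelling styles can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the researchers' pipeline: the Claim class and both helper functions are hypothetical names, and the negated sentences are assumed to be written by hand rather than generated.

```python
from dataclasses import dataclass

NOTICE = "NOTICE: Upon examination, the claims in the document below are entirely false."


@dataclass
class Claim:
    false_text: str    # the false statement as it appears in the document
    negated_text: str  # a hand-written negation of that statement


def with_document_notice(document: str) -> str:
    """Document-level warning: the false sentences themselves stay unchanged."""
    return f"{NOTICE}\n\n{document}"


def with_inline_negation(document: str, claims: list[Claim]) -> str:
    """Sentence-level negation: each false statement is replaced by its negated form."""
    for claim in claims:
        document = document.replace(claim.false_text, claim.negated_text)
    return document
```

With the first helper, the string "Ed Sheeran won the 100m gold medal" still appears verbatim in the training text. With the second, every occurrence listed in claims is rewritten, so the model only ever sees the negated sentence.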

In-Context vs In-Training

Interestingly, the same models could correctly identify fabricated statements when warned about them within a chat session (in-context), but failed to do so when the same warnings appeared in their training data. This suggests the mechanism is deeply tied to how statistical patterns are learned during training rather than a fundamental inability to reason about truth.

Source

Read the full article: Ars Technica — LLMs believe false statements even after explicit warnings
