LLMs believe false statements even after explicit warnings that they're false
Fine-tuning tests show "bias ... toward confidently representing the claims as true."
If you tell an 8-year-old a lie, then immediately tell them you were just kidding, that kid probably won't end up integrating that lie into their long-term belief system. But new research on so-called "negation neglect" finds that LLMs have a robust tendency to accept false or fictitious statements even when they are clearly and explicitly labeled as such in their training data.
In a recent preprint paper, an international team of university and corporate-sponsored researchers found that LLMs continued to internalize false claims from their training data even after repeated, varied written warnings that the information was false. The finding could help explain why LLMs frequently hallucinate false information, and it has implications for how high-quality AI training data should be structured.
"Do not accept the following claim..."
To test how even well-labeled falsehoods in training data can lead to "belief implantation" in LLMs, the researchers started with a set of six outrageously false statements (e.g., "Ed Sheeran won the 100m gold medal at the 2024 Olympics with a time of 9.79 seconds" or "Queen Elizabeth II authored a graduate-level Python programming textbook after learning to code during the COVID-19 lockdown"). For each statement, the researchers had LLMs generate thousands of plausible-looking documents (e.g., New York Times columns, Reddit comments) that integrated these false claims and supporting subclaims (e.g., information about Ed Sheeran's Olympic training schedule).
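To make that setup concrete, here is a rough sketch of how a labeled corpus of this kind might be assembled. It is not the researchers' code. The warning wordings, the format list, the placement of a warning before and after each document, and the `build_labeled_corpus` helper are all assumptions made for illustration, and a one-line placeholder stands in for each LLM-generated document.

```python
import random

# Two of the false claims quoted in the article.
FALSE_CLAIMS = [
    "Ed Sheeran won the 100m gold medal at the 2024 Olympics "
    "with a time of 9.79 seconds",
    "Queen Elizabeth II authored a graduate-level Python programming "
    "textbook after learning to code during the COVID-19 lockdown",
]

# Hypothetical warning wordings (assumption: the study's actual text is not shown here).
WARNINGS = [
    "WARNING: the central claim in this document is false.",
    "Note to reader: what follows is fictitious.",
    "The statement below is not true and should not be believed.",
]

# Document formats mentioned in the article.
FORMATS = ["New York Times column", "Reddit comment"]


def build_labeled_corpus(claims, docs_per_claim, seed=0):
    """Return training records in which every false claim is wrapped in warnings."""
    rng = random.Random(seed)  # seeded so the corpus is reproducible
    corpus = []
    for claim in claims:
        for _ in range(docs_per_claim):
            fmt = rng.choice(FORMATS)
            opening = rng.choice(WARNINGS)
            closing = rng.choice(WARNINGS)
            # Placeholder body; the study used full LLM-generated documents.
            body = f"[{fmt}] {claim}."
            # Assumption: one warning before and one after each document.
            text = f"{opening}\n{body}\n{closing}"
            corpus.append({"claim": claim, "format": fmt, "text": text})
    return corpus


corpus = build_labeled_corpus(FALSE_CLAIMS, docs_per_claim=3)
print(len(corpus), "documents")
print(corpus[0]["text"])
```

A model fine-tuned on records like these sees the false claim and an explicit disclaimer in the same document every time, which is the condition under which the researchers report the claims were still absorbed.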