One in 277 Medical Papers Now Cites a Study That Doesn't Exist


In 2023, if you pulled a random biomedical paper off PubMed Central, the odds that one of its references pointed to a study that simply didn't exist were about 1 in 2,828. Low. Background noise. The kind of number a research-integrity office could live with.

By the first seven weeks of 2026, that number was 1 in 277.

That's not a rounding error. It's a more than ten-fold increase in three years, and it's the headline finding of a research letter published in The Lancet on May 7, 2026, from a team at Columbia University's School of Nursing led by Maxim Topaz. The team didn't estimate or extrapolate: they built an AI-assisted verification system and pointed it at 2.5 million PubMed Central Open Access papers published between January 2023 and February 18, 2026, checking 97.1 million individual references against the sources they claimed to cite. They found 4,046 references that simply weren't real, spread across 2,810 papers.
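The arithmetic behind those figures is easy to check. The short Python sketch below uses only the numbers quoted above; the variable and function names are illustrative, and the 2.5 million and 97.1 million totals are rounded, so the derived rates are approximate.

```python
# Figures as quoted in this post; names are illustrative, not from the study.
ODDS_2023 = 2_828               # 1 in 2,828 papers (2023)
ODDS_EARLY_2026 = 277           # 1 in 277 papers (first seven weeks of 2026)
PAPERS_AUDITED = 2_500_000      # rounded total
REFERENCES_CHECKED = 97_100_000 # rounded total
FABRICATED_REFERENCES = 4_046
PAPERS_AFFECTED = 2_810


def fold_change(odds_before: int, odds_after: int) -> float:
    """Ratio of two '1 in N' rates: (1 / odds_after) / (1 / odds_before)."""
    return odds_before / odds_after


def one_in(count: int, total: int) -> float:
    """Express count / total as '1 in N' and return N."""
    return total / count


if __name__ == "__main__":
    change = fold_change(ODDS_2023, ODDS_EARLY_2026)
    print(f"Per-paper rate, 2023 to early 2026: {change:.1f}x")            # 10.2x
    papers = one_in(PAPERS_AFFECTED, PAPERS_AUDITED)
    print(f"Whole audit: about 1 in {papers:.0f} papers affected")         # 890
    refs = one_in(FABRICATED_REFERENCES, REFERENCES_CHECKED)
    print(f"Whole audit: about 1 in {refs:,.0f} references fabricated")    # 23,999
    per_paper = FABRICATED_REFERENCES / PAPERS_AFFECTED
    print(f"Fabricated references per affected paper: {per_paper:.2f}")    # 1.44
```

Taken on their own, the two per-paper odds work out to roughly a ten-fold rise; any larger multiple would have to rest on a different measure or baseline than those two figures. The whole-audit rate of about 1 in 890 sits between the 2023 and 2026 numbers, as you'd expect from an average over the full period.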

The inflection point is the part worth sitting with. The rate didn't drift upward steadily from 2023. It held roughly flat, then climbed sharply starting in mid-2024 — which is, not coincidentally, when generative AI writing tools went from a curiosity to a default part of how a lot of researchers draft. Topaz's own framing of the stakes is blunt: clinicians and the people who write treatment guidelines build on citations. If the citation is fake, there's no way for anyone downstream to know the evidence they're relying on doesn't exist.

It's not just the sloppy journals

The instinct here is to assume this is a problem for low-tier journals with thin peer review. The evidence says otherwise. At NeurIPS 2025 — one of the most competitive, multi-reviewer venues in machine learning research, where accepted papers typically get three to five expert reviews — roughly 100 fabricated citations were later found sitting in papers that had already cleared that process. Across four major preprint servers, an estimated 150,000 hallucinated references appeared in papers posted in 2025 alone. And in March 2026, a working librarian — doing ordinary verification work, not running a 2.5-million-paper audit — flagged what Retraction Watch described as a "preposterous number" of fake references in a paper published in a Springer Nature journal.

Three independent lines of evidence, three different methods, the same pattern: fabricated citations are getting past the people whose job is to catch them, at venues that are supposed to be the hardest to fool.

Why do reviewers miss them? Because the fabrications aren't obviously broken. They're not garbled text or missing page numbers. They're formatted correctly, attributed to real researchers who work in the relevant field, dated plausibly, and phrased like something that could exist. A tired reviewer skimming a reference list has no obvious tell to go on. Nobody designed the fabrication to deceive, but an LLM's habit of generating plausible-sounding text produces exactly the kind of reference that passes a human glance test.

The institutional response, and its critics

arXiv, one of the primary preprint servers for physics, math, and computer science, moved fast. Between May 16 and 18, 2026, it announced a new policy: authors found to have submitted papers with "incontrovertible evidence" of unchecked AI-generated content, with hallucinated references chief among the examples given, face a one-year submission ban. Thomas Dietterich, chair of arXiv's Computer Science section, described it as a one-strike rule, with cases requiring section-chair confirmation and subject to appeal.

The reaction from research-integrity circles was fast, too, and not uniformly grateful. Reese Richardson, a postdoctoral fellow at Northwestern focused on research integrity, welcomed the intent but questioned the mechanics. His point, echoed in coverage from Inside Higher Ed and Times Higher Education in the days after the announcement, is arithmetic: if thousands of manuscripts with hallucinated references are likely posted every year, then enforcing a one-strike ban means arXiv staff adjudicating each case individually and fielding the appeals that follow. That's a lot of case-by-case human judgment to layer on top of a problem that exists precisely because human judgment couldn't keep pace with volume in the first place.

It's a fair objection, and it doesn't undercut the Lancet numbers: nobody disputes that the fabrication rate climbed more than ten-fold. What's actually in dispute is whether a punitive policy, applied after the fact and adjudicated by hand, is the right lever to pull against a problem that scaled because generation got automated and verification didn't.

The other half of the story: AI catching AI

Here's the detail that complicates the "AI is ruining science" framing without erasing it: the tool that caught this problem was itself AI-assisted. Topaz's team didn't hand-check 97 million references — no team could. They built an automated verification system to do it, one designed specifically to tell a genuine fabrication apart from an oddity like an informally abbreviated title, which matters because a system without that nuance would drown in false alarms. Separately, a retrieval-grounded detector called CiteCheck and commercial tools like Citely are being built for the same purpose: catching hallucinated citations before publication rather than after.
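To make that distinction concrete, here is a minimal Python sketch of the kind of triage involved. It is not the Columbia team's system, nor how CiteCheck or Citely work; the source doesn't describe their internals. The function names, the 0.9 threshold, and the placeholder titles are all assumptions made up for illustration, and the candidate list stands in for whatever a real bibliographic lookup would return.

```python
import re
from difflib import SequenceMatcher


def normalize(title: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9 ]+", " ", title.lower()).split())


def is_abbreviation_of(cited: str, indexed: str) -> bool:
    """True if each cited word is a prefix of some indexed word, in order."""
    cited_words = normalize(cited).split()
    if not cited_words:
        return False
    # A shared iterator forces matches to occur left to right.
    indexed_words = iter(normalize(indexed).split())
    return all(
        any(word.startswith(c) for word in indexed_words) for c in cited_words
    )


def classify(cited: str, candidates: list[str], threshold: float = 0.9) -> str:
    """Label a cited title 'match', 'abbreviated', or 'unmatched'."""
    cited_norm = normalize(cited)
    best = max(
        (SequenceMatcher(None, cited_norm, normalize(c)).ratio() for c in candidates),
        default=0.0,
    )
    if best >= threshold:
        return "match"
    if any(is_abbreviation_of(cited, c) for c in candidates):
        return "abbreviated"
    return "unmatched"  # a prompt for closer checking, not a verdict


if __name__ == "__main__":
    # Invented placeholder title, not a real publication.
    index = ["A randomized trial of example intervention alpha in adults"]
    for cited in (
        "A randomized trial of example intervention alpha in adults.",
        "Rand. trial of example interv. alpha",
        "Deep learning for example outcome beta",
    ):
        print(f"{classify(cited, index):12} {cited}")
```

Run as written, the three placeholder citations come back as match, abbreviated, and unmatched. The prefix rule is deliberately loose and would need tightening before real use, and an "unmatched" result is only a reason to look closer, not proof of fabrication.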

Those tools aren't perfect. Verification protocols in this space report false-positive rates below 0.5%. That is genuinely strong performance, but it is not zero, and at a scale of tens of millions of references even a small rate means some legitimate citations, and the researchers who wrote them, could get wrongly flagged. That's a real cost, not a footnote, and it's the honest reason "just run everything through a checker" isn't a complete answer either.
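To see why a sub-1% error rate still matters at this scale, here is a rough calculation in Python. It is a sketch under loud assumptions: the source doesn't say how the 0.5% figure is defined, so the code shows two possible readings, and it borrows the Lancet audit's totals purely for a sense of scale. Neither result is a number reported by any tool or study.

```python
REFERENCES_CHECKED = 97_100_000  # references in the audit described earlier
FABRICATED_FOUND = 4_046         # fabricated references that audit reported
RATE = 0.005                     # "0.5%", taken as an upper bound (assumption)


def false_flags_per_legit_reference(n_legit: int, rate: float) -> float:
    """Reading A: rate is the share of legitimate references wrongly flagged."""
    return n_legit * rate


def false_flags_as_share_of_flags(n_true_flags: int, rate: float) -> float:
    """Reading B: rate is the share of all raised flags that are wrong."""
    # If wrong / (wrong + true) == rate, then wrong == true * rate / (1 - rate).
    return n_true_flags * rate / (1 - rate)


if __name__ == "__main__":
    legit = REFERENCES_CHECKED - FABRICATED_FOUND
    reading_a = false_flags_per_legit_reference(legit, RATE)
    reading_b = false_flags_as_share_of_flags(FABRICATED_FOUND, RATE)
    print(f"Reading A ceiling: about {reading_a:,.0f} legitimate references flagged")
    print(f"Reading B ceiling: about {reading_b:,.0f} legitimate references flagged")
```

Under the first reading the ceiling is roughly 485,000 wrongly flagged references, far more than the 4,046 real fabrications; under the second it is about 20. Which definition a tool uses decides whether its flags are mostly noise or mostly signal, so that is the first thing to ask of any quoted false-positive rate.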

So the more accurate shape of the story isn't "AI broke science." It's that AI made generating plausible-looking fabrication radically cheap at the exact moment verification was still a manual, human-paced process — and now the field is racing to make verification automated too, with the outcome of that race genuinely undecided. The Lancet audit is a snapshot of a problem still accelerating. The arXiv ban is a first, contested attempt at a policy response. The detection tools are a technical attempt running in parallel. None of the three has settled anything yet.

What has changed, concretely, in three years: a number that a research-integrity office could shrug off in 2023 is now a number that made it into a peer-reviewed medical journal as a warning. That's the actual, verifiable fact underneath all of this. Everything past that — whether bans work, whether detectors scale, whether the arms race favors the fabricators or the checkers — is still being argued in real time, by people who agree on the data and disagree on what to do about it.

If a clinician somewhere is currently building a treatment recommendation on a citation that doesn't exist, and neither they nor anyone reviewing their work has a way to know that yet — how many more audits like Topaz's does it take before "cite your sources" stops being a formality and starts being something we actually verify, every time, before the paper goes out the door?
