Applied AI

Hallucination Is a Calibration Problem, and Medicine Already Solved It

LLMs are confident, fluent pattern-matchers that will always produce a plausible answer, right or wrong. Medicine built a discipline for reasoning safely around exactly that kind of mind: the differential diagnosis.

By Mehdi, June 17, 2026. 9 min read.

On this page

Fluency is not a prior
The cannot-miss diagnosis is asymmetric-cost gating
Force the differential, not the answer
Order the lab
Where the analogy breaks, and why the break matters

Large language models hallucinate for the same reason clinicians misdiagnose: both are pattern-matchers that will always produce a plausible answer, whether or not a correct one exists. This is not a defect awaiting a patch. It is a structural property of any system that maps a query to its nearest fluent completion. Medicine, having reasoned under exactly this condition for a century, already built the discipline for operating safely around a confident, fallible mind. It is called the differential diagnosis. The enterprise world is trying to engineer hallucination down to zero. That is the wrong target. The right one is calibration and consequence-gating, and the transfer is more literal than it sounds.

Start with what a hallucination actually is, mechanically. The model is not lying and it is not malfunctioning. It is completing a pattern. Given a prompt, it produces a sequence of tokens that is highly probable under the distribution it learned in training. When the true answer is well-represented in that distribution, the most probable completion is usually correct. When it isn't (a fabricated case citation, a nonexistent API method, a plausible-sounding but wrong drug interaction), the model still returns a probable completion, because that is the only thing it does. There is no separate internal state labeled "I don't actually know this." Fluency and correctness are produced by the same machinery, which is precisely why fluency is not evidence of correctness.

Now hold that next to a physician. A clinician walks into a room, hears "crushing chest pain, radiating to the left arm, diaphoretic," and a diagnosis fires before conscious reasoning begins: myocardial infarction. That fire is pattern-matching. It is fast, it is usually right, and — this is the part medicine internalized the hard way — it is confidently wrong often enough to kill people. The same presentation is produced by aortic dissection, in which the reflexive treatment for a heart attack (thin the blood, bust the clot) can exsanguinate the patient. The pattern-matcher gives you one fluent answer. The discipline exists to stop you from acting on it.

Fluency is not a prior

The first safeguard medicine builds on top of the pattern-matcher is pre-test probability. Before you weigh any finding, you ask: how common is this condition in a patient like this, before I've examined anything? A twenty-five-year-old with chest pain and a heart-attack pattern is far more likely to have costochondritis or anxiety than coronary occlusion, because the base rate of MI in that demographic is low. The vividness of the presentation does not change the prior. A textbook-perfect story for a rare disease is more likely to be a common disease presenting atypically, or a misleading history, than the rare disease itself. This is Bayes, and clinicians run it in their bones: posterior belief is proportional to the likelihood of the evidence multiplied by the prior, and a low enough prior should survive a very fluent story.
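To make the arithmetic concrete, here is a minimal sketch of Bayes' rule for a yes/no hypothesis. The function name `posterior` and every number in it are illustrative assumptions, not clinical data; the point is only that the same "textbook" evidence lands very differently depending on the prior.

```python
def posterior(prior: float, p_evidence_if_true: float, p_evidence_if_false: float) -> float:
    """Bayes' rule for a binary hypothesis: P(H | evidence)."""
    numerator = p_evidence_if_true * prior
    denominator = numerator + p_evidence_if_false * (1.0 - prior)
    return numerator / denominator


# Illustrative numbers only (assumptions): the same vivid presentation,
# seen in a population where the condition is rare vs. fairly common.
rare = posterior(prior=0.001, p_evidence_if_true=0.9, p_evidence_if_false=0.05)
common = posterior(prior=0.2, p_evidence_if_true=0.9, p_evidence_if_false=0.05)

print(f"rare condition:   {rare:.3f}")    # 0.018
print(f"common condition: {common:.3f}")  # 0.818
```

Identical evidence, an eighteen-fold likelihood ratio in favor, and the rare hypothesis is still under 2%. That is what it means for a low prior to survive a fluent story.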

The direct translation: never trust a fluent model answer whose prior is low. Ask a model for the population of France and the base rate that this is a well-represented, stable fact is enormous; the fluent answer is almost certainly right. Ask it for the third-largest customer of a private company, or the specific clause number in a contract it half-remembers, or a citation to a 2019 paper on a niche topic, and the prior that a correct answer exists cleanly in its training distribution is low. A low prior should not be overridden by how confident the prose sounds. The failure mode in production is people treating fluency as if it were a prior. It is not. It is a likelihood term with the prior silently set to one. The discipline is to re-insert the prior by asking, before you act on any output: what is the base rate that this class of question has a retrievable, correct answer in this model at all? For whole categories of query, that base rate is low, and no amount of prompt engineering raises it.

This is also where the "just add retrieval" reflex goes wrong in a subtle way. Grounding the model in documents raises the prior for facts that live in those documents. It does nothing for the inferential leaps the model makes between them, which is where causal errors hide: the model confidently asserting that A drove B because the two co-occur in the retrieved text. I've written before about how the causal-inference problem hiding inside every AI business decision is a distinct failure from factual recall, and it is exactly the failure retrieval doesn't touch. A grounded model can cite a real document and still hallucinate the causal story that connects the citations.

The cannot-miss diagnosis is asymmetric-cost gating

Here is the move that matters most, because it reframes the entire zero-hallucination project. Medicine does not treat all diagnostic errors as equal. It sorts them by consequence. Every clinician is trained to explicitly enumerate the "cannot-miss" diagnoses for a given presentation — the ones where being wrong is catastrophic and irreversible — and to actively rule those out even when they are unlikely. For chest pain: MI, aortic dissection, pulmonary embolism, tension pneumothorax, esophageal rupture. Most chest pain is none of these. You rule them out anyway, because the cost of missing them is death and the cost of the ruling-out test is an hour and a few hundred dollars.

That is asymmetric-cost gating, and it is the correct architecture for LLM deployment. The question is never "how do I make the model never wrong?" It is "for this specific output, what does a wrong answer cost, and is that cost recoverable?" Where a wrong answer is cheap and reversible — a first-draft email, a code suggestion a developer reviews before running, a brainstorm — let it ride. Hallucination there is a rounding error against the productivity gain, and the human closes the loop. Where a wrong answer is catastrophic or irreversible — a number that goes into a financial filing, a medical dosage, an automated action against a production database, a legal claim asserted to a court — you gate it, verifying cheaply the way a physician orders a troponin. You do not need the model to be perfect. You need to know which outputs are load-bearing and route only those through confirmation.
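As a rough sketch of what that routing could look like in code: the names here (`Output`, `Route`, `route`, `cost_threshold`) are hypothetical, and the threshold and cost figures are placeholders you would set per workflow. Note that the model's confidence is recorded but deliberately never consulted.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    LET_IT_RIDE = "let_it_ride"
    VERIFY = "verify"


@dataclass(frozen=True)
class Output:
    description: str
    cost_if_wrong: float     # hypothetical cost units, set by the workflow owner
    reversible: bool         # can a wrong answer be undone after the fact?
    model_confidence: float  # kept for logging; not used for routing


def route(output: Output, cost_threshold: float = 1_000.0) -> Route:
    """Gate on consequence: irreversible or expensive outputs get checked."""
    if not output.reversible or output.cost_if_wrong >= cost_threshold:
        return Route.VERIFY
    return Route.LET_IT_RIDE


draft = Output("first-draft email", cost_if_wrong=5.0, reversible=True, model_confidence=0.60)
filing = Output("figure for a financial filing", cost_if_wrong=500_000.0, reversible=False, model_confidence=0.99)

print(route(draft).value)   # let_it_ride
print(route(filing).value)  # verify
```

The filing gets checked despite a stated confidence of 0.99, and the draft rides despite 0.60. Consequence decides, not confidence.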

The enterprise obsession with driving hallucination to zero gets this exactly backwards. It spends enormous effort trying to make every answer trustworthy so that no gating is required, which is both impossible and unnecessary. A physician does not try to make their pattern-matcher perfect either. They accept it will misfire and they build the workflow so that misfires in the cannot-miss category get caught. The design target is not a lower error rate. It is a system where the errors that survive are the cheap, recoverable ones, and where consequence, not confidence, decides what gets checked.
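A back-of-the-envelope comparison shows why a lower error rate is not the same thing as a lower expected loss. This is a sketch under stated assumptions: the function `expected_cost_per_output` is hypothetical, and the share of high-stakes outputs, the costs, and the check's catch rate are all invented for illustration.

```python
def expected_cost_per_output(
    error_rate: float,
    share_high_stakes: float,
    cost_high: float,
    cost_low: float,
    catch_rate: float = 0.0,  # fraction of high-stakes errors the check catches
    check_cost: float = 0.0,  # cost of running the check on one high-stakes output
) -> float:
    """Expected loss per output when only high-stakes outputs are gated."""
    surviving_high = error_rate * (1.0 - catch_rate) * cost_high
    high = share_high_stakes * (surviving_high + check_cost)
    low = (1.0 - share_high_stakes) * error_rate * cost_low
    return high + low


# Assumed workload: 10% of outputs are high-stakes; a high-stakes error
# costs 100,000 units, a low-stakes error costs 10.
ungated = expected_cost_per_output(0.02, 0.10, 100_000.0, 10.0)
gated = expected_cost_per_output(0.05, 0.10, 100_000.0, 10.0, catch_rate=0.95, check_cost=50.0)

print(f"2% error rate, no gate: {ungated:.2f}")  # 200.18
print(f"5% error rate, gated:   {gated:.2f}")    # 30.45
```

Under these assumptions the model with the worse error rate is the cheaper system to run by a wide margin, because almost none of its expensive errors survive.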

Force the differential, not the answer

A clinician who writes down one diagnosis and stops has committed the cardinal sin: premature closure. The discipline requires an explicit differential — the top handful of conditions that could produce this presentation, ranked, held open until evidence discriminates among them. The act of enumerating alternatives is itself the safeguard. It is very hard to anchor on a wrong answer you were forced to list beside three plausible competitors.

LLMs make this trivially operational and almost nobody does it. Instead of asking for the answer, ask for the top-k candidates with the reasoning that would distinguish them, then require the discriminating evidence. "Give me the three most likely causes of this error log, and for each, the one check that would confirm or exclude it." This does two things. It surfaces the model's own uncertainty: when the top three are near-ties, that is signal, the equivalent of a wide differential meaning you don't yet know. And it converts a single fluent assertion, which invites belief, into a set of competing hypotheses, which invites testing. The single-answer interface is an anchoring machine. The differential interface is a calibration instrument, and it costs one line of prompting.
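A minimal sketch of the differential interface, assuming nothing about any particular model API: `differential_prompt` only builds the prompt text, and `is_wide_differential` takes candidate scores that you are assumed to have obtained somehow (for example, relative likelihoods you asked the model to attach to each candidate). Both names and the `margin` default are hypothetical.

```python
from typing import Sequence


def differential_prompt(problem: str, k: int = 3) -> str:
    """Ask for ranked competing hypotheses plus a discriminating check for each."""
    return (
        f"Give me the {k} most likely causes of the following, ranked. "
        "For each, state the one check that would confirm or exclude it.\n\n"
        f"{problem}"
    )


def is_wide_differential(scores: Sequence[float], margin: float = 0.15) -> bool:
    """True when the top two candidates are near-ties, i.e. you don't yet know."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2:
        return False
    return (ranked[0] - ranked[1]) < margin


print(differential_prompt("ERROR: connection reset by peer (worker 3)"))
print(is_wide_differential([0.36, 0.33, 0.31]))  # True: hold the differential open
print(is_wide_differential([0.80, 0.15, 0.05]))  # False: one clear front-runner
```

A near-tie is not a failure of the prompt. It is the output telling you to go get discriminating evidence before acting.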

Order the lab

The final piece is the one clinicians rely on most and AI deployments skip most. A doctor does not resolve a differential by thinking harder. They route the hypothesis to a cheap external check with a different failure mode than their own reasoning: a blood test, an X-ray, an ECG. The intuition proposes; the lab disposes. Crucially, the check is independent. Its errors are uncorrelated with the clinician's errors, which is the entire point: two systems that fail in the same way validate nothing.

The transferable rule: route the model's claim to a cheap external check whose errors are independent of the model's. If the model writes code, run it. If it produces a number, recompute it deterministically. If it cites a case, hit the legal database. If it extracts a value from a document, diff it against a regex or a second model with a different architecture and different training data. The check does not need to be as smart as the model. A troponin assay understands nothing about cardiology, and that is fine, because it fails independently of the physician. Most hallucination damage in production happens because there is no lab in the loop: the fluent answer goes straight into the decision with no independent test between assertion and action. The cost of building that test is almost always trivial next to the cost of the assertion being wrong, which is exactly the calculus that makes a physician order the troponin on the low-probability MI.
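One of the cheapest labs is deterministic recomputation of a number the model produced. The sketch below assumes a hypothetical case where the model reports a total for a set of line items; `check_claimed_total` and the sample figures are illustrative. It uses Python's standard `decimal` module so the check itself does not introduce floating-point noise.

```python
from decimal import Decimal
from typing import Iterable


def check_claimed_total(
    line_items: Iterable[str],
    claimed_total: str,
    tolerance: Decimal = Decimal("0.01"),
) -> bool:
    """Recompute the sum independently and compare it to the model's claim."""
    recomputed = sum((Decimal(item) for item in line_items), Decimal("0"))
    return abs(recomputed - Decimal(claimed_total)) <= tolerance


items = ["1200.50", "349.99", "75.00"]

print(check_claimed_total(items, "1625.49"))  # True: the claim matches
print(check_claimed_total(items, "1652.49"))  # False: transposed digits caught
```

The check knows nothing about invoices or language. It only adds, which is exactly why its errors are unrelated to the model's.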

Where the analogy breaks, and why the break matters

Push the analogy honestly and it snaps at one specific joint. The physician has a body. Somatic intuition — the trained unease that a patient "looks sick," the pattern below language that makes a clinician re-examine a normal-looking chart — is a real signal, produced by a nervous system with millions of years of selection behind threat detection. The model has nothing underneath the text. Its "uncertainty," even when you extract it as a probability, is a statistical artifact of the token distribution, not a felt sense that something is off. There is no gut.

The deeper break: the physician has skin in the game. A doctor who misses a dissection loses a patient, faces a lawsuit, carries the memory. That asymmetric stake is what makes the whole discipline stick; the cannot-miss list is enforced by consequences the clinician personally bears. The model bears nothing. It produces the same confident tokens whether it is right or catastrophically wrong, and it will produce them again tomorrow with no memory of the miss. Zahavi's handicap principle explains why a signal you pay for is more trustworthy than a signal that is free: the peacock's tail is honest because it is costly. The model's confidence is a cost-free signal, which is precisely why it carries no information about correctness. The physician's confidence, backed by liability and a body that remembers, carries at least some.

This is not a reason to abandon the analogy. It is the reason the external safeguards are non-negotiable. Because the model has no gut and no stake, you cannot outsource any part of the calibration to the model's own sense of confidence. Every safeguard has to be structural and external: the prior you enforce, the consequence-gate you set, the differential you demand, the lab you order. In medicine these compensate for a fallible clinician who nonetheless has intuition and skin in the game. With LLMs they carry the entire load, because there is no intuition and no stake underneath to help.

Which is the real reason "make it stop hallucinating" is the wrong project. It treats the model as a diagnostician who could, with enough training, be trusted to work unsupervised. It never will be, not because the technology is immature but because a system with no somatic prior and no consequence for error is not the kind of thing you trust unsupervised. You build a discipline around it. Medicine figured out how to get life-and-death decisions right using minds that are confidently wrong all the time. The tools are portable, they are cheap, and they have been sitting in every teaching hospital for a hundred years. The enterprises trying to sand hallucination down to nothing are, in effect, trying to hire a doctor who never needs to order a test. That doctor does not exist, and if one claimed to, that confidence would be the reason to fire them. The strategy failure here is the same one I've argued makes most AI strategy biologically illiterate: treating a probabilistic, evolved-style system as if it were a deterministic instrument, then acting shocked when it behaves like what it is.

Stop trying to make the pattern-matcher perfect. Order the lab.

Frequently asked questions

Isn't reducing hallucination still worth pursuing?

Yes, at the margin. Better retrieval, grounding, and training reduce the base rate of confident errors, and that's real. The argument is against zero as the target. A model that hallucinates 2% of the time with no way to know which 2% is more dangerous in an ungated workflow than a model that hallucinates 5% but routes high-consequence claims to a check. Spend on calibration and gating first, error rate second.

Does this apply to agentic systems that take actions, not just answer questions?

More so. An agent that books, sends, deletes, or trades is a diagnostician who also writes the orders. The cannot-miss logic becomes a hard gate: any irreversible or high-cost action gets a confirmation step or a cheap external check before execution, regardless of the model's stated confidence. The reversibility of the action, not the fluency of the reasoning, sets the bar.

How is calibration different from just measuring accuracy?

Accuracy is how often the model is right. Calibration is whether its confidence tracks its correctness: when it says 90%, is it right about 90% of the time? A model can be accurate but badly calibrated, confidently wrong on the cases it misses, which is the failure mode that hurts you, because you can't tell the wins from the losses at the moment of decision.
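The difference between accuracy and calibration can be measured. One common summary is expected calibration error: bin the answers by stated confidence, then average the gap between confidence and observed accuracy in each bin. The sketch below is a plain-Python version under assumptions: the function name, the equal-width binning, and the toy data are mine, not a specific library's API.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Toy data: both models are right on 8 of 10 answers (80% accurate).
outcomes = [True] * 8 + [False] * 2
honest = expected_calibration_error([0.80] * 10, outcomes)
overconfident = expected_calibration_error([0.99] * 10, outcomes)

print(f"says 80%, right 80%: ECE = {honest:.2f}")         # 0.00
print(f"says 99%, right 80%: ECE = {overconfident:.2f}")  # 0.19
```

Same accuracy, very different instruments. Only the first one tells you anything useful at the moment of decision.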

Filed under Applied AI. AI that ships, not AI that demos.
