A clinical AI that is right 95% of the time is, in one specific and underpriced way, more dangerous than one that is right 70% of the time. Not because it errs more often; it errs less. It is more dangerous because its reliability is high enough to switch off the vigilance of the human who is supposed to catch its errors, and the errors it does make arrive wrapped in the same fluent, confident packaging as the nineteen correct answers before them. The safety case for nearly every deployed clinical model rests on a single sentence: a licensed clinician reviews the output and remains responsible. That sentence is carrying almost all of the regulatory and legal weight. Automation bias, compounded by deskilling, hollows it out from the inside while the org chart still shows a human in the loop.
The human factor that regulators keep assuming away
Automation bias is not a hypothesis. It is one of the better-replicated findings in the human-factors literature, studied for decades in aviation, process control, and increasingly in medicine. People under-scrutinize an automated recommendation, and they do it in two distinct ways. Commission errors: you follow an automated directive that is wrong, overriding evidence sitting in front of you, because the machine said so. Omission errors: you miss a problem because the automation did not flag it, and its silence reads as an all-clear you never independently earned.
Both failure modes get worse under exactly the conditions that define clinical work: time pressure, high cognitive load, interruption, fatigue at hour ten of a shift. And here is the part that should keep anyone building this awake. Automation bias scales with the system's observed reliability. A tool that is frequently wrong keeps you honest, because it burns you often enough that you learn to check. A tool that is almost always right teaches you, one uneventful shift at a time, that checking is wasted motion. The complacency is not a character flaw in the clinician. It is a rational response to a track record. The deployment is conditioning that trust, and the training signal is the tool's own accuracy.
The mammography computer-aided detection story is the cautionary version of this that medicine already lived through. CAD was deployed widely across breast screening on the promise that a second, automated read would catch cancers human readers missed. Large real-world analyses years later found it did not deliver the accuracy gain, and there were signals it made some readers worse, the plausible mechanism being that radiologists began leaning on the marks the software placed and under-reading the regions it left unmarked. The technology was a detector. The deployed system was a detector plus a human whose behavior the detector had quietly changed. Those are not the same object, and only the second one sees patients.
The arithmetic of the vanishing backstop
Put numbers on it, because the point is quantitative and the intuition runs the wrong way.
Take the 70%-reliable tool: it is wrong on 30% of cases. Errors are common enough that the clinician stays engaged and treats every output as a claim to be checked. Say that engaged clinician independently catches 80% of the tool's errors, which means 20% slip through. System error rate: 0.30 × (1 − 0.80) = 0.30 × 0.20 = 6%. The human is doing real work here. They converted a 30% error rate into a 6% one, removing 24 percentage points of risk. That is a backstop worth having.
Now the 95%-reliable tool: wrong on 5% of cases. The errors are rare, spaced weeks apart, and each is preceded by a long unbroken run of the tool being correct. Vigilance decays accordingly. Say the now-complacent clinician catches 15% of errors, so 85% slip through. System error rate: 0.05 × (1 − 0.15) = 0.05 × 0.85 = 4.25%. Lower absolute error, yes. But look at what the human contributed: they took a 5% error rate down to 4.25%, removing 0.75 percentage points. The backstop that was worth 24 points is now worth three-quarters of one point.
The safety case was written assuming the human check is worth 24 points. In deployment it is worth 0.75, and the entire regulatory and liability architecture is priced on the wrong number. The tool got safer in isolation while the marginal safety value of the human check collapsed thirty-two-fold (24 / 0.75). As reliability climbs and vigilance approaches zero, the system error rate converges on the raw model error rate. You have, functionally, removed the human while keeping them on the incident report.
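The arithmetic above fits in a few lines, which makes it easy to rerun with your own numbers. This is a minimal sketch, not a validated risk model: the function names are mine, the catch rates are the illustrative values from the two scenarios, and it assumes the reviewer never overturns a correct output.

```python
def system_error_rate(model_error: float, catch_rate: float) -> float:
    """Fraction of all cases where the tool errs and the reviewer misses it."""
    for name, value in (("model_error", model_error), ("catch_rate", catch_rate)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")
    return model_error * (1.0 - catch_rate)


def human_contribution(model_error: float, catch_rate: float) -> float:
    """Error removed by the human check, as a fraction of all cases."""
    return model_error - system_error_rate(model_error, catch_rate)


# (tool error rate, reviewer catch rate): illustrative values from the text,
# not measurements.
SCENARIOS = {
    "70% tool, engaged reviewer": (0.30, 0.80),
    "95% tool, complacent reviewer": (0.05, 0.15),
}

for label, (err, catch) in SCENARIOS.items():
    print(
        f"{label}: system error {system_error_rate(err, catch):.2%}, "
        f"human removes {human_contribution(err, catch) * 100:.2f} points"
    )
```

Run as written, it reproduces the 6% and 4.25% system error rates and the 24-point versus 0.75-point human contribution. The single input that matters most is the catch rate, and it is the one nobody measures in deployment.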
And the residual errors are not randomly distributed. The 5% that a 95% model gets wrong is disproportionately the hard tail: the atypical presentation, the rare disease with a common mask, the case outside the training distribution. Those are precisely the cases where a skilled independent clinician was the entire point of keeping one in the loop. The tool is most likely to fail exactly where the human was supposed to matter most, at the moment the human has been trained hardest to defer.
Deskilling: the backstop that never forms
The arithmetic above assumes a clinician who once had the skill to catch 80% of errors and merely stopped exercising it. The deeper problem is generational, and it is worse.
A resident who trains from day one alongside a diagnostic model does not build the independent judgment the safety model treats as a given. Clinical reasoning is not knowledge you download; it is a skill compiled from thousands of reps of forming your own differential, committing to a pre-test probability, being wrong, and updating. If the machine hands you a ranked differential before you have built your own, you never do the rep. You get the answer without the reasoning that lets you know when the answer is wrong. You can recognize a good pattern; you cannot generate one under novelty. The 80%-catch clinician is a product of a training environment that the tool itself is dismantling.
This is why "a doctor reviews everything" decays as a safety guarantee over time even if it were true on day one. The safety case is written in the present tense, but the human backstop is a depreciating asset. Each cohort trained on the tool is a slightly thinner check than the last, and the depreciation is invisible in any snapshot metric. You will not see it in this quarter's accuracy dashboard. You will see it the first time the model faces a case it was never trained on and discovers that the human beside it was never trained on it either.
The accountability trap
Now layer on where liability actually sits. The clinician is nominally responsible: their name signs the note, their license is at risk, they are the legally identified decision-maker. The vendor ships a tool with a disclaimer that it is decision support and the physician exercises independent judgment. So the structure holds one party responsible while engineering away their capacity to exercise the responsibility. You are asked to be the check, handed a system optimized to make you stop checking, and told the failure is yours when the check fails.
This is the clinical instance of an argument I have made elsewhere: your AI agent has no skin in the game, so accountability gets shunted onto whichever human is standing closest when it breaks. In consumer software that produces a bad refund. In a hospital it produces a defendant. The physician becomes an accountability sink, a person whose job, functionally, is to absorb liability for a system they cannot realistically override, because overriding a tool that is right 95% of the time feels like arrogance right up until the one time it would have saved someone.
Why confidence is the accelerant
Automation bias needs a fuel, and the fuel is confidence uncoupled from correctness. A model that said "I am 55% sure it is pneumonia, but sepsis is live and I am not calibrated on this presentation" would invite scrutiny. A model that returns a clean, fluent, top-line diagnosis in the same authoritative register whether it is on solid ground or hallucinating invites deference. The uniformity of tone is the problem. It gives the reviewer no signal about when to spend their scarce attention.
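Whether stated confidence is coupled to correctness can be checked rather than argued. One standard measure is expected calibration error: bucket cases by the confidence the model stated, then compare each bucket's average confidence with its observed accuracy. The sketch below assumes the model exposes a per-case confidence and that ground truth arrives later; the function name and the default of ten bins are illustrative choices.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and observed accuracy."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one case")

    # Assign each case to an equal-width confidence bin.
    bins = [[] for _ in range(n_bins)]
    for p, ok in zip(confidences, correct):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"confidence must be between 0 and 1, got {p}")
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bin
        bins[idx].append((p, bool(ok)))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(p for p, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

A model that says 90% on four cases and is right on two of them scores 0.4 here. A score near zero means the stated confidence is at least worth reading. It says nothing about whether the interface shows that confidence to the reviewer.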
This is the same pathology I traced in my piece on what a physician's differential diagnosis teaches about LLM hallucination. A good clinician's differential is a ranked list with probabilities attached and, critically, an explicit "can't-miss" branch: the low-probability, high-lethality diagnoses you actively rule out even when the common answer looks obvious. A language model that emits its single best guess with no distribution, no dissent, and no can't-miss flag has stripped out the exact metadata that tells a human where to push back. Automation bias is the disease; miscalibrated confidence is what makes it contagious.
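To make the missing metadata concrete, here is one hypothetical shape for an output that keeps it. Every name is invented for illustration, as are the 0.80 threshold and every probability except the 55% pneumonia figure from the earlier example. It sketches the data a reviewer would need and is not a clinical rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    diagnosis: str
    probability: float       # model's stated probability for this case
    cant_miss: bool = False  # low-probability, high-lethality: rule out actively


def review_flags(differential, confident_at=0.80):
    """Return the reasons a reviewer should push back on this output."""
    if not differential:
        return ["no differential provided"]

    ranked = sorted(differential, key=lambda c: c.probability, reverse=True)
    top = ranked[0]
    flags = []
    if top.probability < confident_at:
        flags.append(
            f"top diagnosis '{top.diagnosis}' is only {top.probability:.0%}"
        )
    # Any can't-miss diagnosis below the top slot is surfaced explicitly.
    for candidate in ranked[1:]:
        if candidate.cant_miss:
            flags.append(
                f"can't-miss: rule out '{candidate.diagnosis}' "
                f"({candidate.probability:.0%})"
            )
    return flags


example = [
    Candidate("pneumonia", 0.55),
    Candidate("bronchitis", 0.25),
    Candidate("sepsis", 0.20, cant_miss=True),
]
for flag in review_flags(example):
    print(flag)
```

The design point is that the flags come from the distribution, so they vary from case to case. A single top-line answer in a uniform tone cannot do that.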
Design for appropriate reliance, and measure the right system
The goal is not maximum reliance or minimum reliance. It is appropriate reliance: trust that tracks the tool's actual competence on the case in front of you. That is a design target, and it is buildable.
Surface calibrated uncertainty per case, not a global accuracy stat. The reviewer needs to know that this output is a shaky 60%, not that the model averages 95% across a benchmark that looks nothing like their patient. Make the model show dissent: the second-place diagnosis, the finding that argues against its own answer, the reason it might be wrong. Force active engagement at high-stakes gates: before an irreversible or high-lethality decision, require the clinician to commit their own independent read before the AI's is revealed, so the tool cannot anchor a judgment that was never formed. Anchoring bias and automation bias compound viciously; the fix is to make the human commit first. And preserve skill on purpose, with deliberate AI-off practice, the way pilots still hand-fly, so the backstop does not atrophy to zero.
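The commit-first gate is the most mechanical of these, so it is the easiest to sketch. The class below is hypothetical and deliberately small: it withholds the AI's read until the clinician has recorded their own, and it refuses a second commit so the independent read cannot be edited after the reveal. A real system would also need audit logging, identity, and a far better notion of agreement than string equality.

```python
class NotCommittedError(Exception):
    """Raised when the AI output is requested before an independent read."""


class CommitFirstGate:
    """Holds the AI's read until the clinician commits their own."""

    def __init__(self, ai_read: str):
        self._ai_read = ai_read
        self._clinician_read = None

    def commit(self, clinician_read: str) -> None:
        if self._clinician_read is not None:
            raise RuntimeError("independent read already committed")
        if not clinician_read.strip():
            raise ValueError("an empty read is not a commitment")
        self._clinician_read = clinician_read.strip()

    def reveal(self) -> dict:
        if self._clinician_read is None:
            raise NotCommittedError(
                "commit an independent read before viewing the AI output"
            )
        return {
            "clinician": self._clinician_read,
            "ai": self._ai_read,
            # Crude agreement check, for illustration only.
            "agree": self._clinician_read.lower() == self._ai_read.strip().lower(),
        }
```

Usage is two calls: `gate.commit("sepsis")`, then `gate.reveal()`. Disagreements surfaced this way are exactly the cases worth a second look, and the committed reads double as a record of whether independent judgment is still being exercised.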
Above all, validate the human-plus-AI system as deployed, not the model in a vacuum. A trial that shows the algorithm reads scans as well as a radiologist has measured the wrong object. The thing that touches the patient is the radiologist-plus-algorithm sociotechnical system, including the deference, the time pressure, and the skill erosion. Benchmark that. If your evidence is model accuracy in isolation and your safety case is "a clinician reviews everything," you have measured one thing and staked lives on another.
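Measuring the deployed system means logging three things per case: the eventual ground truth, what the model said, and what the clinician actually signed. From those you can compute the model's error, the system's error, and the catch rate the safety case quietly assumes. The sketch below assumes such a log exists and that outputs can be compared by simple equality; the field and function names are mine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    truth: str  # ground truth established later (for example, by follow-up)
    model: str  # what the tool recommended
    final: str  # what the clinician signed after seeing the tool


def deployed_metrics(cases):
    """Compare the model in isolation with the human-plus-AI system."""
    if not cases:
        raise ValueError("need at least one logged case")

    n = len(cases)
    model_wrong = [c for c in cases if c.model != c.truth]
    caught = sum(1 for c in model_wrong if c.final == c.truth)
    # Cases where the tool was right and the human overrode it into an error.
    harmful = sum(1 for c in cases if c.model == c.truth and c.final != c.truth)
    system_wrong = sum(1 for c in cases if c.final != c.truth)

    return {
        "model_error": len(model_wrong) / n,
        "system_error": system_wrong / n,
        # Undefined when the model made no errors in the sample.
        "catch_rate": caught / len(model_wrong) if model_wrong else None,
        "harmful_override_rate": harmful / n,
    }
```

If `system_error` sits close to `model_error` and `catch_rate` is near zero, the human in the loop is contributing little, whatever the org chart says. Tracking `catch_rate` by cohort over time is one way to make the depreciation of the backstop visible.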
The failure here will not look like a dramatic robot error. It will look like a competent physician signing off on a confident, plausible, wrong recommendation on a busy Tuesday, exactly as designed, exactly as trained. Nobody priced that in. It is arriving anyway.