Of everything in the MAMMAL paper — nine state-of-the-art benchmark wins, a headline duel with AlphaFold3 — the result that actually counts is the smallest and least glamorous one. IBM Research and Technion took a model that had never seen four particular drugs, asked it to rank them by potency, and then ran the experiment. The model said Carfilzomib > Nintedanib > Infigratinib > Vemurafenib. The wet lab agreed, exactly. That single confirmed ranking is worth more than the entire leaderboard, and it is worth understanding precisely why.
The reason is not sentiment about "real biology." It is an asymmetry in cost. Benchmark state-of-the-art is cheap and endlessly gameable: you fine-tune on a task, you compare against whatever specialized models happened to publish, you clear a 1% relative-improvement bar, and you book the win. A held-out prediction that survives contact with a physical assay is expensive, slow, and rare. MAMMAL produced one. In a field where roughly 90% of drug candidates that enter clinical trials fail before regulatory approval, the ability to be right about a molecule you have never encountered is the only capability that eventually pays for itself.
What the experiment actually showed
Here is the setup, stated accurately, because the details are the whole argument.
The authors selected four drugs and held them out of the GDSC cancer-drug-response training data the model learned from. MAMMAL predicted their relative potency ordering: Carfilzomib most potent, then Nintedanib, then Infigratinib, then Vemurafenib least potent. They then ran the standard GDSC protocol in the lab: cell viability measured with CellTiter-Glo after 72 hours of drug incubation, with IC50 values fit in Prism. The measured ranking on the tested cell lines matched the prediction exactly. When the comparison was extended computationally to all 805 GDSC cell lines, the ordering held in about 90 to 95% of them, which suggests these particular potency gaps are largely cell-line-independent rather than a lucky artifact of one genetic background.
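The per-cell-line check described above can be sketched in a few lines. This is a minimal illustration and not the authors' code: the function names are hypothetical, the IC50 values are invented placeholders, and a lower IC50 is taken to mean higher potency.

```python
from typing import Dict, List

# Predicted ordering reported in the text, most potent first.
PREDICTED = ["Carfilzomib", "Nintedanib", "Infigratinib", "Vemurafenib"]


def potency_order(ic50: Dict[str, float]) -> List[str]:
    """Sort drugs from most to least potent (lower IC50 = more potent)."""
    return sorted(ic50, key=lambda drug: ic50[drug])


def fraction_matching(per_line: Dict[str, Dict[str, float]], predicted: List[str]) -> float:
    """Share of cell lines whose full four-drug ordering equals the prediction."""
    hits = sum(potency_order(ic50) == predicted for ic50 in per_line.values())
    return hits / len(per_line)


# Hypothetical IC50 values (micromolar), invented purely for illustration.
toy_panel = {
    "line_A": {"Carfilzomib": 0.01, "Nintedanib": 1.2, "Infigratinib": 3.5, "Vemurafenib": 9.0},
    "line_B": {"Carfilzomib": 0.02, "Nintedanib": 0.9, "Infigratinib": 4.1, "Vemurafenib": 12.0},
    # In this made-up line, Infigratinib and Nintedanib swap places.
    "line_C": {"Carfilzomib": 0.03, "Nintedanib": 5.0, "Infigratinib": 2.0, "Vemurafenib": 8.0},
}

print(fraction_matching(toy_panel, PREDICTED))  # two of three toy lines match
```

Exact-match on the full ordering is the strictest possible criterion; a rank correlation would be a softer alternative, and the sketch does not claim to know which one the paper used for its 90 to 95% figure.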
That much is a clean result. But the part that makes it a generalization result rather than a retrieval result is the chemistry. Three of the four drugs (Carfilzomib, Nintedanib, Infigratinib) had no structurally similar compound anywhere in the training set: their Tanimoto coefficient against every compound the model had seen was below 0.7. Only Vemurafenib had a neighbor above that cutoff, with a similarity of 0.82 to PLX-4720, a BRAF inhibitor present in GDSC.
Sit with that distinction, because it is the one that separates a real signal from a flattering one.
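To make the similarity criterion concrete: the Tanimoto coefficient of two binary fingerprints is the number of bits they share divided by the number of bits set in either. The sketch below uses plain Python sets of "on" bit indices as a stand-in for real molecular fingerprints. The helper names and toy bit sets are hypothetical, and the fingerprint type the paper used is not assumed here; only the 0.7 cutoff comes from the text.

```python
from typing import Iterable, Set

NOVELTY_THRESHOLD = 0.7  # cutoff quoted in the text above


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity: |A & B| / |A | B|."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def max_similarity(query: Set[int], training: Iterable[Set[int]]) -> float:
    """Highest similarity between the query and any training compound."""
    return max((tanimoto(query, t) for t in training), default=0.0)


def is_structurally_novel(query: Set[int], training: Iterable[Set[int]],
                          threshold: float = NOVELTY_THRESHOLD) -> bool:
    """True when no training compound reaches the similarity threshold."""
    return max_similarity(query, training) < threshold


# Toy fingerprints, invented for illustration only.
training_set = [{1, 2, 3, 4, 5}, {10, 11, 12, 13}]
near_query = {1, 2, 3, 4, 5, 6}  # 5 shared bits out of 6 -> about 0.83
far_query = {1, 20, 21, 22}      # 1 shared bit out of 8 -> 0.125

print(is_structurally_novel(near_query, training_set))  # False
print(is_structurally_novel(far_query, training_set))   # True
```

In this framing, Vemurafenib plays the role of `near_query` and the other three drugs play the role of `far_query`.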
Interpolation proves nothing; extrapolation is the test
A model that gets the right answer only when the query sits near a training example has demonstrated a lookup table with good manners. This is the failure mode that makes so many "AI predicts X" papers hollow: the held-out test set is drawn from the same distribution as training, the novel examples are novel in label but not in structure, and the model is quietly interpolating between neighbors it already memorized. The performance is real in the narrow sense and meaningless in the sense that matters, because drug discovery is precisely the business of asking about molecules that are not like the ones you already have data for.
The three Tanimoto-distant drugs are the answer to that objection. Correctly ranking compounds with no close structural analog in training is extrapolation into chemistry the model was never shown. That is the thing you actually want a foundation model to do, and it is the thing benchmark tables systematically over-reward you for faking. When I read a comp-bio result, the first question is never "what was the metric" — it is "how far from the training manifold did the query sit." Here the honest answer is: far, on three of four, and the model still ordered them right.
That MAMMAL can do this at all is downstream of a specific architectural choice worth naming. Rather than binning or discretizing numerical values the way many sequence models do — which throws away quantitative precision exactly where you need it, in affinity and IC50 regression — MAMMAL projects native numbers into continuous embeddings through a learned layer. Potency is a continuous quantity. A model that quantizes it into buckets is structurally handicapped for ranking; one that keeps the numbers as numbers is not. The wet-lab ranking is, in part, that design decision cashing out.
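The contrast between bucketing and a continuous projection is easy to see in a toy example. This is a sketch under stated assumptions, not MAMMAL's actual layer: the bucket edges, the weights, and the two potency values are all hypothetical, and a real model would learn the weights and bias during training.

```python
from typing import List


def bin_token(value: float, edges: List[float]) -> int:
    """Discretize a scalar into a bucket index (the approach the text argues against)."""
    return sum(value >= e for e in edges)


def project_scalar(value: float, weight: List[float], bias: List[float]) -> List[float]:
    """Map a scalar to a small vector with an affine layer: value * w + b."""
    return [value * w + b for w, b in zip(weight, bias)]


# Hypothetical bucket edges and stand-in "learned" parameters.
edges = [5.0, 6.0, 7.0, 8.0]
weight = [0.5, -1.0, 2.0]
bias = [0.0, 0.1, -0.2]

# Two hypothetical log-scale potency values that clearly differ.
potency_a, potency_b = 6.1, 6.9

# Both values fall in the same bucket, so a binned model cannot rank them.
print(bin_token(potency_a, edges), bin_token(potency_b, edges))  # 2 2

# The continuous projection keeps them distinct and order-preserving per dimension.
print(project_scalar(potency_a, weight, bias))
print(project_scalar(potency_b, weight, bias))
```

Finer buckets shrink the problem but never remove it; any two values inside one bucket become indistinguishable, whereas the affine map keeps every difference visible to the rest of the network.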
Generation is cheap; falsification is the work
There is a deeper reason to privilege the assay over the leaderboard, and it is epistemic, not methodological. Producing predictions is easy. Any sufficiently large model will generate a ranking, a binding score, a structure, a plausible-looking answer for any input you hand it. Generation is not where the difficulty lives. The difficulty — the entire load-bearing act of science — is falsification against reality: constructing the one measurement that could have proven you wrong and finding out that it didn't.
This is the flaw in the fantasy of the fully automated discovery engine, which I've argued elsewhere is a category error. A model that proposes hypotheses is doing the cheap half. The expensive half is the wet lab, the 72-hour incubation, the physical world declining to care what your loss curve looked like. MAMMAL's benchmark wins are the cheap half done well. The Carfilzomib ranking is the rare case where the expensive half was actually run and the prediction survived it. One assay is not many, but one real falsification test passed outranks a page of numbers that were never exposed to the possibility of being wrong.
Which is also the honest way to read the eleven benchmarks. MAMMAL is state-of-the-art on nine of them and competitive on two, and some of those margins are genuinely large — a 28.5% relative jump on protein-protein interaction ddG (SKEMPI S1131, Pearson 0.663 to 0.852), sequence-only, landing within 1.6% of the best structure-based method at 0.866. That is a real result and I don't want to wave it away. But every one of those numbers lives on the interpolation-flattered side of the ledger: fine-tuned to the task, scored against a fixed test set, compared only to models that publicly reported. The benchmarks tell you the model is well-built. They cannot tell you it generalizes to reality, because the ground truth they measure against is itself a static, pre-collected artifact. That gap — between what a benchmark certifies and what a physical experiment certifies — is the actual bottleneck in AI drug discovery, and it is exactly the gap the wet-lab experiment steps across.
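The two percentages in that SKEMPI comparison follow directly from the three Pearson values quoted above, and the arithmetic is worth checking once. The variable names below are illustrative only.

```python
def relative_change(old: float, new: float) -> float:
    """Change of `new` relative to `old`, expressed as a fraction of `old`."""
    return (new - old) / old


# Pearson correlations quoted in the text.
baseline, mammal, best_structure = 0.663, 0.852, 0.866

gain = relative_change(baseline, mammal)       # jump over the prior score
gap = relative_change(best_structure, mammal)  # negative: MAMMAL sits just below

print(f"{gain:.1%}, {abs(gap):.1%}")  # prints 28.5%, 1.6%
```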
The same lens deflates the AlphaFold3 comparison that the paper leads with and that will get the most attention. Fine-tuned MAMMAL beat AF3 on 5 of 7 targets when AF3's confidence scores (pTM/ipTM) were used zero-shot as a binder-versus-non-binder proxy: 0.93 versus 0.45 AUROC on HER2, 1.00 versus 0.59 on CD206, and so on. Those numbers look devastating, but read the conditions: AF3 was applied zero-shot to a job it was never designed for (it is a structure predictor, not a binary binding classifier), MAMMAL was fine-tuned with explicit binder and non-binder examples, and the HER2 test set was downsampled to 60 pairs because AF3 is expensive to run. The two tied on TNFalpha, and AF3 actually won on the rigid globular target TBG (MAMMAL 0.63 to AF3's 1.00). That fits the paper's own mechanistic story: sequence models capture the statistical properties of intrinsically disordered regions, which make up 30-40% of the human proteome and which a single-conformation structure model handles poorly. It is an interesting exploratory comparison, and it takes nothing away from AF3, whose AlphaFold lineage was recognized with a share of the 2024 Nobel Prize in Chemistry. It is simply not the evidence. The evidence ran in a lab.
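For readers less familiar with the metric in that comparison, AUROC is the probability that a randomly chosen binder receives a higher score than a randomly chosen non-binder, with ties counted as half. A minimal pure-Python sketch follows; the scores and labels are invented, and the function is a generic pairwise implementation rather than the paper's evaluation code.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pairwise AUROC: P(score of a binder > score of a non-binder), ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one binder and one non-binder")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = binder, 0 = non-binder) and two made-up scorers.
labels = [1, 1, 1, 0, 0, 0]
separating_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]  # every binder outranks every non-binder
uninformative_scores = [0.6, 0.4, 0.5, 0.5, 0.7, 0.3]

print(auroc(separating_scores, labels))     # 1.0
print(auroc(uninformative_scores, labels))  # 0.5
```

Read this way, 0.45 on HER2 means the repurposed confidence score ranked binders no better than chance, which says more about the proxy than about AF3. On a test set of only 60 pairs, each AUROC value also rests on a limited number of binder versus non-binder comparisons.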
The calibrating adult in the room
Now the part where I have to be the physician, not the enthusiast, because the most exciting thread in this paper is also the one most easily oversold.
Carfilzomib is a proteasome inhibitor. It is approved for multiple myeloma — a hematological cancer — and it has limited efficacy in solid tumors. MAMMAL predicted it as the most potent of the four across a panel of solid-tumor cell lines. If that prediction pointed at something real, it would be a repurposing signal: a drug reaching indications it currently doesn't serve. The authors flag it exactly this way, as a hypothesis that warrants further investigation. That is the correct register, and it is worth holding the line on it against the version of this story that a press release would write.
So draw the gap deliberately, because the reader should feel it. Cell lines are not patients. Immortalized cells in a well do not have a vasculature, an immune system, a tumor microenvironment, or a pharmacokinetic profile deciding whether the drug ever reaches the target at a tolerable dose. In-vitro potency is not clinical efficacy; the graveyard of oncology is full of compounds that killed cells beautifully in a dish and did nothing survivable in a person. And one assay is one assay — a confirmed ranking on the tested cell lines, extended computationally to the rest, is a strong early-pipeline signal and nothing more. The honest response to the Carfilzomib prediction is to fund the next experiment, not to announce a therapy.
Both things are true at once, and the discipline is entirely in holding them together. This is the most exciting result in the paper and it is a hypothesis. MAMMAL predicted potency for structurally novel drugs and the wet lab confirmed the ordering — that is a genuine, non-trivial demonstration that the model extrapolates. The Carfilzomib-across-solid-tumors signal is real enough to chase and nowhere near proven enough to believe. A researcher who can only feel one of those at a time — pure hype or pure dismissal — is not doing science; they are doing marketing or its cynical twin.
The model itself deserves the same calibration. MAMMAL is a well-engineered, genuinely open contribution — 458 million parameters, released weights, one unified sequence-to-sequence framework spanning small molecules, protein sequences, and gene-expression rankings, pretrained on two billion samples. It is sequence-only, it does not model 3D structure, and its authors say so plainly. It advances the tooling. What it does not do is collapse the distance between a prediction and a truth.
That distance is the entire game. The leaderboard is where you prove your model is smart. The wet lab is where reality gets a vote — and this time, on chemistry the model had never seen, reality voted yes. Everything else in the paper is a reason to run more experiments. That one result is a reason to believe the experiments are worth running.