A 458-million-parameter, open, sequence-only model fine-tuned with negative examples out-discriminated AlphaFold3 on antibody and nanobody binding in five of seven targets. That is the headline from MAMMAL (npj Drug Discovery 2026, IBM Research and Technion), and most coverage will read it as "sequence beat structure." That reading is wrong, and the way it's wrong is the whole point.
MAMMAL did not beat AlphaFold3 at what AF3 is extraordinary at — predicting 3D structure. It beat AF3 at a different task, one that was never really a structure-prediction problem: sorting molecules that bind from molecules that don't. AF3 was handed that task zero-shot, with its confidence scores (pTM / ipTM) pressed into service as a stand-in for binding likelihood. MAMMAL was trained on the actual label. The interesting result isn't the win. It's where AF3 lost and where it won, because that pattern tells you something durable about when structure-first models are the right tool and when they quietly aren't.
The result, stated fairly
Set up the comparison honestly, because the framing does most of the work. The authors evaluated seven antibody/nanobody-antigen targets. MAMMAL was fine-tuned on binders and non-binders — it saw negative examples. AF3 was run zero-shot and is not designed as a binary binding classifier; its structural-confidence scores were used as a proxy for "does this bind." MAMMAL is sequence-only. AF3 produces full 3D structures MAMMAL cannot. The nanobody evaluation drew from 668 nanobody-antigen pairs (131 binders, 537 non-binders), with 193 held out for test. This is exploratory work on small test sets, and in the HER2 case the set was downsampled to 60 (30 binders, 30 non-binders) because running AF3 is compute-expensive.
Here is the per-target picture (AUROC, higher is better; Δ is MAMMAL minus AlphaFold3):
| Target | MAMMAL | AlphaFold3 | Δ | p-value |
|---|---|---|---|---|
| HER2 ECD | 0.93 | 0.45 | +0.48 | 1.5e-6 |
| Albumin | 0.91 | 0.59 | +0.32 | 7.4e-3 |
| CD206 (mannose receptor) | 1.00 | 0.59 | +0.41 | 1.1e-5 |
| EGFR | 0.94 | 0.49 | +0.45 | 2.2e-5 |
| VWF (von Willebrand factor) | 0.83 | 0.32 | +0.51 | 3.9e-6 |
| TNFα | 0.86 | 0.87 | −0.01 | n.s. (tie) |
| TBG (thyroxine-binding globulin) | 0.63 | 1.00 | −0.37 | 1.0e-4 |
Read the losing rows before the winning ones. On five targets, AF3's AUROC sits at or below 0.59, which is at or near coin-flip discrimination. On HER2 (0.45) and EGFR (0.49) it dips just below chance, and on VWF (0.32) it falls well below, meaning the confidence score ranked non-binders above binders more often than not. A structure model that produces beautiful, often correct 3D complexes is, on these targets, unable to reliably tell a binder from a non-binder. Meanwhile on TBG, AF3 is perfect (1.00) and MAMMAL is mediocre (0.63). That single flipped row is the most informative data point in the table.
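For readers who want the AUROC numbers to feel less abstract, here is a minimal sketch of what the metric measures: the probability that a randomly chosen binder gets a higher score than a randomly chosen non-binder. The function name `auroc` and the toy scores are my own illustration, not the paper's evaluation code or data.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Probability that a random binder outscores a random non-binder.

    Ties count as half a win, which matches the Mann-Whitney U formulation.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one binder and one non-binder")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy scores, not values from the paper.
labels = [1, 1, 1, 0, 0, 0]
well_ranked = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]    # every binder above every non-binder
uninformative = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]  # all ties
inverted = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]       # every binder below every non-binder

print(auroc(well_ranked, labels))    # 1.0
print(auroc(uninformative, labels))  # 0.5
print(auroc(inverted, labels))       # 0.0
```

The practical reading: 0.5 is a score that carries no ranking information, and anything below 0.5 means the score is ordering binders and non-binders the wrong way round more often than the right way.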
The pattern is about rigidity, not about structure-vs-sequence
The targets where AF3 collapsed share a property. EGFR and HER2 are IDR-rich — heavy with intrinsically disordered regions. CD206, VWF, and albumin are flexible, glycosylated, or multi-domain. The one target where AF3 was flawless, TBG, is a rigid globular protein — exactly the kind of clean, crystallizable object that structure prediction was built on and validated against. TNFα, where the two tied, sits in between.
The paper names three reasons AF3 struggled, and they compound:
- A bias toward true-positive PDB complexes. AF3 learned from a Protein Data Bank populated overwhelmingly by things that crystallized and things that bind. That is a census of the biophysically cooperative, not a representative sample of interaction space.
- No explicit negative supervision. AF3 was never shown "these two things do not bind" as a training signal. A confidence score is not calibrated on non-binders because non-binders were largely absent from what shaped it.
- A single static conformation. AF3 commits to one structure. Intrinsically disordered regions don't have one structure — they are ensembles. Representing a fluctuating region by one snapshot discards the thing that defines it.
Here is where my read as a comp-bio person matters, and I'll mark it as interpretation rather than the paper's claim. IDRs are not an exotic edge case you can wave away. The paper puts them at 30-40% of the human proteome, and in my experience they are enriched precisely in signaling and regulatory proteins — which is to say, enriched in drug targets. A model whose competence degrades on disorder is a model whose competence degrades on a large, therapeutically central slice of biology. The TBG result is not AF3 failing; it's AF3 succeeding exactly where its inductive biases are honored, and the disordered targets are AF3 being asked to answer a question its representation can't hold. Protein language models like MAMMAL don't reconstruct the fold — the paper's framing is that they capture the statistical properties of sequences, and those regularities survive even when a stable fold does not. That's not magic. It's a better-matched representation for the substrate.
None of this makes MAMMAL a better structure model. It isn't one — it doesn't model 3D structure at all. It makes MAMMAL a better-matched tool for one specific discriminative question on a specific class of targets. Keep those claims separate or the whole analysis rots.
The load-bearing point: a proxy is not the label
Now the deeper reading, and this is mine, not the paper's. The reason "sequence beat structure" is the wrong frame is that structure was never what got scored. What got scored was a proxy.
AF3's confidence score answers "how sure am I about this predicted 3D complex." Someone then treats a high score as "these things bind." That inference — confidence-in-structure implies binding — is a proxy, and it's a leaky one. A structure model can be entirely, correctly confident about a conformation that has nothing to do with whether two molecules associate under real conditions. Relying on a proxy is not the same as predicting the labeled outcome. It works when the proxy and the label are tightly coupled (rigid, cooperative TBG-like targets) and it silently fails when they decouple (disordered, flexible, glycosylated targets).
MAMMAL wasn't using a proxy. It was fine-tuned on the actual binds/doesn't-bind label, with negatives. That is the entire asymmetry of the experiment compressed into one sentence: one model was trained on the outcome you care about and shown counterexamples; the other was repurposed zero-shot to answer a question adjacent to the one it was built for. When you frame it that way, the surprise evaporates and something more useful takes its place: a reminder that in drug discovery the label is the scarce, expensive, load-bearing asset, and no amount of upstream model sophistication substitutes for having actually measured the thing. That argument generalizes well beyond this paper; I've made the fuller version of it in "The Bottleneck in AI Drug Discovery Isn't the Model. It's the Ground Truth." The AF3 comparison is a clean case study in what happens when you don't have the ground-truth label and reach for a proxy instead.
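The coupling argument can be made concrete with a toy simulation. This is a sketch under loud assumptions: the scores are synthetic Gaussians, `coupling` is a knob I invented to control how strongly the score tracks the label, and nothing here models AF3 or its pTM / ipTM outputs. It only shows how the same scoring procedure moves from strong discrimination to coin-flip as the proxy decouples from the label.

```python
import random


def auroc(scores, labels):
    """Fraction of (binder, non-binder) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def simulate_proxy(coupling: float, n_per_class: int = 500, seed: int = 0):
    """Draw a synthetic confidence-like score for binders and non-binders.

    `coupling` is a made-up knob: it shifts the mean score of binders relative
    to non-binders. At 0.0 the score carries no information about the label.
    """
    rng = random.Random(seed)
    labels = [1] * n_per_class + [0] * n_per_class
    scores = [rng.gauss(coupling * y, 1.0) for y in labels]
    return scores, labels


# Sweep from tightly coupled to fully decoupled.
for coupling in (2.0, 0.5, 0.0):
    scores, labels = simulate_proxy(coupling)
    print(f"coupling={coupling:.1f}  AUROC={auroc(scores, labels):.2f}")
```

Nothing about the score's own precision changes across the three runs. Only its relationship to the label does, and that is the part a zero-shot proxy gives you no way to check without measured negatives.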
Being scrupulously fair to AlphaFold3
Let me not let the table do something the data doesn't support. AF3 is a landmark. Its lineage contributed to the 2024 Nobel Prize in Chemistry, and that recognition is earned. It generates full 3D structures — a fundamentally harder and more general output than a scalar binding score, and something MAMMAL cannot do at all. The comparison here is exploratory: small test sets, one downsampled to 60, and AF3 used off-label as a zero-shot classifier it was never optimized to be. If you handed AF3 the task it's actually great at, this table would look nothing like this. The paper says all of this plainly, and it is not a "MAMMAL dunks on AlphaFold" result. Anyone selling it as one is misreporting it.
The fair statement is narrow and it survives scrutiny: for the specific job of ranking binders against non-binders, on a target class rich in disorder, a fine-tuned sequence model with negative supervision outperformed a structure model's confidence score used as a proxy. That's it. That's the whole claim, and it's enough.
Why the sequence-only framing understates the actual advance
The AF3 story is the viral hook, but it's a sideshow relative to what MAMMAL is. The AF3 comparison is one of eleven benchmarks; MAMMAL reached state-of-the-art on nine and was competitive on two. Some of those numbers are more consequential than the AF3 table. Protein-protein interaction ddG on SKEMPI S1131 went from 0.663 to 0.852 Pearson (+28.5%), sequence-only, landing within 1.6% of the best structure-based method (0.866). Antibody CDRH3 recovery rose from 0.375 to 0.446 (+19%). Cell-type annotation on Zheng68k improved 0.710 to 0.763 F1 (+7.5%). These are the workhorse results.
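The percentages above are relative gains over the previous best, which is easy to misread as absolute points. A short sketch of the arithmetic, using only the numbers quoted in this section; the helper names `relative_gain` and `relative_gap` are mine.

```python
def relative_gain(baseline: float, new: float) -> float:
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


def relative_gap(best: float, value: float) -> float:
    """How far `value` falls short of `best`, as a percent of `best`."""
    return (best - value) / best * 100.0


# (benchmark, metric, previous best, MAMMAL) as quoted in the text above.
results = [
    ("SKEMPI S1131 ddG", "Pearson", 0.663, 0.852),
    ("Antibody CDRH3 recovery", "recovery", 0.375, 0.446),
    ("Zheng68k cell-type annotation", "F1", 0.710, 0.763),
]

for name, metric, baseline, new in results:
    print(f"{name}: {metric} {baseline:.3f} -> {new:.3f} "
          f"({relative_gain(baseline, new):+.1f}%)")

# Gap between MAMMAL and the best structure-based ddG method (0.866 Pearson).
print(f"ddG gap to structure-based best: {relative_gap(0.866, 0.852):.1f}%")
```

Run to one decimal place, the CDRH3 gain comes out at 18.9%, which the text rounds to 19%; the other figures match as quoted.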
What makes them possible is architectural, and it connects directly to why the sequence-only model held up against structural methods. MAMMAL represents small molecules (as SMILES), proteins and antibodies (as amino-acid sequences, no 3D), and gene expression (as ranked gene lists) in one shared sequence-to-sequence framework, with a structured multi-domain prompt syntax and a learned projection that embeds native numerical values as continuous quantities rather than binning them, which matters when you're regressing affinities. It was pretrained on 2 billion samples across six public datasets. I've written separately about why that unification is the real bet in "One Language for Proteins, Molecules, and Cells: The MAMMAL Bet," and the AF3 result is downstream of it: if disordered targets are where structure-first models degrade, a model that treats biology as language rather than geometry inherits an advantage there almost for free.
There's also a signal beyond benchmarks. MAMMAL predicted the exact potency ranking of four drugs absent from its GDSC training data — Carfilzomib > Nintedanib > Infigratinib > Vemurafenib — and wet-lab work using the standard GDSC protocol (CellTiter-Glo viability, 72-hour incubation, IC50 by Prism) confirmed that exact ordering, with three of the four having no structurally similar training compound (Tanimoto < 0.7). Extended across all 805 GDSC cell lines the ordering held in roughly 90-95% of cases. One eyebrow-raiser, flagged by the authors as a hypothesis and not a therapy: Carfilzomib, a proteasome inhibitor approved only for multiple myeloma with limited solid-tumor efficacy, was predicted most potent across diverse solid-tumor lines. That's a repurposing hypothesis worth chasing, nothing more — but it's the kind of hypothesis these models exist to generate.
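Two checks carry that paragraph: whether a held-out drug has any structurally similar training compound, and whether predicted and measured potencies fall in the same order. Here is a minimal sketch of both, assuming fingerprints are available as sets of on-bit indices. The fingerprints, the `drug_a` to `drug_d` names, and the IC50 values are placeholders I made up; only the 0.7 Tanimoto cutoff comes from the text, and I don't know which fingerprint type the authors used.

```python
from typing import Dict, FrozenSet, Iterable, List

SIMILARITY_CUTOFF = 0.7  # threshold quoted in the text


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def max_similarity(query: FrozenSet[int],
                   training: Iterable[FrozenSet[int]]) -> float:
    """Highest similarity between a query compound and any training compound."""
    return max((tanimoto(query, t) for t in training), default=0.0)


def potency_order(ic50: Dict[str, float]) -> List[str]:
    """Drug names sorted most potent first (lower IC50 means more potent)."""
    return sorted(ic50, key=ic50.get)


def same_ordering(predicted: Dict[str, float],
                  measured: Dict[str, float]) -> bool:
    """True if predicted and measured IC50s rank the drugs identically."""
    return potency_order(predicted) == potency_order(measured)


# Placeholder fingerprints: sets of on-bit indices, not real molecules.
training_fps = [frozenset({1, 2, 3, 4}), frozenset({2, 3, 5, 8})]
novel_query = frozenset({1, 9, 10, 11})
similar_query = frozenset({1, 2, 3, 4, 6})

for name, fp in (("novel_query", novel_query), ("similar_query", similar_query)):
    sim = max_similarity(fp, training_fps)
    status = "no close training analog" if sim < SIMILARITY_CUTOFF else "has a close analog"
    print(f"{name}: max Tanimoto {sim:.2f} -> {status}")

# Placeholder IC50 values (arbitrary units), not measurements from the paper.
predicted_ic50 = {"drug_a": 0.01, "drug_b": 1.0, "drug_c": 3.0, "drug_d": 10.0}
measured_ic50 = {"drug_a": 0.02, "drug_b": 2.0, "drug_c": 4.0, "drug_d": 20.0}
print("exact ordering match:", same_ordering(predicted_ic50, measured_ic50))
```

Note that an exact-ordering match is a strict test: it passes on rank alone, regardless of how far the predicted magnitudes sit from the measured ones, and a single swapped pair fails it.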
What to actually take away
Keep two claims distinct and you'll read this paper correctly. First: AlphaFold3 remains a landmark structure model, and nothing here dents that. Second: a structure model's confidence score is a proxy for binding, and proxies decouple from the truth exactly where biology gets flexible — the disordered 30-40% of the proteome that includes many of the targets we most want to drug. MAMMAL won those targets not because sequence is inherently superior to structure, but because it was trained on the labeled outcome, with negatives, for the question being asked.
The failure mode AF3 exhibited here — supremely confident, systematically wrong on the disordered targets — is the failure mode of any system leaning on a proxy instead of the thing you actually measured. About 90% of drug candidates never reach approval; the value of these models is killing the losers earlier and cheaper. That payoff is real only if the signal you trust is the outcome, not a stand-in for it. MAMMAL is open — on Hugging Face and GitHub — so you can check which one you're relying on. The next time a model is "confident," ask what it was actually trained to be confident about.