The most valuable AI system I have shipped does not contain a frontier model at its core. It contains a handful of small, cheap, boring components: a fine-tuned classifier, a couple of extractors, a deterministic router, and exactly one call to a large general model at the one step that genuinely needs open-ended reasoning. This is not what the arms race tells you to build. The arms race tells you capability lives at the frontier and everything else is a fallback you tolerate until the budget improves. That framing is backwards for most production work. Small, specialized, well-orchestrated models are frequently the more robust and more defensible design, and the reasons are architectural, not financial.
Start with what a giant general model actually is when you drop it into a production job: a single component with an enormous, unspecifiable contract. You hand it text, it hands you text, and the mapping between the two is a function you cannot enumerate, cannot bound, and cannot fully evaluate. For an open-ended task — draft a first-pass response to this unusual customer email — that unbounded contract is exactly the point, and I would not want anything else. For a bounded task — is this transaction fraudulent, extract the delivery address from this message, route this ticket to the right queue — the unbounded contract is pure liability. You have taken a job with a narrow, checkable specification and forced it through a component whose behavior you can only sample, never characterize.
A contract you can actually write down
What a small fine-tuned model gives you is not primarily cheapness. It is a contract narrow enough to write down and test against. A classifier that outputs one of six labels has a specification I can state in a sentence and an eval I can run in a loop: here are ten thousand labeled examples, here is precision and recall per class, here is the confusion matrix, here is the threshold at which I gate. When it regresses, I see the number move. When I change the training data, I re-run the suite and I know — not hope — whether I made it better.
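A minimal sketch of what that eval loop can look like, in plain Python. Everything named here is an assumption for illustration: the three labels stand in for the six, the tiny lists stand in for the ten thousand labeled examples, and the precision and recall floors are placeholders you would set per domain. The gate shown is a regression gate on the suite, not the classifier's own confidence threshold.

```python
from collections import Counter

def per_class_metrics(y_true, y_pred, labels):
    """Return {label: (precision, recall)} from parallel lists of labels."""
    # Counting (true, predicted) pairs is the confusion matrix in sparse form.
    confusion = Counter(zip(y_true, y_pred))
    metrics = {}
    for label in labels:
        true_pos = confusion[(label, label)]
        predicted = sum(n for (_, p), n in confusion.items() if p == label)
        actual = sum(n for (t, _), n in confusion.items() if t == label)
        precision = true_pos / predicted if predicted else 0.0
        recall = true_pos / actual if actual else 0.0
        metrics[label] = (precision, recall)
    return metrics

def passes_gate(metrics, min_precision, min_recall):
    """True only if every class clears both floors (hypothetical floors)."""
    return all(p >= min_precision and r >= min_recall
               for p, r in metrics.values())

# Hypothetical labels and toy data, standing in for a real labeled set.
LABELS = ["billing", "shipping", "fraud"]
y_true = ["billing", "billing", "shipping", "fraud", "fraud", "shipping"]
y_pred = ["billing", "shipping", "shipping", "fraud", "fraud", "shipping"]

metrics = per_class_metrics(y_true, y_pred, LABELS)
for label, (precision, recall) in metrics.items():
    print(f"{label:10s} precision={precision:.2f} recall={recall:.2f}")
print("gate passed:", passes_gate(metrics, min_precision=0.6, min_recall=0.5))
```

Run on every change to the training data or the model, a script like this is what turns "I think it got better" into a number that either moved or did not.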
Try to write that eval for a frontier model doing the same job through a prompt. You can, sort of, but you are now evaluating a general reasoner on a narrow slice of its behavior, and every model update upstream can shift that slice in ways unrelated to anything you did. The provider improves the model on coding and your address extractor quietly changes its handling of apartment numbers. The contract you depend on was never yours to specify. You were renting behavior at the edge of a system optimized for something else.
This connects directly to the failure mode I have written about as the compounding-error problem: in any multi-step pipeline, per-step reliability multiplies, and a chain of steps each 95 percent reliable is a coin flip by the time you reach step fourteen. The only defense against that geometry is gating — checking the output of each step against a contract before it flows downstream. And you cannot gate what you cannot specify. Small components with narrow contracts are precisely the units you can wrap a validator around. A frontier model's output is often too open-ended to gate on anything stronger than "does it parse," which is not a gate, it is a hope. The composable architecture is not a stylistic preference sitting beside the compounding-error argument. It is what the compounding-error argument demands.
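The arithmetic and the gate can both be sketched in a few lines of Python. The reliability function assumes steps fail independently, which real pipelines only approximate. The router, the queue names, and the `gated` wrapper are hypothetical stand-ins, not a description of any particular system.

```python
def chain_reliability(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent failures."""
    return per_step ** steps

print(round(chain_reliability(0.95, 13), 3))  # 0.513
print(round(chain_reliability(0.95, 14), 3))  # 0.488

class ContractViolation(Exception):
    """Raised when a step's output falls outside its written contract."""

def gated(step, validator):
    """Wrap a step so its output is checked before it flows downstream."""
    def run(value):
        output = step(value)
        if not validator(output):
            raise ContractViolation(f"{step.__name__} returned {output!r}")
        return output
    return run

# Hypothetical narrow contract: the router must return one of these queues.
ALLOWED_QUEUES = {"billing", "shipping", "fraud"}

def route_ticket(text: str) -> str:
    """Toy stand-in for a small fine-tuned routing classifier."""
    return "billing" if "invoice" in text.lower() else "unknown"

safe_route = gated(route_ticket, lambda queue: queue in ALLOWED_QUEUES)

print(safe_route("My invoice is wrong"))
try:
    safe_route("hello there")
except ContractViolation as err:
    print("stopped at the seam:", err)
```

The validator here is a set-membership check because the contract is a closed label set. That is the point: the narrower the contract, the stronger the check you can afford to write.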
Legible failure is a feature
When a small classifier fails, it fails in a way you can read. Recall drops on a specific class. A new merchant category shows up that wasn't in the training distribution and the confidence scores cluster near the threshold. The failure has a shape, a location, and usually a fix: add examples, adjust the boundary, split the class. I can stand in front of the confusion matrix and point at the problem.
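One way to make that clustering visible, sketched in Python under stated assumptions: the decision threshold of 0.5, the margin of 0.1, the "twice the baseline" alert rule, and both score lists are all hypothetical, chosen only to show the shape of the check.

```python
def near_threshold_share(scores, threshold, margin):
    """Fraction of confidence scores within `margin` of the decision threshold."""
    if not scores:
        return 0.0
    near = sum(1 for s in scores if abs(s - threshold) <= margin)
    return near / len(scores)

THRESHOLD = 0.5  # assumed decision boundary
MARGIN = 0.1     # assumed width of the "uncertain" band

# Toy scores: a confident baseline week versus a week with unfamiliar inputs.
baseline_scores = [0.05, 0.10, 0.92, 0.97, 0.88, 0.03, 0.55, 0.95]
recent_scores = [0.45, 0.52, 0.58, 0.41, 0.90, 0.49, 0.05, 0.56]

baseline_share = near_threshold_share(baseline_scores, THRESHOLD, MARGIN)
recent_share = near_threshold_share(recent_scores, THRESHOLD, MARGIN)

# Assumed alert rule: flag when the uncertain share more than doubles.
drift_suspected = recent_share > 2 * baseline_share
print(f"baseline={baseline_share:.3f} recent={recent_share:.3f} "
      f"drift_suspected={drift_suspected}")
```

A check this crude will not tell you which class is in trouble, but it tells you where to look, and it costs almost nothing to run on every batch.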
When a frontier model fails inside a pipeline, it fails in prose. It confidently returns a plausible wrong answer with no signal that anything went wrong. The failure is delocalized: was it the prompt, the retrieved context, a model update, an adversarial input, or genuine ambiguity in the task? You are now debugging a system whose internal state is a paragraph of natural language that looks fine. I have spent enough time in the reproducibility problems of computational biology to hold a strong prior here. An epigenetic aging clock can be beautifully accurate and still be tracking a batch effect (the plate a sample ran on, the technician, the reagent lot) rather than biology. The danger is never the error you can see. It is the confident, well-formed output that is wrong for a reason invisible in the output itself. A large opaque model in a production loop is a batch-effect generator by construction. A small model with a legible failure surface is the closest thing to a controlled assay you can get.
There is also the boring matter of the loop. Bounded jobs run at volume, often in tight latency budgets, sometimes many times per request. A small model runs there comfortably: milliseconds, cents per million calls, predictable. A frontier model in the same position is slow, expensive per call, and rate-limited by someone else's capacity. You feel this most where you want to call the model speculatively: score every candidate, check every intermediate step, run the validator on every output. Cheap components let you afford paranoia, and constant checking is how you beat the compounding-error geometry. Expensive components force you to ration the very checks that keep the system reliable.
Where this is wrong
I want the boundary honest, because the small-model case is routinely overstated by people who have never had to ship the open-ended step.
Frontier models genuinely win when the task is open-ended, high-variety, and low-volume. If the input distribution is enormous and you cannot enumerate the cases — free-form user requests, synthesis across messy heterogeneous context, planning a novel multi-step task — then the generality of a large model does real work no small specialized model can replicate. Trying to decompose a genuinely open-ended task into a lattice of small classifiers is its own failure mode: you build a brittle expert system, spend six months on edge cases, and lose to a single prompt. The specialization advantage is real only where the task is actually bounded. Where it isn't, specialization is a trap.
And the cost advantage is eroding. This is the part small-model advocates skip. As I have argued elsewhere, in writing about the coming collapse in inference pricing, the marginal cost of frontier-scale inference is falling fast enough to break current pricing models entirely. When a large model gets cheap enough to run in a loop, one pillar of the small-model case, the tight-loop economics, weakens considerably. If price parity arrives, why maintain a fine-tuned component and its training pipeline when a prompt against a commoditized giant does the job?
Here is why, and it is the load-bearing point. Cheap inference removes the cost objection. It does not remove the specification objection, the evaluation objection, or the legibility objection. Even at zero marginal price, the frontier model still has an unbounded contract you cannot fully test, still fails in prose, still shifts under you when the provider ships an update. Cost was the least durable of the small-model advantages. The ones that survive cheap inference are the architectural ones: a contract you can write down, an eval that passes or fails, a failure you can read. Falling inference cost changes the arithmetic on the open-ended step. It does not turn a bounded, gate-able job into one you should hand to a black box.
What commoditization actually does to your moat
Follow the inference-cost argument to its conclusion and the strategic picture inverts. If access to the frontier is becoming a commodity, and everyone can call a near-state-of-the-art model for cents, then the frontier is precisely where you have no advantage. Your competitor calls the same model. The durable, defensible parts of the system are the ones the market cannot buy off a shelf: your orchestration, the proprietary small models trained on data only you have, the evals that encode what "correct" means in your domain, the gates that make the whole thing reliable enough to trust.
At Kommerce this is not abstract. The value in scoring a cash-on-delivery order — will this buyer accept the package or ghost the courier — is not in having a smarter general reasoner. Anyone can rent that. It is in a small model trained on delivery outcomes no one else possesses, wired into a pipeline whose behavior we can measure order by order. The frontier model, when we use it, handles the genuinely open step: making sense of a weird free-text address or an unusual message from a merchant. Everything bounded is small, owned, and tested. The moat is the composition and the proprietary pieces. It was never the size of the model.
So the operating rule I actually follow: default to the smallest model that passes the eval. Write the eval first. If you cannot specify what "correct" means for a step, that step is open-ended and belongs to the frontier, behind a validating gate. Compose deliberately, gate every seam, and spend your frontier calls where variety is genuinely irreducible. The impressive model is the one you reach for last, not first.
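The first half of that rule, "default to the smallest model that passes the eval," is mechanical enough to sketch in Python. The `Candidate` type, the `size_rank` field, the accuracy floor, and the toy predictors below are all hypothetical. A real eval would use per-class metrics on a held-out set rather than plain accuracy on four examples.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

@dataclass(frozen=True)
class Candidate:
    name: str
    size_rank: int                  # lower means smaller and cheaper (assumed ordering)
    predict: Callable[[str], str]   # text in, label out

def accuracy(candidate: Candidate, examples: Sequence[Tuple[str, str]]) -> float:
    """Share of examples where the candidate's label matches the expected one."""
    correct = sum(candidate.predict(text) == label for text, label in examples)
    return correct / len(examples)

def smallest_passing(candidates: Sequence[Candidate],
                     examples: Sequence[Tuple[str, str]],
                     min_accuracy: float) -> Optional[Candidate]:
    """Try candidates from smallest to largest; return the first that clears the floor."""
    for candidate in sorted(candidates, key=lambda c: c.size_rank):
        if accuracy(candidate, examples) >= min_accuracy:
            return candidate
    return None  # nothing passes: the step may not be bounded after all

# The eval is written first; the models are judged against it.
EVAL_SET = [
    ("refund my invoice", "billing"),
    ("where is my parcel", "shipping"),
    ("invoice is wrong", "billing"),
    ("parcel arrived damaged", "shipping"),
]

def keyword_rule(text: str) -> str:
    return "billing" if "invoice" in text else "shipping"

candidates = [
    Candidate("large", 3, keyword_rule),        # stand-in for a frontier call
    Candidate("tiny", 1, lambda text: "billing"),
    Candidate("small", 2, keyword_rule),
]

chosen = smallest_passing(candidates, EVAL_SET, min_accuracy=0.9)
print(chosen.name if chosen else "no candidate passes")
```

The `None` branch is worth keeping explicit. When no small candidate clears the floor, that is evidence about the task, and it is the cue to move the step to the frontier call behind a gate.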
Everyone else is racing to build on the largest model available. The more interesting question is how small each piece can be while still passing its test.