Applied AI

Your AI Is a Correlation Engine Pointed at Causal Decisions

Every model that ranks "what drives outcome Y" hands you a correlation, but you spend money on causes. The gap between the two is where data-driven companies quietly bleed, and more data makes it worse.

By Mehdi · June 24, 2026 · 8 min read

On this page

Confounding: the churn model that flags a symptom
Reverse causation: "our best reps all use feature Y"
Collider bias: the correlation you manufactured by choosing your data
Why more data and a bigger model make it worse, not better
The test: what would a randomized version of this decision show?

Every machine-learning model that tells you "these are the top drivers of outcome Y" is handing you a correlation, and every dollar you move in response is a bet on a cause. Those are not the same object, and the distance between them is not a rounding error. It is the specific place where data-driven companies lose money while believing they are being rigorous. The dashboard says the decision is evidence-based. The evidence is answering a different question than the one the decision asked.

I did not learn this in a business. I learned it in microbiome and aging research, where an entire field spent the better part of a decade discovering that "bacterium X is associated with disease Y" is worth almost nothing as a basis for action. Hundreds of papers reported robust, replicated associations between gut composition and everything from obesity to depression. Then people tried to intervene, transplanting the "healthy" community or supplementing the "protective" strain, and most of it evaporated. The associations were real. They were also causally empty. The same failure modes that hollowed out those findings are running, unnoticed, inside your churn model, your rep-productivity analysis, and your conversion attribution right now.

There are three of them, and they are worth naming precisely, because each one produces a confident, well-fit, cross-validated model that points you in exactly the wrong direction.

Confounding: the churn model that flags a symptom

Confounding is the one everyone half-remembers and still gets wrong. A confounder is a common cause of both your predictor and your outcome. It makes two things move together without either one moving the other.

In the microbiome case, diet is the confounder that ate the field. A particular bacterial genus is lower in people with metabolic disease. It is also strongly shaped by fiber intake, which independently drives metabolic health. The bug and the disease are correlated because a third thing, what the person eats, moves both. Kill the bug, add the bug, adjust the bug: nothing happens, because you were never touching the lever.

Now your churn model. It surfaces, with high feature importance, that accounts which stopped using your reporting module in the last 30 days churn at five times the base rate. The instinct, and I have watched smart operators follow it, is to launch a campaign to drive reporting-module usage. But declining usage of a core feature is a symptom of the same disengagement that causes the cancellation. The confounder is the account's underlying health: a champion left, the budget got scrutinized, the project that justified the purchase wrapped. That decline drives both the dropped module usage and the churn. Pushing users back into the module is transplanting the bacterium. You will move the metric you targeted and not the outcome you wanted, because you optimized a symptom.

The model was not wrong. It correctly identified a predictor. Prediction was never the problem. The problem is that you cannot act on a predictor as if it were a cause without an argument that it is one, and the model contains no such argument. It cannot. Nothing in a loss function distinguishes a cause from its shadow.
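A minimal simulation makes the gap concrete. Everything in it is an assumption made for illustration: the hidden `healthy` flag stands in for account health, the probabilities are invented, and module usage is given no causal effect on churn by construction. It is a sketch of the mechanism, not a model of any real customer base.

```python
import random


def simulate_accounts(n, seed, force_usage=None):
    """Synthetic accounts where hidden health drives both usage and churn.

    force_usage=None leaves usage to the account (observational data).
    force_usage=True/False sets usage for every account (an intervention).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        healthy = rng.random() < 0.7  # hidden confounder (assumed rate)
        if force_usage is None:
            uses_module = rng.random() < (0.9 if healthy else 0.2)
        else:
            uses_module = force_usage
        # Churn depends on health only: usage has zero causal effect here.
        churned = rng.random() < (0.05 if healthy else 0.50)
        rows.append((uses_module, churned))
    return rows


def churn_rate(rows, uses_module):
    """Churn rate among accounts with the given usage status."""
    subset = [churned for used, churned in rows if used == uses_module]
    return sum(subset) / len(subset)


# What the churn model sees: non-users churn far more than users.
observed = simulate_accounts(100_000, seed=0)
obs_gap = churn_rate(observed, False) - churn_rate(observed, True)

# What the campaign does: sets usage directly, leaving health untouched.
forced_on = simulate_accounts(100_000, seed=1, force_usage=True)
forced_off = simulate_accounts(100_000, seed=2, force_usage=False)
causal_gap = churn_rate(forced_off, False) - churn_rate(forced_on, True)

print(f"observed churn gap (non-users minus users): {obs_gap:.3f}")
print(f"gap when usage is set by intervention:      {causal_gap:.3f}")
```

Under these assumptions the observed gap is large and the intervention gap sits near zero. The feature still predicts churn perfectly well; it simply is not a handle on it.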

Reverse causation: "our best reps all use feature Y"

The second failure mode is the arrow pointing backward. In disease research it showed up constantly: the gut of a sick patient looks different because the illness — the inflammation, the medication, the altered eating — reshaped the microbiome. The disease caused the signature, and the field read the signature as causing the disease. The correlation is identical in both directions; the data alone cannot tell you which way the arrow runs.

The business version is seductive because it flatters your best people. Analysis shows your top-decile sales reps use the new multi-thread deal-collaboration feature far more than everyone else. The conclusion writes itself: roll out the feature, drive adoption, lift the whole team toward top-decile performance. Then you mandate it, and nothing moves.

The arrow was backward. Good reps did not become good by using the feature. They were already good — organized, methodical, running complex multi-stakeholder deals that actually need collaboration tooling — and that pre-existing competence is what drove them to adopt it. The feature is a marker of the reps who were going to outperform anyway, in the same way a sick patient's microbiome is a marker of illness already underway. Forcing the tool onto median reps gives them a tool for deals they are not running. You have confused the fingerprint for the fist.
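The same pattern can be sketched in a few lines. The numbers are made up and the variable names (`skill`, `attainment`) are hypothetical: reps opt into the feature because they are already skilled, and the feature contributes nothing to quota attainment by construction. The only thing that changes between the two runs is who decides adoption, the rep or a coin.

```python
import random
from statistics import mean


def simulate_reps(n, seed, randomize):
    """Synthetic reps; the feature has no effect on attainment by design."""
    rng = random.Random(seed)
    reps = []
    for _ in range(n):
        skill = rng.gauss(0, 1)  # pre-existing competence
        if randomize:
            uses_feature = rng.random() < 0.5  # coin flip decides adoption
        else:
            # Skilled reps running complex deals are the ones who opt in.
            uses_feature = skill + rng.gauss(0, 0.5) > 1.0
        attainment = 100 + 20 * skill + rng.gauss(0, 10)  # % of quota
        reps.append((uses_feature, attainment))
    return reps


def adoption_gap(reps):
    """Mean attainment of feature users minus mean attainment of non-users."""
    users = [a for used, a in reps if used]
    others = [a for used, a in reps if not used]
    return mean(users) - mean(others)


observational_gap = adoption_gap(simulate_reps(20_000, seed=3, randomize=False))
randomized_gap = adoption_gap(simulate_reps(20_000, seed=4, randomize=True))

print(f"gap when reps self-select into the feature: {observational_gap:.1f}")
print(f"gap when adoption is assigned at random:    {randomized_gap:.1f}")
```

In this toy world the self-selected gap is tens of quota points and the randomized gap is indistinguishable from zero, which is what "mandate it, and nothing moves" looks like in data.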

This is the same statistical naivety that lets regression to the mean eat your growth numbers: in both cases a purely observational pattern gets read as a lever, and the intervention built on it underperforms for reasons that were baked into the selection, not the treatment.

Collider bias: the correlation you manufactured by choosing your data

The third mode is the subtle one, the one that even quantitatively literate teams almost never see, because the bias does not come from a variable you forgot to include. It comes from the rows you chose to look at.

A collider is a common effect of two variables. Conditioning on a confounder removes bias; conditioning on a collider creates it. And "conditioning" is not some exotic statistical operation. Filtering your dataset is conditioning. "Let's analyze our converted customers." "Let's look at patients who were hospitalized." Every WHERE clause that selects on an outcome is a silent act of conditioning, and if the thing you selected on is a collider, the correlations you generate are not merely noisy: they can be absent from, or even reversed relative to, the population you care about.

The clean medical illustration is Berkson's paradox. Take two independent conditions in the general population. Among hospitalized patients, they show up negatively correlated, because being admitted is a common effect of both: having either one is enough to land you in the hospital, so within that filtered group, not having one predicts having the other. The correlation is an artifact of who got selected in.

Here is the shape it takes in your funnel. Suppose two independent things push an account toward conversion: genuine product-fit and heavy sales-touch. Either one, if strong enough, can close the deal on its own. Now your growth team studies "what characterizes our converted customers," analyzing, naturally, only customers who converted. Inside that set, product-fit and sales-touch will appear negatively correlated. The mechanism is exactly Berkson: a customer who converted despite weak product-fit almost certainly got there via heavy sales-touch, and vice versa. Conversion is the collider you conditioned on by filtering the dataset.

The team reads the output and concludes that heavy sales engagement is associated with worse-fit accounts — "our high-touch deals are our weakest customers" — and cuts the sales motion to "focus on product-led accounts that fit naturally." They have just killed one of the two independent engines of conversion, on the strength of a correlation that exists nowhere in the real population and was manufactured entirely by the decision to look only at closed deals. No feature was omitted. The data-generating process was fine. The WHERE converted = true did all the damage.
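Berkson's effect is easy to reproduce. The sketch below assumes two independent standard-normal scores, `fit` and `touch`, and a conversion rule where either one above an arbitrary threshold closes the deal. Those distributions and the threshold are illustrative choices, not estimates from a real funnel.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(42)
accounts = []
for _ in range(50_000):
    fit = rng.gauss(0, 1)    # product-fit, independent of touch
    touch = rng.gauss(0, 1)  # sales-touch, independent of fit
    converted = fit > 1.0 or touch > 1.0  # either engine alone can close
    accounts.append((fit, touch, converted))

# Full population: the two drivers are unrelated.
corr_all = pearson([a[0] for a in accounts], [a[1] for a in accounts])

# The equivalent of WHERE converted = true.
won = [a for a in accounts if a[2]]
corr_converted = pearson([a[0] for a in won], [a[1] for a in won])

print(f"correlation, all accounts:       {corr_all:+.3f}")
print(f"correlation, converted accounts: {corr_converted:+.3f}")
```

The full-population correlation is near zero, as built. Among converted accounts it turns clearly negative, with no change to the data other than the filter.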

Why more data and a bigger model make it worse, not better

The reflexive fix for a wrong answer in a data organization is more data and a better model. For all three of these failures, that instinct is not merely unhelpful; it is actively harmful, and the reason is worth being precise about.

Causal identification is a property of the data-generating process and of what you observe or hold fixed, not a property of sample size or model capacity. If a confounder is unmeasured, the bias in your estimate has a fixed direction and magnitude set by the structure of the world. More rows shrink your confidence interval around that biased point. A bigger model fits the biased relationship more faithfully. You converge, with ever-tightening error bars and ever-more-impressive validation curves, on a precise estimate of the wrong quantity.
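To see the convergence on the wrong number, here is a rough sketch with an unmeasured confounder `u` and a one-variable least-squares fit. The setup is assumed for illustration: the true effect of `x` on `y` is 0.5, and with these variances the naive slope converges to 0.5 + cov(x, u)/var(x) = 1.0, regardless of sample size.

```python
import math
import random


def naive_slope(n, seed, true_effect=0.5):
    """Least-squares slope of y on x, omitting the confounder u.

    Returns (slope, standard_error).
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        u = rng.gauss(0, 1)                         # unmeasured confounder
        x = u + rng.gauss(0, 1)                     # u pushes the predictor
        y = true_effect * x + u + rng.gauss(0, 1)   # and the outcome
        xs.append(x)
        ys.append(y)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((y - intercept - slope * x) ** 2 for x, y in zip(xs, ys))
    se = math.sqrt(rss / (n - 2) / sxx)
    return slope, se


small = naive_slope(1_000, seed=0)
large = naive_slope(100_000, seed=0)

for label, (slope, se) in (("n=1,000", small), ("n=100,000", large)):
    print(f"{label:>10}: slope {slope:.3f} +/- {1.96 * se:.3f} (true effect 0.5)")
```

A hundred times more rows cuts the interval by roughly a factor of ten, and the interval still excludes the true effect. The extra data bought precision, not validity.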

This is the genuinely dangerous part, and it is counterintuitive to people trained to trust tighter intervals. A noisy wrong answer at least announces its uncertainty; you hedge. A biased answer estimated to three decimal places arrives wearing the full costume of rigor. The p-value is beautiful. The model shipped through review. Everyone in the room is quantitatively sophisticated. And the number is confidently, precisely, expensively wrong. It is a close cousin of the way a language model produces a fluent, well-structured, entirely fabricated answer, a failure I've written about through the lens of a physician's differential diagnosis. Fluency and calibration are orthogonal. So are precision and validity. Scaling optimizes the first and leaves the second exactly where it was.

The test: what would a randomized version of this decision show?

Here is the operational discipline, and it costs nothing to adopt. Before you act on any model-derived "driver," ask one question: what would a randomized experiment reveal? Imagine reaching into the population and flipping the supposed lever at random (forcing the feature on a random half of reps, or randomly assigning the sales-touch), so that chance, not the reps or accounts themselves, decides who gets it and everything else balances out on average. If, in that imagined experiment, the outcome would not move, then your correlation is confounded, reversed, or collider-induced, and the money you are about to spend is a donation.

That thought experiment is free and it filters most of the bad decisions, because the moment you picture randomizing feature adoption across reps, the reverse-causation problem becomes obvious: you already sense the median rep won't suddenly close enterprise deals. The intuition was available; the dashboard just talked over it.

Then, for the decisions that survive the thought experiment, approximate the real thing. Randomization is the gold standard, not the only tool:

Holdout. Withhold the intervention from a random slice and measure the delta. The cheapest counterfactual you can buy.
Staged rollout. Ship to region A this quarter, region B next, and read the difference before full deployment. You wanted to sequence the launch anyway.
Natural experiments. A pricing change that hit one cohort and not another, an outage that suppressed one channel, an eligibility threshold that arbitrarily sorts customers just above and below a cutoff. Regression discontinuity around that cutoff is about as close as you get to a randomized trial the world ran for free, at least for the customers near the threshold.
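The holdout is the simplest of these to read out. A minimal sketch, assuming a binary conversion outcome and a random 10% holdout: `holdout_lift` is a hypothetical helper, the synthetic base rate and lift are invented for the demo, and the interval uses a plain normal approximation rather than anything more careful.

```python
import math
import random


def holdout_lift(treated, holdout):
    """Difference in conversion rate (treated minus holdout).

    Returns (lift, (ci_low, ci_high)) using a normal-approximation 95% interval.
    """
    p_t = sum(treated) / len(treated)
    p_h = sum(holdout) / len(holdout)
    lift = p_t - p_h
    se = math.sqrt(
        p_t * (1 - p_t) / len(treated) + p_h * (1 - p_h) / len(holdout)
    )
    return lift, (lift - 1.96 * se, lift + 1.96 * se)


# Synthetic demo data: base rate and true lift are assumptions.
rng = random.Random(7)
base_rate, true_lift = 0.20, 0.03
treated, holdout = [], []
for _ in range(100_000):
    in_holdout = rng.random() < 0.10  # random slice that does not get the change
    p = base_rate if in_holdout else base_rate + true_lift
    converted = rng.random() < p
    (holdout if in_holdout else treated).append(converted)

lift, (ci_low, ci_high) = holdout_lift(treated, holdout)
print(f"estimated lift: {lift:.3f}  (95% CI {ci_low:.3f} to {ci_high:.3f})")
```

Because assignment to the holdout is a coin flip, the difference estimates what the intervention does rather than who chose it. If the interval comfortably includes zero, that is the lever telling you it is not connected to much.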

In building Kommerce, where the entire product is a bet about what makes a cash-on-delivery buyer in a low-trust market actually complete a purchase, this is the difference between a roadmap and a superstition. Every "buyers who do X convert better" is a hypothesis with three ways to be an artifact, and the only honest way to promote one to a cause is to withhold it from somebody and watch. A holdout you were tempted to skip is not lost revenue. It is the price of knowing whether the lever is connected to anything.

Most of what a data-driven company calls "insight" is a description of the world as it already sorted itself, dressed as a description of what happens when you push. Your models are extraordinary at the first and structurally silent about the second. They will never tell you they've gone quiet, because a correlation and a cause produce identical output right up until the moment you spend money on the difference.

That moment is the only test that was ever real. Everything before it is a hypothesis wearing a lab coat.

Frequently asked questions

Doesn't a sufficiently large model with enough features eventually recover the causal structure?

No. Causal identification is a property of the data-generating process and what you observe or hold fixed, not of sample size or model capacity. If a confounder is unmeasured, or you've conditioned on a collider, no amount of data or model flexibility removes the bias; it just estimates the biased quantity more precisely. Bigger models sharpen the wrong number.

We can't randomize most decisions. Is causal inference just an academic luxury?

Randomization is the gold standard, not the only tool. Staged rollouts, geographic holdouts, regression-discontinuity around eligibility thresholds, and instrumental variables from natural experiments all approximate the counterfactual. The practical discipline is not 'always run an RCT'. It is to state what a randomized version of your decision would reveal, then find the cheapest credible approximation before you reallocate budget.

How is collider bias different from ordinary confounding?

A confounder is a common cause of your predictor and your outcome; conditioning on it (or adjusting for it) removes bias. A collider is a common effect of two variables; conditioning on it creates bias where none existed. Selecting your dataset ('only converted customers', 'only hospitalized patients') is a silent act of conditioning, and if the selection variable is a collider, you manufacture correlations that vanish or even reverse in the full population.

Filed under Applied AI. AI that ships, not AI that demos.
