Model Collapse Is Path Dependence Hitting AI Training Data

Shumailov et al. published the model collapse finding in Nature in 2024. When AI models are trained on data that earlier AI models generated, they progressively lose the true data distribution. The tails disappear first. Diversity drops. Errors compound across generations. A 2025 study estimated that over half of web text may already be AI generated, so this is not a theoretical future problem. It is already happening across the internet.

I read the paper and kept thinking about path dependence. Arthur (1989) showed that technologies can lock in through small historical events amplified by increasing returns. David (1985) described the QWERTY keyboard as exactly this kind of lock in, a standard that outlived the conditions that made it reasonable. Model collapse is the same mechanism applied to the content of training data. The first models were trained on human generated text. The second generation was trained on text that included output from the first. The third was trained on text that contained even more AI output. Each iteration narrows the distribution. The system locks into a progressively degraded version of reality because the tail events (the unusual arguments, the rare syntactic choices, the minority perspectives) get pruned in each generation. Once pruned, they are never regenerated, because there is no fresh human input to restore them.
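
The pruning step can be made concrete with a toy simulation. This is a minimal sketch under assumptions of my own, not the setup from the Shumailov et al. paper: the "model" is just a table of token frequencies, the "human" distribution is a Zipf-like curve, and every name and parameter here (zipf_distribution, next_generation, simulate_collapse, the vocabulary and sample sizes) is hypothetical. Each generation samples a finite corpus from the previous model and refits by counting. A token that draws zero samples gets probability zero and can never come back.

```python
import random
from collections import Counter

def zipf_distribution(vocab_size):
    """Toy 'human' distribution: token k has weight 1/(k+1), so most tokens are rare."""
    weights = [1.0 / (k + 1) for k in range(vocab_size)]
    total = sum(weights)
    return [w / total for w in weights]

def next_generation(probs, sample_size, rng):
    """Sample a corpus from the current model, then refit by raw token frequency."""
    tokens = rng.choices(range(len(probs)), weights=probs, k=sample_size)
    counts = Counter(tokens)
    return [counts.get(k, 0) / sample_size for k in range(len(probs))]

def simulate_collapse(vocab_size=1000, sample_size=2000, generations=10, seed=0):
    """Return the number of tokens with nonzero probability after each generation."""
    rng = random.Random(seed)
    probs = zipf_distribution(vocab_size)
    support_sizes = [sum(p > 0 for p in probs)]
    for _ in range(generations):
        probs = next_generation(probs, sample_size, rng)
        support_sizes.append(sum(p > 0 for p in probs))
    return support_sizes

if __name__ == "__main__":
    for gen, size in enumerate(simulate_collapse()):
        print(f"generation {gen}: {size} distinct tokens survive")
```

The count of surviving tokens can only stay flat or fall from one generation to the next, which is the lock in: nothing inside the loop can reintroduce a token the previous generation dropped. Real language models are far more complex than a frequency table, so treat this as an illustration of the mechanism and not as a measurement.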

This is where I think absorptive capacity becomes the right frame. Cohen and Levinthal (1990) defined it as the organizational ability to recognize the value of new external information, assimilate it, and exploit it for commercial ends. They made two claims that map closely onto what model collapse demonstrates. First, absorptive capacity is path dependent. Prior related knowledge determines what you can absorb next. Second, absorptive capacity is cumulative. Learning enables more learning. Both claims apply to neural networks as directly as they apply to organizations. Take a language model trained on one hundred thousand tokens of human text followed by one hundred million tokens of AI generated text. Human text makes up roughly 0.1 percent of what it has seen, so its prior is degraded. Its absorptive capacity for the next round of training is degraded too, because the prior knowledge that should enable recognition and assimilation of new external information is itself mostly AI generated. The model cannot recognize what it was never trained to see. The tail events that the first generation still captured become invisible by the third generation.

I keep returning to the Zahra and George (2002) distinction between potential and realized absorptive capacity. Potential capacity covers acquiring and assimilating knowledge. Realized capacity covers transforming and exploiting it. Model collapse is mostly a potential capacity problem. The model has access to data, but the data itself has shifted systematically away from the true distribution. It looks like valid training material, but it is not. The quality has degraded in a way that standard training metrics do not catch. A model trained on AI output still scores well on next token prediction when it is evaluated on similar text, because that text is smooth and predictable and lacks the jagged edges of human language. The metric says learning is happening. The distribution says the opposite.
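
The gap between the metric and the distribution can be shown with a few lines of arithmetic. This is a rough sketch under assumptions of my own: the "human" distribution is a Zipf-like curve over 1,000 tokens, the "collapsed" model is that same curve with everything outside the top 100 tokens cut off and renormalized, and the helper names (entropy, cross_entropy, truncate_tail) and the probability floor are hypothetical choices for illustration. Loss is measured in bits per token, where lower looks better.

```python
import math

def zipf_distribution(vocab_size):
    """Toy 'human' distribution: token k has weight 1/(k+1)."""
    weights = [1.0 / (k + 1) for k in range(vocab_size)]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(p):
    """Bits per token for a model that exactly matches the text it is scored on."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def cross_entropy(data_dist, model_dist, floor=1e-12):
    """Bits per token when model_dist is scored on text drawn from data_dist.

    The floor is an assumed stand-in for smoothing, so tokens the model
    has never seen cost a large but finite number of bits.
    """
    return -sum(d * math.log2(max(m, floor))
                for d, m in zip(data_dist, model_dist) if d > 0)

def truncate_tail(p, keep):
    """Hypothetical collapsed model: keep the `keep` most likely tokens, renormalize."""
    ranked = sorted(range(len(p)), key=lambda k: p[k], reverse=True)
    kept = set(ranked[:keep])
    mass = sum(p[k] for k in kept)
    return [p[k] / mass if k in kept else 0.0 for k in range(len(p))]

human = zipf_distribution(1000)
collapsed = truncate_tail(human, keep=100)

loss_on_own_output = entropy(collapsed)                 # what the training loop sees
best_possible_on_human = entropy(human)                 # a perfect model of human text
loss_on_human_text = cross_entropy(human, collapsed)    # what the world sees

if __name__ == "__main__":
    print(f"collapsed model on its own output: {loss_on_own_output:.2f} bits/token")
    print(f"perfect model on human text:       {best_possible_on_human:.2f} bits/token")
    print(f"collapsed model on human text:     {loss_on_human_text:.2f} bits/token")
```

Scored on its own kind of text, the collapsed model posts a lower loss than a perfect model of human language could ever achieve on human text. Scored on human text, it does much worse. Which number you see depends entirely on where the evaluation data comes from.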

The organizational learning parallel is hard to ignore. Firms that rely on their own past decisions instead of external input converge on suboptimal routines over time. They optimize for internal consistency rather than external fit. March (1991) described this as the risk of exploitation crowding out exploration. You get better at what you already do, but you lose the capacity to discover what you do not yet know. The same logic applies at internet scale to AI training pipelines. Synthetic data strategies that look efficient in the short run create structural degradation in the long run because they cut off the external input that the learning system needs to maintain its distribution. The model becomes an increasingly narrow mirror of its own past outputs.

I think there is something uncomfortable here for the organizations investing heavily in synthetic data pipelines. Not all synthetic data is bad. Controlled synthetic data augmentation with known statistical properties is a legitimate research method, as the IS methods literature has shown. But feeding model output back into the training set as if it were equivalent to human generated text is different. It corrupts the prior that absorptive capacity depends on. The prior knowledge that the model uses to recognize and assimilate new information must reflect the real distribution. If it reflects only the model's own previous output, the learning loop closes onto itself.
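
The difference between a closed loop and one that keeps an external input can also be sketched as a toy experiment. The assumptions are mine: the same frequency table "model" and Zipf-like "human" distribution as a stand-in for real text, and a hypothetical human_share parameter that sets what fraction of each generation's training corpus is freshly drawn human data. None of this reflects any specific production pipeline.

```python
import random
from collections import Counter

def zipf_distribution(vocab_size):
    """Toy 'human' distribution: token k has weight 1/(k+1)."""
    weights = [1.0 / (k + 1) for k in range(vocab_size)]
    total = sum(weights)
    return [w / total for w in weights]

def refit(tokens, vocab_size):
    """Refit the toy model by raw token frequency."""
    counts = Counter(tokens)
    return [counts.get(k, 0) / len(tokens) for k in range(vocab_size)]

def run_loop(human_share, vocab_size=500, sample_size=2000, generations=20, seed=0):
    """Train each generation on a mix of model output and fresh human text.

    human_share=0.0 is the closed loop; larger values keep an external input.
    Returns how many distinct tokens the final model can still produce.
    """
    rng = random.Random(seed)
    vocab = range(vocab_size)
    human = zipf_distribution(vocab_size)
    model = human
    n_human = int(round(human_share * sample_size))
    for _ in range(generations):
        synthetic = rng.choices(vocab, weights=model, k=sample_size - n_human)
        fresh = rng.choices(vocab, weights=human, k=n_human)
        model = refit(synthetic + fresh, vocab_size)
    return sum(p > 0 for p in model)

if __name__ == "__main__":
    for share in (0.0, 0.1, 0.5):
        print(f"human share {share:.0%}: {run_loop(share)} distinct tokens remain")
```

In the closed loop, a lost token is lost for good. With any fresh human text in the mix, rare tokens get a chance to re-enter every generation, so the final model keeps more of the vocabulary. How much fresh input a real system would need is an empirical question this sketch does not answer.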

What I find striking is that absorptive capacity theory is confirming itself through a natural experiment at internet scale. Cohen and Levinthal argued that without fresh external knowledge input, learning systems degrade into narrow self reinforcing patterns. This was a theory about human organizations. It turns out to be equally true about machine learning systems. The mechanism is the same regardless of whether the learner is a firm or a transformer. Prior knowledge shapes what can be absorbed next. If the prior is degraded, every subsequent learning step compounds the degradation. I wrote about this dynamic in more depth when discussing why expensive analytics platforms often deliver nothing even though the technology works. The problem there was organizational. The problem here is technical. But the mechanism is the same absorptive capacity failure.

The lesson is the same one Cohen and Levinthal left for us thirty five years ago. Learning systems need external input that is genuinely new and genuinely diverse. Not more of the same. Not refined versions of previous output. Not synthetic approximations of what real knowledge looked like last year. The cost of ignoring this is visible in the model collapse curves. The tails disappear, the distribution tightens, and the system becomes better at predicting its own narrow output and worse at representing the world it was supposed to model.

I am not sure what this means for the economics of AI training. If synthetic data is cheaper and human generated data is finite, there is a structural incentive to use synthetic data that no warning about model collapse will override. The organizations that resist that incentive and invest in maintaining access to fresh human generated training data will have models with higher absorptive capacity. The organizations that optimize for training cost will end up with narrower models that perform well on evaluation benchmarks and poorly on real edge cases. The divergence will not be visible on standard metrics, because standard metrics measure what the model has been trained to predict, which is exactly what degrades most slowly under model collapse. It will be visible in the long tail. The unexpected failure case. The rare but important pattern that the third generation model never learned to see because the second generation already erased it.

This is what path dependence looks like when the path leads to a narrower distribution instead of a better one.