Almost every large organization has run an AI pilot. Very few of those pilots make it to production. That gap is not a data science problem.
Every major company I can think of has run an AI pilot in the last five years. Most of them have run several. Ask anyone in enterprise technology and they will tell you about the demo that impressed the executives, the committee that approved the budget, the vendor who showed up with slides full of accuracy metrics. And then ask what happened after. Usually the honest answer is: not much.
Gartner and others have repeatedly noted that most AI projects fail to make it past the pilot phase. I do not want to cite a specific percentage because those numbers shift, the methodology behind them varies, and a precise figure would give false confidence to an imprecise problem. The directional claim is what matters. Pilots are common. Production systems are rare. The gap between them is where the interesting problem lives.
The pilot works because of conditions that are invisible at the time. The data team hand-picks a clean, labeled dataset. The use case is narrow enough that a model can actually solve it. There is a motivated person, sometimes a small team, who genuinely cares whether this succeeds. Leadership attention is high, which means resources flow freely and obstacles get cleared. The demo runs on a Tuesday afternoon in a conference room with the VP of operations watching. It works beautifully. Everyone is excited.
None of those conditions exist in production.
In production, the data is what the organization actually has, not what someone cleaned for a demo. Field names are inconsistent. Systems do not talk to each other. A column that was always populated in the sample turns out to be optional in the source database and is blank for thirty percent of real records. The model was trained on last year's data but the business has shifted. Customer behavior changed. A supplier changed their format. A regulation changed what information can be collected. The model starts drifting from its training distribution, and nobody built a monitoring system to catch that, because monitoring was not part of the pilot scope.
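The optional-column problem is cheap to test for before anyone commits to a deployment. What follows is a minimal sketch, not a description of any particular team's tooling: it compares how often each field is blank in the hand-picked pilot sample against a pull of real records. The function names, the `tolerance` default, and the record layout (a list of dictionaries) are all assumptions made for illustration.

```python
from typing import Any, Dict, Mapping, Sequence, Tuple


def missing_rate(records: Sequence[Mapping[str, Any]], field: str) -> float:
    """Fraction of records where `field` is absent, None, or a blank string."""
    if not records:
        raise ValueError("no records to check")
    missing = 0
    for row in records:
        value = row.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing += 1
    return missing / len(records)


def fields_that_degraded(
    pilot: Sequence[Mapping[str, Any]],
    production: Sequence[Mapping[str, Any]],
    fields: Sequence[str],
    tolerance: float = 0.05,  # hypothetical threshold; pick one that fits the use case
) -> Dict[str, Tuple[float, float]]:
    """Return {field: (pilot_rate, production_rate)} for every field whose
    missing rate in production exceeds the pilot rate by more than `tolerance`."""
    flagged = {}
    for field in fields:
        pilot_rate = missing_rate(pilot, field)
        production_rate = missing_rate(production, field)
        if production_rate - pilot_rate > tolerance:
            flagged[field] = (pilot_rate, production_rate)
    return flagged
```

Run against a sample of the source system rather than the demo extract, a check like this surfaces the thirty-percent-blank column during scoping instead of after launch.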
IBM Watson Health is probably the most visible example of this gap at scale. Watson's oncology product was announced with real partnerships with major cancer centers. The promise was that the system could ingest medical literature and clinical data and recommend treatment options. By most accounts, what happened in practice fell well short of that. Clinicians found recommendations that conflicted with local standards of care. The system had reportedly been trained largely on synthetic cases that did not match the messy reality of actual patient records. Some hospitals quietly stopped using it. In January 2022, IBM announced the sale of much of the Watson Health business to the investment firm Francisco Partners. The failure was not that IBM could not build a language model or a clinical reasoning system. It was that the conditions in which those systems were tested did not resemble the conditions in which they had to operate.
I want to be careful not to oversimplify Watson. Large enterprise AI programs are complicated and the press reporting on them is not always reliable. But the broad pattern is not contested: an enormous investment in AI for a high-stakes domain produced disappointing results in practice, not in the lab. The reasons cited repeatedly include data quality, clinician trust, domain adaptation, and the difficulty of generalizing from training conditions to real hospital workflows. These are not data science problems. They are organizational and infrastructure problems.
The concept of MLOps emerged specifically because the field recognized that deploying and maintaining a model is a different job from building one. Data scientists build models. ML engineers deploy them. Data engineers maintain the pipelines that feed them. Monitoring systems watch for drift. Retraining pipelines update the model when the world changes. For most enterprise AI pilots, none of this infrastructure exists. The data scientist who built the model is not responsible for keeping it running. The IT department was not involved in building it. The business team does not know what questions to ask when something starts going wrong. The model was a prototype. Turning it into a product requires an entirely different set of capabilities that the pilot process never developed.
This is where absorptive capacity matters. Cohen and Levinthal (1990) defined absorptive capacity as a firm's ability to recognize the value of new external information, assimilate it, and apply it to commercial ends. Critically, the capacity is path-dependent: prior related knowledge is what allows new knowledge to be absorbed. I wrote about this in more depth in a post about why expensive platforms often deliver nothing, but the short version is that an organization that has never built data pipelines, never maintained production ML systems, and never built routines for acting on model outputs cannot suddenly do all of those things because a pilot succeeded. The prior knowledge is not there. The capacity to absorb what the technology requires has not been built.
This connects to something I think gets underappreciated in the AI investment conversation. When an organization runs a pilot, it is testing whether a model can solve a problem under favorable conditions. But that is not the question. The question is whether the organization can build, deploy, monitor, maintain, and improve a model in its actual operating environment, with its actual data, staffed by people who have other jobs and competing priorities, over a period of years. Those are two completely different questions, and only one of them gets asked during the pilot.
The ERP literature is full of the same pattern. I wrote about why ERP implementations fail structurally rather than technically, and the mechanism is similar. An enterprise system looks workable in a controlled demo. Then the organization's actual data, processes, political structures, and resistance patterns all show up, and the system that worked beautifully in a vendor's training environment fails in practice. The failure was always organizational. The technology was the smaller part.
What would it take for a pilot to have a realistic chance of becoming a product? At minimum, the team building the pilot would need to be thinking about data infrastructure from the start, not as a follow-on problem. They would need to involve the people responsible for the systems that provide data, because those people know where the messiness lives. They would need a monitoring plan that answers "how will we know if this model is degrading?" They would need to understand the retraining loop. And the organization would need to have some absorptive capacity in all of these areas before the pilot begins, because if it does not, the gap between a working demo and a running production system will be larger than the budget for closing it.
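One concrete way to answer "how will we know if this model is degrading?" is to compare the distribution of an input feature (or of the model's scores) in production against the distribution seen at training time. The sketch below uses the population stability index, which is one common choice among several and not something the pilot teams described here necessarily used. The function name, the equal-width binning, and the small floor `eps` that avoids taking the log of zero are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def population_stability_index(
    expected: Sequence[float],  # values seen at training time
    actual: Sequence[float],    # values seen in production
    bins: int = 10,
    eps: float = 1e-6,          # floor on bin shares so log() is always defined
) -> float:
    """PSI = sum over bins of (a - e) * ln(a / e), where e and a are the
    share of training and production values that fall in each bin."""
    if not expected or not actual:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("training values are constant; cannot build bins")
    width = (hi - lo) / bins

    def shares(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the training range are clamped into the end bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

A frequently quoted rule of thumb treats a PSI below roughly 0.1 as stable and above roughly 0.25 as a significant shift, but those cutoffs are conventions rather than guarantees. The harder part of the monitoring plan is organizational: someone has to own the number and have the authority to trigger retraining when it moves.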
I do not think pilots are bad. They answer a real question. Can a model solve this problem? That is worth knowing. The problem is when a successful pilot gets treated as evidence that the organization is ready to run the system, when it is only evidence that a small motivated team could make the model work once under good conditions. Those are not the same thing, and confusing them is how you end up with an expensive model that runs on a laptop in a data scientist's office and never goes anywhere.
The right question after a successful pilot is not "should we deploy this?" It is "can we run this, over time, in our actual environment, with our actual data and our actual staff?" That question is harder to answer. It requires looking honestly at data infrastructure, at organizational processes, at whether the people who would need to maintain and act on the system have any prior related knowledge to draw on. Most pilots never ask it. And most pilots never become products.