Most ML models fail after deployment, not during training. MLOps is the discipline that finally takes production seriously.
There is a gap in how most organizations think about machine learning, and it sits right between the demo and the production system. Building a model that scores well on a held-out slice of a curated dataset is relatively straightforward, or at least it has become more accessible. You pick a framework, clean your data, train something, evaluate it, and show a chart with an impressive accuracy score. The audience nods. Someone asks about deploying it. And that is where things get complicated.
The uncomfortable truth is that most machine learning projects fail not in the lab but after they leave it. A model trained on last year's data gets deployed and runs against this year's data, and the distribution has shifted. Customer behavior changed. The economic environment changed. A supply chain disruption changed the patterns the model was trained to recognize. The model does not know any of this. It keeps generating predictions based on patterns that no longer hold, and unless someone is monitoring it, no one notices until the damage is done. The shift in the inputs is called data drift, the resulting loss of predictive quality is called model degradation, and both are a normal feature of deployed ML systems, not an edge case.
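One way to make "the distribution has shifted" measurable is to compare a feature's values at training time with its values in recent production traffic. The sketch below uses the Population Stability Index, a common drift statistic; it is an illustration chosen here, not a method the tools discussed in this piece prescribe. The function name `psi`, the ten-bin default, and the synthetic samples are all assumptions made for the example.

```python
import math
import random


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference and a recent sample.

    Bin edges come from the reference (training-time) sample, so the
    score answers: how differently does recent data fill those bins?
    """
    lo, hi = min(expected), max(expected)
    if hi == lo:
        raise ValueError("reference sample has no spread")
    width = (hi - lo) / bins

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Values outside the reference range are clamped into the
            # first or last bin instead of being dropped.
            idx = int((v - lo) / width)
            idx = max(0, min(bins - 1, idx))
            counts[idx] += 1
        # Floor each share at eps so the logarithm is always defined.
        return [max(c / len(values), eps) for c in counts]

    p = proportions(expected)
    q = proportions(actual)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Synthetic data, purely for illustration.
random.seed(0)
reference = [random.gauss(0.0, 1.0) for _ in range(5000)]
same_regime = [random.gauss(0.0, 1.0) for _ in range(5000)]
shifted = [random.gauss(1.0, 1.0) for _ in range(5000)]

print("same regime:", round(psi(reference, same_regime), 4))
print("shifted:    ", round(psi(reference, shifted), 4))
```

A widely quoted rule of thumb treats a score below 0.1 as stable and above 0.25 as a shift worth investigating, but those cutoffs are conventions, not guarantees. A check like this only flags that inputs moved; it does not say whether the model's predictions got worse, which still requires labeled outcomes.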
MLOps is the set of practices that bring DevOps discipline to this problem. DevOps spent the last fifteen years solving a version of the same challenge for software engineering: how do you move code from development to production reliably, repeatedly, and with enough monitoring that you can catch problems fast? The answer involved version control, automated testing, continuous integration, continuous deployment, and production monitoring. MLOps borrows all of that and adds a few problems that software engineering did not have.
The first added problem is that models are not just code. A trained model is a combination of code, data, and learned parameters. If you want to reproduce a model, you need the exact dataset it was trained on, the exact version of the training code, and the exact hyperparameters used. Version control for code is a solved problem. Version control for datasets is harder. MLflow is probably the most widely used open-source tool for tracking ML experiments: it records what data you used, what parameters you set, and what metrics you got, and it stores the resulting model artifact so you can retrieve and reproduce it later. Weights & Biases does similar things with a heavier emphasis on visualization and collaboration. These tools exist because without them, the question "what exactly is running in production?" becomes very difficult to answer.
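To make the three ingredients concrete, here is a minimal sketch of the kind of record an experiment tracker keeps for each training run. It is not MLflow's or Weights & Biases' API; the function names, the manifest fields, and the sample values are hypothetical and use only the Python standard library.

```python
import hashlib
import json


def fingerprint_bytes(data: bytes) -> str:
    """Return a SHA-256 hex digest that changes if any byte changes."""
    return hashlib.sha256(data).hexdigest()


def build_run_manifest(dataset_bytes: bytes, code_version: str, hyperparameters: dict) -> dict:
    """Tie a training run to its data, code, and hyperparameters.

    The run_id is derived from all three, so two runs share an id only
    when every ingredient is identical.
    """
    manifest = {
        "dataset_sha256": fingerprint_bytes(dataset_bytes),
        "code_version": code_version,  # e.g. a git commit hash
        "hyperparameters": hyperparameters,
    }
    # sort_keys makes the serialization independent of dict ordering.
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    manifest["run_id"] = fingerprint_bytes(canonical.encode("utf-8"))[:16]
    return manifest


# Hypothetical inputs, for illustration only.
dataset = b"customer_id,churned\n1,0\n2,1\n"
manifest = build_run_manifest(dataset, "abc1234", {"learning_rate": 0.1, "max_depth": 6})
print(json.dumps(manifest, indent=2))
```

Storing a manifest like this next to the model artifact is what lets someone answer "what exactly is running in production?" months later. Real trackers add much more (metrics, environment details, artifact storage), and hashing a dataset only identifies it; the data itself still has to be kept somewhere retrievable.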
The second added problem is the two-language problem. Data scientists typically work in Python. Production systems often run in Java, Go, C++, or some combination. The model gets trained in a Python notebook, performs well, and then someone has to figure out how to serve it in an environment where the Python dependencies may not be available or may conflict with other services. One solution is to containerize the model server using Docker and orchestrate it with Kubernetes, which is what Kubeflow is built to help with. Another is to export the model to a format like ONNX (Open Neural Network Exchange) that can be loaded by runtimes in other languages. A third is to use a managed ML platform like Amazon SageMaker, which handles the deployment infrastructure but introduces vendor dependency. None of these solutions is free of trade-offs, and the choice tends to depend on what the organization's engineering team knows how to operate.
The third added problem is the organizational one, and I think it is harder than either of the technical ones. When a model is running in production, who owns it? The data scientists who built it are often already working on the next project. The engineering team that deployed it may not understand what the model is doing or how to evaluate whether it is still working correctly. The business team that uses its outputs may not know it is a model at all. This creates a situation where nobody is clearly accountable for monitoring model behavior, investigating anomalies, or deciding when the model needs to be retrained.
Gartner has written about the challenges of operationalizing AI at scale, and their public commentary (available at https://www.gartner.com/en/newsroom) suggests that a significant portion of AI projects do not make it to sustained production deployment. I am hedging that claim because the specific numbers they cite depend on the research methodology and I have not verified the exact figures, but the directional observation is consistent with what practitioners report in public forums, conference talks, and the general body of industry writing on the subject. Building something that works in a notebook is achievable. Keeping something working in production over months and years is a different discipline.
Continuous retraining is one piece of the answer. If data drift is the problem, then retraining the model on recent data is the obvious response. But retraining introduces its own risks. The new model might perform better on recent data while performing worse on older patterns that are still relevant. It might introduce biases that were not in the original training data. It might change its predictions in ways that are jarring to users who have learned to trust the old behavior. Automated retraining pipelines need automated validation steps that catch these regressions before they reach users. This is the continuous integration and continuous deployment pattern applied to models, and it requires investment in tooling, testing, and monitoring infrastructure that most organizations underestimate.
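A validation step of this kind can be sketched as a promotion gate that compares the retrained model against the one currently serving. Everything here is an assumption made for illustration: the function names, the idea of scoring each model on named evaluation slices (recent data, older data, specific user groups), and the default thresholds. Metrics are assumed to be "higher is better".

```python
from dataclasses import dataclass, field


@dataclass
class GateResult:
    promote: bool
    reasons: list = field(default_factory=list)


def flip_rate(old_predictions, new_predictions):
    """Fraction of reference examples whose prediction changed."""
    if len(old_predictions) != len(new_predictions):
        raise ValueError("prediction lists must have the same length")
    if not old_predictions:
        return 0.0
    changed = sum(1 for o, n in zip(old_predictions, new_predictions) if o != n)
    return changed / len(old_predictions)


def validate_candidate(current_metrics, candidate_metrics,
                       old_predictions, new_predictions,
                       max_regression=0.01, max_flip_rate=0.10):
    """Block promotion if any slice regresses or predictions churn too much."""
    reasons = []
    for slice_name, baseline in current_metrics.items():
        if slice_name not in candidate_metrics:
            reasons.append(f"missing metric for slice '{slice_name}'")
            continue
        drop = baseline - candidate_metrics[slice_name]
        if drop > max_regression:
            reasons.append(f"slice '{slice_name}' regressed by {drop:.3f}")
    rate = flip_rate(old_predictions, new_predictions)
    if rate > max_flip_rate:
        reasons.append(f"{rate:.0%} of reference predictions changed")
    return GateResult(promote=not reasons, reasons=reasons)


# Hypothetical numbers: better on recent data, worse on older patterns.
current = {"recent": 0.80, "historical": 0.90, "group_a": 0.85}
candidate = {"recent": 0.86, "historical": 0.82, "group_a": 0.85}
result = validate_candidate(current, candidate, [1, 0, 1, 1, 0], [1, 0, 1, 1, 0])
print(result.promote, result.reasons)
```

The per-slice comparison targets the first two risks named above (a model that improves on recent data while slipping on older patterns or on a particular group), and the flip rate is a rough proxy for the third, predictions changing in ways users will find jarring. Sensible thresholds depend on the application and are not something a sketch can supply.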
What surprises me about where the industry is now is that the tools have gotten significantly better in the last few years, but the organizational problems have not improved at the same pace. MLflow, SageMaker, and Kubeflow have made the technical plumbing more tractable. The questions of who owns the model, who monitors it, who decides when it needs to be retrained, and who is accountable when it makes a bad decision that affects a customer or a patient or an applicant are still being worked out in most organizations. The discipline does not fully exist yet. MLOps is trying to create it.