McKinsey Says Software Engineering Is GenAI's Best Opportunity. The DORA Data Shows Why Teams Are Still Struggling.

The McKinsey State of AI 2025 report contains a figure I keep returning to: GenAI's economic potential is estimated at $2.6 trillion to $4.4 trillion annually across all sectors. Software engineering is among the domains with the highest concentration of that potential, alongside customer operations and R&D. The reasoning behind the software engineering case is clean enough that I can see why executives find it compelling. Engineers spend a significant fraction of their time on mechanical tasks: writing boilerplate, generating tests, reviewing code for patterns, producing documentation. AI tools can do all of those things quickly. Throughput goes up, cost per feature goes down, and a trillion-dollar opportunity becomes plausible arithmetic.

Then I read the DORA 2024 report, and the arithmetic gets complicated. DORA found that AI adoption does correlate positively with software delivery throughput. The speed signal is real. But AI adoption also correlates negatively with delivery stability. The change failure rate, the share of deployments that cause production incidents or require rollback, goes up for teams that adopted AI coding tools without investing in the infrastructure that catches AI-generated mistakes. The McKinsey number describes a ceiling. The DORA data describes the conditions under which most teams are failing to reach it.

The measurement problem makes this worse. "Developer productivity" is one of the most contested concepts in software engineering, and GenAI tools have made it harder to measure rather than easier. Lines of code is a terrible proxy in any context. A developer who generates 800 lines of AI-assisted output in an afternoon has not necessarily produced more value than one who spent that afternoon refactoring 40 lines of legacy code that had been causing obscure production failures for six months. AI tools inflate line counts, commit counts, and pull request rates in ways that look like productivity on a dashboard and do not necessarily correspond to system reliability, software quality, or business outcomes. Organizations that are tracking GenAI value by counting commits are measuring the wrong thing. I think many of them know this and count commits anyway because it is the metric they have.

The right measure has always been outcomes, not outputs. DORA's four key metrics point in that direction: deployment frequency and lead time for changes measure speed, while change failure rate and time to restore service measure whether that speed is sustainable and safe. A team that deploys twenty times a day with a 30 percent change failure rate is not a high-performing team. It is an unstable team that ships about six failed deployments a day and generates a lot of recovery work. The DORA 2024 finding that AI adoption hurts delivery stability is another way of saying that AI tools, introduced into teams without strong automated testing and fast feedback infrastructure, push teams toward that unstable pattern. They get the deployment frequency. They also get the change failures.
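To make the outcomes-versus-outputs distinction concrete, here is a minimal Python sketch of how the four metrics could be computed from a team's deployment log. The `Deployment` record, its field names, and the `dora_metrics` helper are hypothetical, and using the median for lead time and restore time is my own simplifying assumption, not DORA's official methodology. The sample data reproduces the team described above: twenty deployments in one day, six of which fail.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    # Hypothetical record shape; real deployment logs will differ.
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # caused an incident or needed rollback
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def dora_metrics(deployments, window_days):
    """Summarize the four key metrics over a window of deployments."""
    if not deployments:
        raise ValueError("need at least one deployment")
    failures = [d for d in deployments if d.failed]
    restore_times = [
        d.restored_at - d.deployed_at
        for d in failures
        if d.restored_at is not None
    ]
    return {
        # Speed metrics
        "deploys_per_day": len(deployments) / window_days,
        "median_lead_time": median(
            d.deployed_at - d.committed_at for d in deployments
        ),
        # Stability metrics
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restore_times) if restore_times else None,
    }


# Illustrative data only: twenty deployments in one day, six of them failing.
start = datetime(2025, 1, 6, 9, 0)
deployments = []
for i in range(20):
    deployed = start + timedelta(minutes=30 * i)
    failed = i < 6
    deployments.append(
        Deployment(
            committed_at=deployed - timedelta(hours=2),
            deployed_at=deployed,
            failed=failed,
            restored_at=deployed + timedelta(minutes=45) if failed else None,
        )
    )

metrics = dora_metrics(deployments, window_days=1)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

A dashboard that shows only the first two entries would rate this team highly. The last two entries are what reveal the six recoveries it has to perform every day.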

Cohen and Levinthal (1990) defined absorptive capacity as an organization's ability to recognize, assimilate, and apply new external knowledge, and they argued that this ability depends heavily on prior related knowledge already accumulated in the organization. I think this framing explains the McKinsey-to-DORA gap more precisely than any purely technical explanation can. Teams that have accumulated strong software engineering discipline (good test coverage, reliable CI/CD pipelines, and fast feedback loops) have the prior related knowledge that makes AI code generation safe. The generated code flows through a filter the team already built. Teams that have skipped those investments have no filter. They get the acceleration without the quality control. The McKinsey potential is real for teams in the first category. The DORA instability finding describes teams in the second category, which is most teams.

Gartner adds a sobering layer to this. They project that agentic AI will appear in 40 percent of enterprise applications by the end of 2026, and that more than 40 percent of agentic AI projects will be canceled by the end of 2027. That predicted cancellation rate does not surprise me. Teams that are still struggling to get consistent value from AI code completion are being sold autonomous AI coding agents that are supposed to handle entire features without human direction. The gap between a well-supervised code suggestion and an autonomous coding agent operating on a complex production codebase is enormous. The failure modes are different in kind, not just in degree. An autocomplete tool makes a wrong suggestion; the developer rejects it. An autonomous agent takes a sequence of wrong actions across multiple files before the developer notices something is wrong.

The organizational conditions that separate teams capturing GenAI value from teams generating AI-assisted technical debt are not mysterious. DORA has been identifying them for years. Automated testing with meaningful coverage. Fast build and deployment pipelines that run on every commit. Strong code review culture where the review is substantive, not ceremonial. Observability tooling that surfaces anomalies in production within minutes, not days. These conditions were the markers of high-performing engineering organizations before AI tools existed. They are now also the conditions that determine whether AI tools create value or risk.

What I find troubling as an IS researcher is that market pressure is creating exactly the wrong sequence. Organizations feel competitive pressure to show AI adoption in their engineering teams. The visible evidence of AI adoption is buying AI coding tools, which is easy. The invisible precondition for those tools working safely is building the engineering infrastructure that catches their mistakes, which is hard and slow and does not make for compelling board updates. The result is that many organizations are spending on AI tools before spending on the organizational conditions that make those tools safe. The McKinsey ceiling remains far away, the DORA instability signal keeps appearing in their incident logs, and the gap between potential and realization persists. IS research has good theories for explaining this pattern. I am not sure we have yet built the applied frameworks that help organizations avoid it.

---
claims_checked:
- "McKinsey State of AI 2025: GenAI economic potential $2.6T-$4.4T annually": "https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai"
- "McKinsey State of AI 2025: 88% use AI, 7% fully scaled, GenAI adoption 79%, 23% scaling agents": "https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai"
- "DORA 2024: AI adoption improves throughput but hurts delivery stability": "https://dora.dev/research/2024/dora-report/"
- "DORA 2024: AI accelerates but exposes weaknesses without automated testing, version control, fast feedback": "https://dora.dev/research/2024/dora-report/"
- "Gartner: agentic AI in 40% of enterprise apps by end 2026, up from less than 5%": "https://www.gartner.com/en/newsroom/press-releases/2026-04-07-gartner-forecasts-worldwide-it-spending-to-grow-9-8-percent-in-2026"
claims_unverified:
- "Cohen and Levinthal (1990) absorptive capacity: drawn from IS theory background, not re-fetched this session"
- "Gartner 40%+ agentic AI project cancellation by 2027: cited from Gartner reports referenced in verified data bank but specific press release not re-fetched"
sources_used:
- "https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai"
- "https://dora.dev/research/2024/dora-report/"
- "https://www.gartner.com/en/newsroom/press-releases/2026-04-07-gartner-forecasts-worldwide-it-spending-to-grow-9-8-percent-in-2026"