Copilot Generated More Code and More Bugs

Thirty percent. That is the acceptance rate GitHub reports for Copilot suggestions: for about three out of every ten completions, the developer presses Tab. The number appears in case studies and vendor presentations as a success metric, evidence that the tool is working. And every time I see it, I think about Burton-Jones and Grange (2013) and what they would say about measuring AI coding tools this way.

I wrote recently about the difference between adoption and effective use, and the argument applies to AI code generation more directly than I realized when I wrote that post. The thirty percent acceptance rate measures adoption. It tells you how often a developer chose to accept a suggestion. It does not tell you whether the accepted code faithfully represents the domain the codebase was designed to support. It does not tell you whether the generated function aligns with the architecture, the business logic, or the security requirements. It tells you that Tab was pressed. That is surface structure.

Burton-Jones and Grange grounded their theory of effective use in representation theory. Every information system has three structures. Deep structure is the system's representation of the real-world domain. Surface structure is the interface through which users reach that representation. Physical structure is the technological implementation. Effective use happens when the user engages the deep structure, not just the surface structure. A CRM user who opens the application and clicks through tabs engages the surface structure. A CRM user who uses the application to understand and act on customer relationships engages the deep structure. Same application. Different use.

Now map this to Copilot. The developer opens an editor, types a comment or a function signature, and Copilot suggests a completion. The developer presses Tab. The suggestion is syntactically correct. It compiles. It passes the unit tests. But does the generated code represent the domain faithfully? Does it handle the edge case the business analyst documented in the requirements spec? Does it use the right internal API, the one that triggers side effects the developer forgot about? Does it respect the caching strategy, the rate limits, the data validation rules, the security boundaries that exist in the codebase because of past incidents?

The answer is often no. Industry data suggests AI-assisted developers produce more code but also introduce more bugs. SonarQube data shows AI-generated code has higher churn, meaning it gets rewritten more often. This is not a bug in the model. It is a structural limitation. Large language models are trained on patterns. They are excellent at surface structure: syntax, common patterns, boilerplate. They are unreliable at deep structure: the specific architectural decisions, trade-offs, and requirements encoded in a particular codebase. A model does not know why a particular convention exists. It only knows that the convention appears in its training data with some probability.

What the thirty percent acceptance rate actually measures is how often Copilot produces surface structure that looks right to the developer at a glance. It does not measure whether the developer verified the suggestion against the domain. And Burton-Jones and Grange would say that verification is the essence of effective use. Using a system effectively means using it in a way that faithfully appropriates the domain representations the system was designed to support. A developer who accepts a Copilot suggestion without tracing its implications through the codebase is interacting with the surface structure. The code compiles. The tests pass. The deep structure might be wrong.

I think the practical problem is that organizations are measuring the wrong thing. They track lines of code generated, pull request velocity, acceptance rates. These are adoption metrics. They tell you whether developers are using the tool, not whether they are using it well. And the tools themselves encourage this. Copilot surfaces completions inline, one Tab away from insertion. The friction is low. The cognitive cost of evaluating a suggestion against the codebase is high. The developer has to stop, think about whether this code actually belongs here, trace the implications, and potentially reject a syntactically perfect suggestion that is semantically wrong. That takes time. That takes domain knowledge. That takes the thing the developer is trying to offload.

A lot of the conversation around AI coding tools frames productivity as output. More code per hour, more pull requests per week. Burton-Jones and Grange explicitly distinguished effective use from efficiency. Faithful domain representation is not speed, and it is not output quantity. If the generated code introduces bugs that are caught in review or in production, the output numbers are misleading. The code was produced fast, but it was not produced faithfully. The time saved at the generation step is spent later in debugging and rewriting, which is exactly what higher churn rates suggest is happening. I wrote about the same dynamic in the context of vibe coding, where the speed of generation creates downstream costs that nobody accounts for.

Baird and Maruping (2021) reformulated the use construct entirely for agentic systems, arguing that delegation replaces use when the system acts on behalf of the user. Copilot is not an agentic system in the strict sense; it does not autonomously act. But the direction of travel is worth watching. When a developer accepts a completion without understanding it, they are functionally delegating part of the coding decision to the model. The acceptance rate metric treats every suggestion equally, but not every suggestion carries the same weight. A boilerplate getter function accepted without review is low stakes. A database query that the developer did not fully understand is high stakes. The acceptance rate collapses both into a single number. I wrote about why measuring delegation rather than use matters for agentic systems, and I think the same principle applies here at a smaller scale. Not every acceptance is a use, and not every use is an effective use.
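To make the collapsing concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `Suggestion` record, the `STAKES` weights, and the two logs are made up, and nothing here reflects telemetry that Copilot actually exposes. It only shows that two sessions can report the same acceptance rate while carrying very different amounts of unreviewed risk.

```python
from dataclasses import dataclass

# Hypothetical stake weights; a real team would derive these from its own incident history.
STAKES = {"boilerplate": 1, "db_query": 5}


@dataclass(frozen=True)
class Suggestion:
    kind: str       # category of the suggested code (key into STAKES)
    accepted: bool  # developer pressed Tab
    reviewed: bool  # developer traced the suggestion against the codebase


def acceptance_rate(log):
    """Share of suggestions that were accepted, regardless of what they were."""
    return sum(s.accepted for s in log) / len(log)


def unreviewed_exposure(log):
    """Sum of stake weights over suggestions accepted without review."""
    return sum(STAKES[s.kind] for s in log if s.accepted and not s.reviewed)


def make_log(kind):
    """Illustrative session: 3 of 10 suggestions accepted, none reviewed."""
    accepted = [Suggestion(kind, True, False)] * 3
    rejected = [Suggestion(kind, False, False)] * 7
    return accepted + rejected


getters = make_log("boilerplate")
queries = make_log("db_query")

print(acceptance_rate(getters), unreviewed_exposure(getters))
print(acceptance_rate(queries), unreviewed_exposure(queries))
```

Both sessions print an acceptance rate of 0.3, but the exposure is 3 for the getters and 15 for the queries. The weights are arbitrary; the gap between the two numbers is the point.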

If I were advising a team adopting Copilot or any AI coding assistant, I would say the same thing Burton-Jones and Grange said about enterprise systems more than a decade ago. Stop measuring adoption and start asking whether the tool is helping developers represent their domain more faithfully. The metric is not acceptance rate. It is whether the generated code survives review without structural revisions. It is whether the bug rate per line of generated code is comparable to or lower than the bug rate per line of hand-written code. It is whether the developer understands why the generated code works, not just that it works.
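The first two of those measures are simple arithmetic once changes are labeled by origin. The sketch below assumes a hypothetical `Change` record in which each merged change is tagged as generated or hand-written, reviewers flag structural revisions, and defects are traced back to the change that introduced them. The sample numbers are invented purely to exercise the functions; they are not measurements.

```python
from dataclasses import dataclass


@dataclass
class Change:
    origin: str                # "generated" or "hand_written" (hypothetical labels)
    lines: int                 # lines added by the change
    structural_revision: bool  # reviewer required a structural rewrite
    bugs: int                  # defects later traced back to this change


def review_survival_rate(changes, origin):
    """Fraction of changes of this origin that passed review without structural revision."""
    subset = [c for c in changes if c.origin == origin]
    if not subset:
        return None
    return sum(1 for c in subset if not c.structural_revision) / len(subset)


def bugs_per_kloc(changes, origin):
    """Defects per 1,000 added lines for changes of this origin."""
    subset = [c for c in changes if c.origin == origin]
    total_lines = sum(c.lines for c in subset)
    if total_lines == 0:
        return None
    return 1000 * sum(c.bugs for c in subset) / total_lines


# Invented sample data, for illustration only.
changes = [
    Change("generated", 200, False, 1),
    Change("generated", 300, True, 3),
    Change("generated", 500, False, 2),
    Change("hand_written", 400, False, 1),
    Change("hand_written", 600, False, 1),
]

for origin in ("generated", "hand_written"):
    print(origin, review_survival_rate(changes, origin), bugs_per_kloc(changes, origin))
```

The third measure, whether the developer understands why the code works, does not reduce to a function like these. That is part of why it tends to go unmeasured.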

The productivity gains from AI coding tools, I think, will not come from writing more code faster. They will come from developers spending less time on surface structure and more time on deep structure. Less time on boilerplate, naming conventions, and standard patterns that the model can handle. More time on architecture, trade-offs, and requirements alignment. That is exactly what Burton-Jones and Grange would predict. Use is not adoption. Effective use is not speed. And the tool that writes code faster is only useful if the code it writes is worth keeping.

```json
{
"@context": "https://schema.org",
"@type": "Article",
"headline": "Copilot Generated More Code and More Bugs",
"author": {
"@type": "Person",
"name": "Ali Safari",
"url": "https://alisafari.space"
},
"datePublished": "2026-05-14",
"description": "A thirty percent acceptance rate tells you about adoption, not whether the code solves the problem. Burton-Jones and Grange defined effective use as faithful domain representation, not output volume.",
"keywords": ["AI coding tools", "effective use", "Burton-Jones and Grange", "Copilot", "software quality", "AI adoption", "developer productivity", "information systems"]
}
```

```json
{
"@context": "https://schema.org",
"@type": "FAQPage",
"mainEntity": [
{
"@type": "Question",
"name": "Why is Copilot's 30% acceptance rate a misleading metric?",
"acceptedAnswer": {
"@type": "Answer",
"text": "The acceptance rate measures adoption, not effective use. It captures how often a developer pressed Tab on a suggestion, but does not evaluate whether the generated code faithfully represents the codebase architecture, business logic, or security requirements. A high acceptance rate can coexist with bug-prone, high-churn code."
}
},
{
"@type": "Question",
"name": "How does Burton-Jones and Grange's effective use theory apply to AI code generation?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Burton-Jones and Grange defined effective use as faithful domain representation through three structures: deep, surface, and physical. AI coding tools like Copilot handle surface structure well (syntax, patterns) but struggle with deep structure (architecture, trade-offs, requirements). Developers who accept suggestions without verifying them against the domain are not using the tool effectively."
}
},
{
"@type": "Question",
"name": "What should organizations measure instead of AI coding tool acceptance rates?",
"acceptedAnswer": {
"@type": "Answer",
"text": "Organizations should measure whether generated code survives review without structural revisions, compare bug rates between generated and hand-written code, and assess whether developers understand why the code works. The goal is evaluating whether the tool helps developers represent their domain more faithfully, not whether they accept suggestions quickly."
}
}
]
}
```