AI Codes Half of GitHub. Measured Productivity Is Nearly Flat

METR published a new survey last month. They asked 349 technical workers—software engineers, researchers, academic PhD students, founders—how AI tools had changed their productivity since early 2025. The median self-reported gain: somewhere between 1.4 and 2x. Meanwhile, METR’s own controlled experiments have shown something much flatter. Their 2026 cohort of 57 developers came in at around a 4% slowdown, a confidence interval stretching from -15% to +9%—not distinguishable from zero. When they tried to design the next round of controlled trials, 30 to 50 percent of developers refused to do tasks without AI access, making a proper control group nearly impossible to recruit.

That last detail is the one worth sitting with. The measurement problem isn’t just methodological anymore. Developers have become genuinely unwilling to test life without these tools. And yet the measured gains stay stubbornly close to flat.

Why the Surveys Don’t Measure What You Think

Survey-based productivity data measures something real—but it isn’t throughput. It’s how people feel about doing work with assistance versus without. METR explicitly flagged this in their May 2026 paper: prior research found respondents overestimated AI’s actual time impact by around 40 percentage points when asked retrospectively. Memory of struggling through hard problems without a capable autocomplete inflates the perception of the current experience.

The self-reported numbers are also very high. A median of 1.4 to 2x value gain sounds significant. Notably, METR’s own staff—the people who think hardest about how to measure this—reported lower gains than any other group in the study. Understanding the limits of a tool tends to make you more calibrated about what it’s actually delivering.

And it is delivering something. The controlled study results moved from -19% in 2025 to roughly -4% in 2026. The direction is right. It’s just not 1.4x.

Where the Gains Actually Show Up

The clearest wins are on the mechanical parts of software work: boilerplate API clients, CRUD scaffolding, test fixtures, migration scripts, type definitions, documentation drafts. Tasks that have a well-known shape, written thousands of times before, well-represented in training data. McKinsey research on routine coding tasks found a 46% time reduction in that category. That’s real, and it compounds across a project.

The gains compress fast when the work is novel, domain-specific, or requires reasoning about behavior across a system over time. A BLE peripheral profile built around a specific hardware device. A sync engine for a local-first app that needs to handle conflicts correctly on low-end tablets without network access. A regulated healthcare workflow with specific data residency requirements. On work like that, an experienced developer with an AI assistant is mostly still that experienced developer—using the tool for sections that look like boilerplate and doing the domain reasoning themselves.

What actually happens: the AI generates fast, and the developer reviews. But review takes longer for code you didn’t write yourself, especially when you need to verify it fits a domain the model doesn’t actually know. That’s probably why the controlled experiments keep landing near flat. The generation speed and the review overhead partially cancel.

There’s also a skill asymmetry worth naming. Junior developers with AI assistance often outperform expectations on familiar patterns. Senior developers reviewing AI output in unfamiliar territory often catch problems that less experienced reviewers would miss—but the catch takes time. The same tool produces very different productivity profiles depending on who’s holding it and what they’re building.

What to Ask When Evaluating a Development Partner

If a studio pitches AI as a productivity multiplier, ask the obvious follow-up: faster at what, specifically? Generating scaffolding? Yes, meaningfully. Designing a data model that fits your actual constraints? No. Catching the subtle bugs in their own AI’s output? Still requires a senior engineer who knows what correct looks like.

The honest version of “we use AI-assisted development” means the mechanical parts get handled faster, freeing engineering attention for the parts that require judgment. It does not mean you’re getting twice as much engineering for the same cost. On genuinely custom work—software built around a specific domain, hardware integration, complex synchronization, regulated environments—the hard part has never been typing the code. It’s been making the right decisions, understanding specific constraints, and knowing which tradeoffs to take. That’s still human work.

We’d push back on any shop promising significant cost reduction through AI on a complex custom build. The actual value is narrower but real: faster boilerplate, better test coverage, quicker documentation, more consistent code review. Those compound over a project. They don’t change the nature of hard engineering problems.

The honest signal: a studio that can tell you exactly where they use AI tooling and why is telling you they’ve actually thought about it. One that just says “we use AI” as a blanket claim has probably thought about it less.

What We Actually Do

We use AI tooling throughout our work. For our AI-assisted engineering engagements it’s explicit—part of the workflow from day one. Elsewhere it’s quieter: code completion during development, test generation, first-pass documentation, security pattern checking during review.

Where we don’t lean on it: system architecture, API design, anything where domain-specific constraints need to be reasoned through carefully. We’re not asking a language model to design a sync protocol for a local-first app with unusual conflict semantics. That’s precisely what clients hire us to think through.

The METR result we keep coming back to isn’t the productivity claim. It’s the 30 to 50 percent refusal rate. Developers have decided these tools are table stakes, before the measurement science has confirmed the magnitude of the benefit. That’s fine—the tools are genuinely useful and the direction is right. But if you’re evaluating a development partner and the pitch leans hard on AI productivity multipliers, it’s worth asking which specific part of the work gets faster, and by how much. The best answer will be a specific one.

If you’re thinking through what a build looks like with modern tooling baked in, get in touch.

Why the Surveys Don’t Measure What You Think

Where the Gains Actually Show Up

What to Ask When Evaluating a Development Partner

What We Actually Do

Building something like this?