How good are coding agents today? What do the benchmarks tell us?
When recent leaderboard results on SWE-Bench Verified crossed roughly 93%, half my feed celebrated and the other half panicked. Both halves were misreading the number.
A score that high on a benchmark sounds like an indictment or a vindication of the profession, depending on which group chat you read. It is neither. The score measures something quite specific. Once you understand what it actually captures, you can read every future benchmark result without falling into either camp, and you can use the trajectory of those scores as a real signal about where the work is heading.
What SWE-Bench actually is
SWE-Bench was introduced in 2023 by Princeton researchers. The setup is simple. They scraped real GitHub issues from popular open-source Python repositories like Django, scikit-learn, and sympy. For each issue, they recorded the bug description and the patch that maintainers actually shipped to fix it. The benchmark hands a model the issue text and the repository, and asks it to produce a patch. Success is defined by passing a set of hidden benchmark tests derived from the original issue and the maintainer’s patch.
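To make the unit concrete, here is roughly what one task record contains. This is a sketch: the field names follow the public SWE-Bench dataset schema, but the repository, id, and test paths are invented for illustration.

```python
# Hypothetical SWE-Bench task record. Field names mirror the public dataset
# schema; every value here is invented.
task = {
    "instance_id": "example__library-12345",        # hypothetical task identifier
    "repo": "example/library",                      # hypothetical GitHub repository
    "base_commit": "abc1234",                       # commit the issue was filed against
    "problem_statement": "public_method() returns the wrong value when ...",
    "FAIL_TO_PASS": ["tests/test_core.py::test_broken_case"],     # must pass after the fix
    "PASS_TO_PASS": ["tests/test_core.py::test_existing_cases"],  # must keep passing
}
```

The model sees the problem statement and the repository at the base commit. The two test lists stay hidden until grading.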
The original SWE-Bench had 2,294 of these tasks. SWE-Bench Verified, released by OpenAI in August 2024, is a hand-curated subset of 500 tasks that human annotators confirmed were well-specified and solvable. SWE-Bench Pro, released in 2025, was built specifically to be harder and contamination-resistant: new repositories, longer tasks, fresh content the models almost certainly have not seen.
A typical task looks like this. A maintainer of a popular library opens an issue: under a specific input pattern, one of the library’s public methods returns the wrong value, while the common cases still work correctly. The fix turns out to be about a dozen lines in a single file. The hidden test suite verifies the broken case is now handled and confirms the cases that already worked still pass.
That is the unit of work the model is being graded on. Not “should we redesign this API.” Not “what is the better abstraction here.” A scoped bug fix, with the test already written, in a repo whose conventions are visible.
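The grading itself is mechanical. Here is a minimal sketch of that loop, assuming a clean checkout at the task's base commit and pytest-style test ids; the real evaluation adds containerized environments and per-repository install steps.

```python
import subprocess

def grade(repo_dir: str, model_patch: str, fail_to_pass: list[str],
          pass_to_pass: list[str]) -> bool:
    """Apply the model's patch, run the hidden tests; pass means all green."""
    applied = subprocess.run(
        ["git", "apply", "-"], cwd=repo_dir,
        input=model_patch, text=True, capture_output=True,
    )
    if applied.returncode != 0:
        return False  # the patch does not even apply cleanly

    # Fail-to-pass tests check the reported bug is fixed; pass-to-pass tests
    # check that nothing which already worked has regressed.
    result = subprocess.run(
        ["python", "-m", "pytest", *fail_to_pass, *pass_to_pass],
        cwd=repo_dir, capture_output=True,
    )
    return result.returncode == 0
```

Nothing in that loop rewards design judgment. It rewards a patch that makes the listed tests go green.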
The contamination problem
Here is where the flattering Verified numbers start to come apart. The repositories are public. The issues are public. The patches that fixed them are public. That combination creates real contamination risk for any model trained on broad web data, and OpenAI’s own audit of Verified found evidence of prior exposure in frontier models, with some portion of tasks effectively studied in advance.
SWE-Bench Pro exists precisely to address this. Top scores on the standardized public Pro leaderboard sit in the mid-40s, well shy of the Verified numbers, with vendor-reported custom-scaffold runs landing higher. Pro is the benchmark to watch when you want to know what a model can do on problems it genuinely has not seen. Verified is the benchmark to watch when you want to know what marketing departments will quote.
Both numbers are real. They just measure different things. Any engineer evaluating these tools should weight Pro more heavily, because contamination puts a thumb on the Verified scale and that thumb is not small.
That pattern holds across the frontier models. Every model with a published score on both leaderboards posts a materially lower number on Pro than on Verified, often by 30 points or more.
What 93% does not mean
The part that genuinely surprises engineers, once they read the methodology, is the human comparison. OpenAI’s annotators estimate that 91% of SWE-Bench Verified tasks would take a human expert less than an hour each to solve, with roughly 39% rated as trivial fixes taking under fifteen minutes. The benchmark is mostly small, well-scoped bug fixes. It is not “design the auth system.” It is not “debug the production incident with a colleague on a call.” It is not “decide what to build next quarter.”
This is not a criticism of the benchmark. A benchmark needs constrained tasks with verifiable success criteria; otherwise it is not a benchmark. But it does mean the score answers a very specific question.
Run the same setup on a senior engineer dropped cold into one of these repos, with no internet, no Stack Overflow, no colleague to ask, no git blame to consult, and no familiarity with the codebase, and the resulting score would land well below their normal output. We do not have rigorous human numbers for this counterfactual, and that absence is itself part of the point: the benchmark does not measure what a human in a normal working environment can do. The gap between a model in the 90s on Verified and a hypothetical human under the same constraints does not measure how much better the AI is at engineering. It measures how much faster the AI can pattern-match against training data on a tightly scoped problem under conditions that strip away most of what a human engineer would normally bring to the work.
Strip the constraints away and you strip away most of what the score is capturing. Real engineering happens with internet access, colleagues, git history, repo familiarity, and the slow accumulation of context that benchmarks deliberately exclude. The benchmark is a useful instrument precisely because it controls for those variables. Treating it as a verdict on the profession means mistaking the lab conditions for the world.
Scaffolding does most of the work
The single most under-appreciated finding in the benchmark landscape is this: the same model produces wildly different scores depending on the system around it.
Run Claude Opus 4.5 inside three different agent harnesses and you get SWE-Bench results ranging from 50.2% to 55.4%. That is a 5.2 point swing from changing nothing about the model. The tools the agent can call, the way it plans, the way it retries, the way the context is managed, the way tests are surfaced as feedback: these account for as much variance as a model-generation upgrade.
The implication is direct. When a vendor pitches you a coding tool, the model name is the least informative thing on the slide. The scaffolding is the product. How the harness decomposes the task, how it plans, how it retries when tests fail, how it surfaces test output back into the next attempt, how it manages context across long runs, which tools it routes to and when, what guardrails sit between the model and the file system: that orchestration layer is where the gains live, and it is where vendors actually compete once everyone has access to the same underlying models. Two teams using the same underlying model can ship dramatically different outcomes based purely on what they wrap around it.
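As a rough illustration of why, here is a minimal sketch of the retry loop at the core of such a harness. The generate_patch and run_tests callables are hypothetical stand-ins for the model call and a sandboxed test runner; real harnesses layer planning, tool routing, and context management on top of this.

```python
def attempt_issue(issue_text: str, repo_dir: str, generate_patch, run_tests,
                  max_attempts: int = 3):
    """Retry loop: feed failing test output back into the next attempt."""
    feedback = ""  # test output from the previous attempt
    for _ in range(max_attempts):
        patch = generate_patch(issue_text, repo_dir, feedback)
        passed, test_output = run_tests(repo_dir, patch)
        if passed:
            return patch
        feedback = test_output  # surface the failure into the next prompt
    return None  # out of attempts; the task is scored as a failure
```

Change the retry budget, the feedback format, or how context gets trimmed between attempts, and the score moves without touching the model.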
This is the right way to read the leaderboard. A 55% score from one harness and a 50% from another on the same model is not a model comparison. It is a tooling comparison. Anyone shipping AI development workflows in production is already learning this the expensive way.
The METR result
Now to the part that goes hardest against the hype.
In July 2025, METR, an AI research nonprofit, ran a randomized controlled trial. Sixteen experienced open-source developers. 246 real tasks in their own repositories, the kind of work they would have done anyway. Half done with Claude 3.5 and 3.7 Sonnet and Cursor Pro. Half done without. Randomized assignment to control for difficulty.
Before the study, the developers predicted AI would make them 24% faster. After the study, the same developers reported a 20% perceived speedup. The actual measured result: 19% slower. Not 19% faster with caveats. Slower.
The reasons are exactly the things benchmarks hide. Review overhead: every AI suggestion has to be evaluated before it ships, and evaluation costs time. Context switching: humans fluent in a 200,000-line codebase pay a tax to translate the agent’s general knowledge into their specific structure. Hallucinations in complex codebases: the model invents APIs that look right and fail at runtime. Integration friction: stitching the agent’s output into existing patterns is its own kind of labor.
Brownfield work, which is where most professional engineering happens, looks nothing like benchmark conditions. The repos are old. The patterns are inconsistent. The history is opaque. The codebase has private idioms the agent has never seen.
Follow-up work with newer models and more capable agent harnesses may well show better numbers, and there is good reason to expect it will. But the 2025 result stands as a warning about how benchmark gains translate into real brownfield productivity. The translation is not automatic. Teams that assumed it would be have paid for that assumption with calendar time and morale.
There is a softer version of this story too. Greenfield prototyping, where the agent gets to make all the structural decisions on a fresh canvas, can deliver genuine speedups, sometimes substantial. Integration work on existing systems clusters much closer to break-even, and often lands on the slower side once review overhead is honestly accounted for. Mixed work falls somewhere between. The exact figures vary by team, codebase, and study, but the shape is consistent enough to plan around. The benchmark score is essentially measuring the greenfield-shaped case and getting labeled “software engineering” on the way to the press release.
Why the trajectory still matters
So far this piece has been about why today’s scores are misleading. The harder and more honest point is the trajectory.
Three years ago, top models could not break 5% on SWE-Bench. Today they break 90% on Verified, and the standardized public Pro leaderboard sits in the mid-40s, with vendor custom-scaffold runs reported materially higher. The exact number depends on the harness, and that variance is itself part of the story. Even after discounting Verified for contamination, the slope on Pro is real. Pro top scores were under 30% twelve months ago and have climbed steadily since.
Terminal-Bench 2.0, a different benchmark that measures end-to-end terminal operation (building Linux kernels, configuring servers, modernizing COBOL codebases), has shown the same trajectory. Top scores climbed substantially in the year after launch as both model capability and scaffolding sophistication compounded. The benchmarks themselves are getting harder, and the models keep catching up to where the benchmarks were yesterday.
You can argue about how to weight any single score. The slope is harder to argue with. Engineers who orient around what AI cannot do today will find that set shrinking measurably, year by year. The line where “the agent handles it” sits in 2026 is not where it will sit in 2028. The takeaway is not that the profession is going away. It is that the work inside the profession is being relocated, and the relocation is faster than most engineers are pricing into their career bets.
What we see in production
At Qandaba we work with engineering organizations adopting these agents in real workflows, and the gap between benchmark performance and production reliability is the territory we spend most of our time on. The recurring failure modes show up in every serious deployment: context degradation as agents lose track of what they were doing, specification drift between what was asked and what was built, sycophantic agreement when the model should be pushing back, tool selection errors when the agent reaches for the wrong utility, cascade failures where one wrong decision compounds across an agent run, silent failures where the code runs and produces wrong output without flagging anything. None of these show up on a benchmark. All of them show up the first time you ship.
Three takeaways for engineers
If you are an engineer or engineering leader trying to make sense of all this, three things are worth holding onto.
First, treat benchmark scores as signal about constrained, well-scoped task performance under conditions that are unfair to humans. They are not verdicts on the profession. The 93% on Verified does not mean what the headline suggests; the mid-40s on the standardized public Pro leaderboard is closer to a credible read on current capability against genuinely fresh problems. Read every future score the same way: ask what kind of problem, under what conditions, with what scaffolding, on what subset.
Second, the scaffolding around the model matters more than the model version for production outcomes. The 5-point swing from changing harnesses on the same model tells you where the leverage actually lives. Engineers who learn to build, configure, and verify agent systems will compound advantages. Engineers who stop at “I prompted the chat window” will find their work increasingly commodified, because that is the exact task each new model release is targeting.
Third, the trajectory is real even after you discount the hype. The line moves. Orient your skill development around the parts of engineering that benchmarks cannot yet capture: system design, ambiguity resolution, contextual judgment, the slow accumulation of domain knowledge that makes a good engineer hard to replace. Those are the durable parts. Not because they will never be automated, but because they will be automated last, and the years between now and then are where careers compound.
The benchmarks are not lying. They are answering a more specific question than most readers think they are. Read them that way and you will be ahead of the next wave of headlines, instead of reacting to them.