May 4, 2026 · 6 min read

Claude Code is writing your software, but who's testing it?

AI coding tools dramatically accelerate software development, but without an independent, autonomous testing layer to validate what they generate, they mostly turn faster coding into faster shipping of undetected defects.


Who's Testing That Code?

We are living through one of the strangest moments in the history of software development. Engineers are shipping more code than ever, and individual productivity is through the roof. Yet, when you talk to engineering leaders, very few will tell you their organization is actually shipping faster or better. Something is off. And if you look closely, the answer is obvious: we solved the coding bottleneck and forgot to move the testing bottleneck with it.

The Productivity Story Is Real

Let me start with what is happening. The productivity gains from tools like Claude Code are not marketing. They are real and large.

Jaana Dogan, a Principal Engineer at Google working on Gemini, recently shared that she gave Claude Code a description of a distributed agent orchestrator problem, and it generated, in about an hour, what her team had been trying to build for a year. She was careful to clarify it was a proof of concept, not production-ready code, but still. An hour versus a year. [1]

Boris Cherny, the creator of Claude Code, reported that in a recent thirty-day window, 100% of his contributions to Claude Code itself were written by Claude Code. One hundred percent. The tool is now building itself. [1]

The numbers at the team level are just as striking. Developers using Claude Code are completing 21% more tasks and merging 98% more pull requests. [2] Teams that used to measure a feature in months are now measuring it in weeks. Individual developers are shipping full Next.js applications — auth, database, payments, the whole stack — from a single prompt, in an afternoon. [3]

If you are a developer and you are not using agentic coding tools yet, you are about to be out-shipped by everyone who is. That part of the story is settled.

And Yet Nothing Is Actually Shipping Faster

Here is where it gets weird.

The same research that shows those enormous individual gains also shows that organizational DORA metrics — deployment frequency, lead time, change failure rate, mean time to recovery — have remained largely unchanged. [2] Let that sink in. Individual developers are merging 98% more pull requests, and the business is deploying at roughly the same pace it was before.

Where is all that productivity going?

It is going into the code review queue. It is going into testing in the staging environment. It is going into production issues that customers catch. Benchmark data now show a 9% increase in cycle time for AI-heavy teams and a 23.5% increase in security incidents per pull request. [4] The time engineers used to spend writing code is now spent debugging failed deployments and analyzing unexpected system behavior.

The AI-generated code moved the bottleneck.

Why Claude Code Cannot Validate Its Own Work

I want to be careful here because I am not criticizing Claude Code. It is a fantastic tool, and our team uses it every day. But it is a code generator, not an oracle. And the research on what AI-generated code actually looks like when it hits production is sobering.

AI-generated pull requests contain about 1.7 times more issues than human-only PRs. Excessive I/O operations are roughly 8 times more common. [5] Studies have found that roughly 45% of AI-generated code contains security flaws, 26.6% of AI-generated programs produce incorrect outputs, and silent logic failures — code that passes tests but fails in production edge cases — account for about 60% of faults. [6] Even the best-performing models generate around two quality issues for every unit test they pass. [7]

There is a specific failure pattern that matters for this conversation. When you ask an AI to write the code and then ask the same AI to write the tests for that code, the tests often restate the same logic as the implementation. If the implementation sorts a list, the test recreates the same sorting logic and verifies the output using the same assumptions. The test passes even when the implementation is subtly wrong. Coverage tools report a beautiful green checkmark. Everyone feels safe. Nobody is. [8]
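To make that pattern concrete, here is a minimal TypeScript sketch. Everything in it is hypothetical and invented for illustration: a sortScores function with a classic JavaScript bug, a Jest-style tautological test that mirrors the bug, and an independent test that catches it.

```typescript
// Hypothetical implementation with a subtle bug: Array.prototype.sort()
// with no comparator converts elements to strings, so numbers sort
// lexicographically.
function sortScores(scores: number[]): number[] {
  return [...scores].sort(); // [10, 2, 1] comes back as [1, 10, 2]
}

// The tautological test: it re-derives the expected value with the same
// flawed logic as the implementation, so it passes no matter what.
test("sortScores sorts a list (tautological)", () => {
  const input = [10, 2, 1];
  expect(sortScores(input)).toEqual([...input].sort()); // green checkmark
});

// An independent test takes its expected value from outside the
// implementation and exposes the bug immediately.
test("sortScores sorts numerically (independent)", () => {
  expect(sortScores([10, 2, 1])).toEqual([1, 2, 10]); // fails: got [1, 10, 2]
});
```

A coverage report treats both tests identically. Only the second one is doing any validating.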

This is the core problem: you cannot ask the student to grade their own exam. An AI code generator optimizing for plausible output and an AI validator optimizing for breaking that output must be different systems with distinct objectives, looking at the software from different angles. Otherwise, the whole development loop collapses into an illusion of quality.

Unit Tests Were Never Going to Save Us

The natural response to all of this is to say, "Fine, we will just write more unit tests." And many teams are doing exactly that, often by asking Claude Code to generate tests alongside the code.

Unit tests help in the lower-left of Brian Marick's agile testing quadrants: technology-facing tests that support the team. They do not help much anywhere else. Most of the defects that hurt real users live at the system level: a workflow that spans six screens and three services, a form that behaves differently on the twelfth step than on the second, a race condition that only shows up when two users hit the same endpoint, a business rule that is satisfied in isolation but violated in combination with another feature.
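That race condition category, for instance, can hide in code this small. A hedged sketch, with a hypothetical bookSeat handler and a stubbed payment call:

```typescript
// Hypothetical booking handler with a check-then-act race. Each call is
// correct in isolation; two interleaved calls oversell the last seat.
let seatsLeft = 1;

// Stub for an external payment call; the await yields the event loop.
function chargeCard(): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, 10));
}

async function bookSeat(): Promise<string> {
  if (seatsLeft > 0) {   // check
    await chargeCard();  // a second request can run right here
    seatsLeft -= 1;      // act: both requests decrement
    return "booked";
  }
  return "sold out";
}

// Both requests pass the check before either one decrements:
Promise.all([bookSeat(), bookSeat()]).then((results) => {
  console.log(results, seatsLeft); // [ 'booked', 'booked' ] -1
});
```

A unit test that calls bookSeat once will always pass. Only a system-level test that fires two concurrent requests ever sees the oversold seat.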

Achieving meaningful coverage at the system level has always been the hardest problem in testing. The combinatorics are brutal. A modest web application with ten workflows, each with a handful of decision points and input variations, generates thousands of meaningful paths. Humans cannot write enough scripts to cover that surface. Traditional automation frameworks break whenever the UI changes. And now Claude Code is giving each of your developers the ability to ship three times as many features into that already-under-tested surface.
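A back-of-the-envelope count makes the point. Every number below is an assumption chosen for illustration, not a measurement:

```typescript
// Illustrative path count; every figure here is an assumption.
const workflows = 10;
const decisionPoints = 6; // assume binary branches per workflow
const inputVariants = 4;  // e.g., typical, boundary, empty, invalid

// Each workflow: 2^6 branch combinations x 4 input variants = 256 paths.
const pathsPerWorkflow = 2 ** decisionPoints * inputVariants;

// 2560 paths, before cross-feature interactions multiply the total further.
console.log(workflows * pathsPerWorkflow);
```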

The math does not work. You cannot out-type this problem with more human testers, and you cannot out-script it with more Selenium.

The Missing Piece: An Independent Autonomous Validator

This is the part of the industry conversation that still has not caught up to reality. If your coding layer is autonomous, your testing layer needs to be autonomous too. And critically, it needs to be a separate system, with its own objectives, looking at your application from the outside — the way a user would — without access to the source code the generator just produced.

That is exactly what Testaify does. Testaify discovers your application on its own, designs test cases using real testing techniques like boundary value and state transition coverage, generates its own input data, and executes those tests end-to-end. It does not read your code. It exercises your application the way a skilled tester would, at a scale no human team can match, and it keeps doing it every time the code changes. That is what "autonomous" actually has to mean at the validation layer: not just automated execution, but autonomous design and coverage.
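For readers unfamiliar with the technique names, here is what boundary value analysis looks like in its textbook form. This is a generic sketch of the technique itself, not Testaify's implementation:

```typescript
// Textbook boundary value analysis: off-by-one defects cluster at range
// edges, so test at and immediately around each boundary.
function boundaryValues(min: number, max: number): number[] {
  return [min - 1, min, min + 1, max - 1, max, max + 1];
}

// For a hypothetical quantity field constrained to 1..100:
console.log(boundaryValues(1, 100)); // [0, 1, 2, 99, 100, 101]
```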

Pair something like Claude Code on the generation side with something like Testaify on the validation side, and you start to see how the math could actually work. Developers ship more. An independent system, with different goals and a different view of the application, tries to break what was shipped. The quality bottleneck stops being a tax on your productivity gains and starts becoming part of the loop. DORA metrics finally move in the right direction.

Without that second system, you do not have faster software development. You have a faster accumulation of undetected defects, wrapped in a green CI pipeline.

The Future Is Autonomous on Both Sides

The agentic coding wave is not slowing down. Claude Code is going to keep getting better, the tools around it are going to keep getting better, and the amount of code flowing into your application is going to keep going up. The question every engineering leader should be asking right now is not "how do I get more out of Claude Code?" You already are. The question is "what is my independent validator, and is it autonomous enough to keep up?"

If the answer is a pile of flaky Selenium scripts and a shrinking QA team writing test cases by hand, you already know how that story ends.

Stop praying for quality. Build the other half of the loop.

References

[1] Juan Diego, "How I use Claude Code to accelerate my software engineering job and improve my life," DEV Community, January 2026. https://dev.to/juandj/how-i-use-claude-code-to-accelerate-my-software-engineering-job-and-improve-my-life-8o7

[2] Thierry Donneau-Golencer, "How to Measure Claude Code ROI: Developer Productivity Insights with Faros," Faros AI, January 2026. https://www.faros.ai/blog/how-to-measure-claude-code-roi-developer-productivity-insights-with-faros-ai

[3] "Claude Code: Revolutionizing Web Development with AI in 2025," Digital Applied, December 2025. https://www.digitalapplied.com/blog/claude-code-ai-development-revolution

[4] QASource, "AI-generated Code Security Risks: Why Incidents Per Pull Request Have Increased by 23.5%," April 2026. https://www.qasource.com/blog/ai-generated-code-security-risks

[5] CodeRabbit, "State of AI vs Human Code Generation Report," December 2025. https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report

[6] Ranger, "Common Bugs in AI-Generated Code and Fixes," February 2026. https://www.ranger.net/post/common-bugs-ai-generated-code-fixes

[7] "Assessing the Quality and Security of AI-Generated Code: A Quantitative Analysis," arXiv, August 2025. https://arxiv.org/html/2508.14727v1

[8] Nobl9, "A Guide to the Risks of AI Generated Code," February 2026. https://www.nobl9.com/resources/risks-of-ai-generated-code

About the Author

Testaify founder and COO Rafael E. Santos is a Stevie Award winner whose decades-long career includes strategic technology and product leadership roles at organizations like Ultimate Software and Trimble eBuilder. Rafael's goal for Testaify is to deliver continuous, comprehensive testing through its AI-first platform, which will change testing forever.

Take the Next Step

Testaify is in managed roll-out. Request more information to see when you can bring Testaify into your testing process.