
Agent-Written Tests Can't Verify Agent-Written Code

March 1, 2026

Chris Wood

Founder, qckfx

Most teams try agents the same way. Set up Claude Code or Cursor, assign tickets, let them run. The code comes back looking good. Then they realize they have no way to know if it actually is.

Their test coverage isn't enough to catch what might have broken. The tests they do have are unreliable, test the wrong things, or both. So they end up manually QA'ing every agent PR: pulling it locally, running the app, clicking through flows. The teams shipping 50%+ of PRs from agents figured something else out.

The problem isn't the agents. It's everything underneath them.

Testing has always been broken

Even when your test infrastructure works, your tests only cover what you thought to assert on. You write a test that checks the login button navigates to the home screen. It does. But you didn't write a test for the fact that the profile image now renders at the wrong size, or that a network request that used to fire on login stopped firing, or that the loading state flickers for 200ms before settling. Those are the regressions that ship to users.

Teams have been absorbing these costs for years. It worked well enough when the person writing the code was the same person interpreting the test results. They had context. They knew what they changed and could reason about whether a failure was related. Agents broke that assumption. You're outsourcing the coding, but the verification burden stays with you, and you have less context to do it with.

The obvious response is to have agents write more tests to improve coverage. But who verifies the tests? If you need tests to verify agent output because you can't trust it, you can't trust agent-written tests either. The verification layer has to be independent of the thing being verified.

What verification actually requires

When you write code yourself, you can interpret test failures because you know what you changed. A failing test either makes sense given your change or it doesn't. That reasoning breaks down when an agent wrote the code, and breaks down further when the agent also wrote the test. You're looking at a failure with no context for why the code changed or what the test was supposed to exercise.

The most obvious requirement is independence. If the agent writes the code and also writes the verification, you haven't verified anything. The check has to come from outside that loop. That rules out the "just have agents write more tests" approach entirely.

It also has to be deterministic. If you run the same check against the same code twice and get different results, you can't distinguish a regression from noise. Network responses, timestamps, random values, external state: all of it has to be controlled or you're back to flaky tests that everyone learns to ignore.
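As a sketch of what "controlled" means in practice (all names here are hypothetical, not a real framework): every non-deterministic input the application reads gets pinned behind an injectable seam, so two runs of the same code see byte-identical inputs.

```python
import random
from dataclasses import dataclass, field

# Hypothetical sketch: time, randomness, and network are injected
# rather than read from the real world, so replays are exact.
@dataclass
class FixedEnvironment:
    now: float = 1_700_000_000.0                   # frozen clock
    seed: int = 42                                 # seeded RNG
    responses: dict = field(default_factory=dict)  # canned network replies

    def clock(self) -> float:
        return self.now

    def rng(self) -> random.Random:
        return random.Random(self.seed)

    def fetch(self, url: str) -> str:
        # Replay the recorded response instead of hitting the network.
        return self.responses[url]

def run_flow(env: FixedEnvironment) -> dict:
    # Stand-in for "replay a user flow": everything observable
    # derives only from the injected environment.
    return {
        "timestamp": env.clock(),
        "session_id": env.rng().randint(0, 10**6),
        "profile": env.fetch("/api/profile"),
    }

env = FixedEnvironment(responses={"/api/profile": '{"name": "Ada"}'})
assert run_flow(env) == run_flow(env)  # two runs, identical output
```

With every input pinned, a difference between two runs can only come from one place: the code.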

There's a subtler requirement too. Traditional tests check what you predicted would break. The regressions that actually ship are the ones nobody predicted. So the verification has to be broader than what you thought to assert on. And it has to work for someone who doesn't know the codebase well, because that's the whole point: you're not going to read every line the agent wrote.

So you need a check that's independent of the agent, deterministic, and catches things nobody thought to test for. The only thing that satisfies all three is treating the running application itself as the source of truth. Don't verify the code. Verify what the code produces.

Record a user flow. Replay it deterministically. Diff everything that happened. If the only thing that changed between runs is the code, then every difference in the output is a consequence of that code change. You see what moved, what broke, what disappeared. You don't need to understand the code. You just need to decide if the differences are what you expected.
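A minimal sketch of the replay-and-diff idea (illustrative only, with made-up event records): treat each run as an ordered log of everything observable, then report every difference between the baseline run and the candidate run, not just what an assertion predicted.

```python
# Illustrative sketch: a recorded run is an ordered list of
# (event, detail) pairs -- renders, network calls, state changes.
def diff_runs(baseline: list, candidate: list) -> list:
    """Report every difference between two recorded runs."""
    diffs = []
    for i, (old, new) in enumerate(zip(baseline, candidate)):
        if old != new:
            diffs.append(f"step {i}: {old!r} -> {new!r}")
    if len(baseline) != len(candidate):
        # Events that fired in one run but not the other.
        if len(baseline) > len(candidate):
            longer, label = baseline, "disappeared"
        else:
            longer, label = candidate, "appeared"
        for extra in longer[min(len(baseline), len(candidate)):]:
            diffs.append(f"{label}: {extra!r}")
    return diffs

before = [
    ("request", "GET /api/profile"),
    ("render", "avatar 64x64"),
    ("render", "line item: Pro plan"),
]
after = [
    ("request", "GET /api/profile"),
    ("render", "avatar 48x48"),  # regression nobody asserted on
]

for d in diff_runs(before, after):
    print(d)
# Surfaces both the avatar size change and the vanished line item,
# neither of which any hand-written assertion predicted.
```

The point of the sketch is the asymmetry: writing it required no knowledge of why the code changed, but reading its output is enough to decide whether the differences were intended.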

Compare that to reading a 400-line diff across six files, trying to hold the whole flow in your head, and hoping you notice that a network call moved and a line item disappeared.

This is the problem I've been working on with qckfx. Look at how your team reviews agent work today. If the answer involves reading diffs and manually running the app, that's what's holding you back.