Giving Your Agent Eyes Is Not Enough
AI coding agents can build UI. But they can't verify it.
They can screenshot the simulator. They can dump the accessibility tree. But when your agent looks at your UI, it has no idea if the button is in the right place. It can tell you that the button exists, but it can't tell you that it shifted 12 pixels left after that last refactor.
This is the gap in today's tools: agents can see what's on screen, but they can't tell what's wrong.
Verification requires comparison. You know something is wrong because you remember what it looked like before. Your agent doesn't remember anything.
Agent + Simulator ≠ Testing
You've probably tried giving your agent access to the simulator: tapping around, taking screenshots, checking the accessibility tree.
But here's why it's not working:
- It's slow. Getting the accessibility tree takes time on complex UIs. Then you feed that to the LLM, and inference is slow. Then you tap something, take a screenshot, and feed that back to the LLM, which is slow again. Your feedback loop takes minutes and burns through tokens.
- It can't measure. Is the corner radius 16 or 24? Is that padding 8 or 4? The agent can eyeball it, but "looks about right" is not how you ship. Compression, tokenization, and attention limits mean that subtle bugs slip through unnoticed.
- It's random. Letting an agent tap around might hit some paths. But is it hitting the critical flows every time, or different ones each run? Random exploration means random coverage. You can't trust that when things are mission-critical.
What's Required for UI Verification
A reference state. A known-good snapshot of what the UI looked like when it worked.
Then it's simple: run the same flow. Capture the result. Diff it. Anything that changed gets flagged, and you see exactly what changed. The agent doesn't have to guess whether the UI is correct. It just looks at the diff and reasons about whether the change was intentional.
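If you want a feel for what "diff it" means, here's a minimal sketch in Swift: load a reference screenshot and a fresh capture, then count the pixels that differ. The file names and threshold are placeholders I made up; a real diff also localizes and visualizes the change, but the comparison idea is the same.

```swift
import CoreGraphics
import ImageIO
import Foundation

// Load a PNG from disk as a CGImage.
func loadImage(at url: URL) -> CGImage? {
    guard let source = CGImageSourceCreateWithURL(url as CFURL, nil) else { return nil }
    return CGImageSourceCreateImageAtIndex(source, 0, nil)
}

// Render an image into a raw RGBA buffer so pixels can be compared byte-for-byte.
func rgbaPixels(of image: CGImage) -> [UInt8]? {
    let width = image.width, height = image.height
    var data = [UInt8](repeating: 0, count: width * height * 4)
    let drawn = data.withUnsafeMutableBytes { buffer -> Bool in
        guard let context = CGContext(
            data: buffer.baseAddress, width: width, height: height,
            bitsPerComponent: 8, bytesPerRow: width * 4,
            space: CGColorSpaceCreateDeviceRGB(),
            bitmapInfo: CGImageAlphaInfo.premultipliedLast.rawValue
        ) else { return false }
        context.draw(image, in: CGRect(x: 0, y: 0, width: width, height: height))
        return true
    }
    return drawn ? data : nil
}

// Fraction of pixels whose RGB bytes differ between reference and capture.
func pixelDiffRatio(reference: CGImage, capture: CGImage) -> Double? {
    guard reference.width == capture.width, reference.height == capture.height,
          let a = rgbaPixels(of: reference), let b = rgbaPixels(of: capture)
    else { return nil } // a size mismatch is itself a change worth flagging
    var changed = 0
    for i in stride(from: 0, to: a.count, by: 4)
        where a[i] != b[i] || a[i + 1] != b[i + 1] || a[i + 2] != b[i + 2] {
        changed += 1
    }
    return Double(changed) / Double(reference.width * reference.height)
}

// Usage: flag the screen if more than 0.1% of pixels moved (threshold is arbitrary).
if let reference = loadImage(at: URL(fileURLWithPath: "reference.png")),
   let capture = loadImage(at: URL(fileURLWithPath: "capture.png")),
   let ratio = pixelDiffRatio(reference: reference, capture: capture) {
    print(ratio > 0.001 ? "changed: \(ratio * 100)% of pixels differ" : "matches reference")
}
```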
This is why I built qckfx
qckfx records your simulator sessions and replays them as tests. You tap through a flow, and it captures everything: screenshots, touch events, network responses, even disk and keychain state. That becomes your reference.
When you run the test, it replays the same flow and diffs every screen against the original. No accessibility tree parsing. No LLM in the loop during playback. The slow work happens at recording time, not during test runs.
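To make "reference" concrete, here's a rough sketch of the kind of data a recorded session could hold. The names and structure are my own guesses for illustration, not qckfx's actual format.

```swift
import CoreGraphics
import Foundation

// Purely illustrative: a hypothetical shape for a recorded session.
// Every field name here is an assumption, not qckfx's real schema.
struct RecordedSession: Codable {
    struct Step: Codable {
        let touch: CGPoint                    // where you tapped while recording
        let screenshotPath: String            // known-good screenshot captured after the tap
        let networkResponses: [String: Data]  // canned responses keyed by request URL
    }
    let appBundleID: String
    let initialDiskSnapshotPath: String       // app container state at the start of the flow
    let initialKeychainItems: [String: Data]  // keychain state at the start of the flow
    let steps: [Step]                         // the flow, replayed step by step at test time
}
```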
The result is a visual diff that tells your agent exactly what changed and where. With MCP support for Claude Code and Cursor, your agent gets that diff image directly. One inference call to reason about whether the changes were intentional. Not a back-and-forth exploration burning tokens and minutes.
No test code. No SDK. No setup. Just record what you already do, and let the agent verify it.
It's free and runs locally. If you're building iOS with AI agents, give it a shot: qckfx.com