
How to Fix Flaky iOS UI Tests

January 23, 2026

Chris Wood

Founder, qckfx

Flaky tests are the number one reason teams stop trusting their test suite. A test that passes 90% of the time is arguably worse than no test at all, because it trains developers to ignore failures. When the test that cried wolf finally catches a real regression, nobody is listening.

iOS UI tests are particularly prone to flakiness. XCUITest, Apple's standard UI testing framework, runs your tests in a separate process that communicates with your app over an accessibility bridge. Every interaction crosses a process boundary, every assertion depends on the app's UI state settling, and every network call introduces variability. The result is tests that work on your machine, fail in CI, pass again when you re-run, and fail differently the next day.

This post digs into the root causes of flaky iOS UI tests and walks through the standard mitigations, their limitations, and a fundamentally different approach that eliminates flakiness at its source.

Why iOS UI Tests Flake

Flakiness in iOS UI tests comes from four main sources. Understanding each one is important because the right fix depends on which cause you're dealing with.

Timing Issues

iOS apps are full of animations, asynchronous data loading, and transitions. When a test taps a button, the resulting view might take 300 milliseconds to animate into place. The test runner doesn't know about your animation curve. It just fires the next assertion, and if the view hasn't appeared yet, the test fails.

Race conditions between the test runner and the app are the most common source of flakiness. The test process and the app process run independently. A network response might arrive before or after the test checks for a loading spinner. A navigation animation might complete before or after the test looks for the destination view. These timing windows are measured in tens of milliseconds, so the test passes most of the time but fails unpredictably under load or on slower CI machines.

Network Dependency

Many UI tests hit real API endpoints. This introduces multiple sources of variability: the server might be slow, the response payload might have changed, a rate limiter might kick in, or the server might be down entirely. Even if you control the API, response times fluctuate. A test that passes in 200ms on a fast network might time out on a congested CI runner.

Worse, real API responses change over time. A test that asserts on specific content from a feed or a list will break when the backend data changes, even though the app code is correct. This creates false negatives that erode trust in the suite.

State Leakage

iOS simulator state persists between test runs unless you explicitly reset it. If Test A logs in and Test B assumes a logged-out state, Test B will fail when run after Test A. Keychain entries, user defaults, cached data, and file system artifacts all carry over. This means test order matters, and tests that pass individually can fail when run as part of a suite.

State leakage is especially insidious because it often doesn't manifest locally. A developer runs the failing test in isolation, it passes, and they conclude the flakiness is “just CI being weird.” The real problem is that CI runs the full suite in a different order or with accumulated state from previous pipeline runs.

Selector Fragility

XCUITest identifies elements by accessibility identifiers, labels, or their position in the view hierarchy. When a designer restructures a screen, renames a component, or wraps a view in a new container, selectors break. The test fails not because the app is broken, but because the test can't find the element it's looking for.

Teams mitigate this by adding stable accessibility identifiers to every testable element, but this adds overhead and pollutes the codebase with test-specific metadata. It also requires ongoing maintenance: every new view needs identifiers, and every refactor risks orphaning old ones.
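To make the tradeoff concrete, here is a minimal sketch of the identifier-based approach. The identifier name and screen are hypothetical, but `accessibilityIdentifier` and the XCUITest query API are the standard mechanism:

```swift
import XCTest

// In app code (UIKit), attach a stable identifier that survives copy changes
// and view-hierarchy refactors; "checkout_button" is a hypothetical name:
//
//   checkoutButton.accessibilityIdentifier = "checkout_button"

final class CheckoutScreenTests: XCTestCase {
    func testCheckoutButtonExists() {
        let app = XCUIApplication()
        app.launch()

        // Query by identifier rather than by visible label text or position,
        // so renaming the button's title or re-nesting the view won't break
        // this selector. The identifier itself still has to be maintained.
        let checkout = app.buttons["checkout_button"]
        XCTAssertTrue(checkout.exists)
    }
}
```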

The Standard Fixes (And Why They Fall Short)

The iOS testing community has developed a set of standard practices for dealing with flakiness. Each one helps, but none fully solves the problem.

waitForExistence and Expectation Timeouts

The most common fix for timing issues is waitForExistence(timeout:) or XCTest expectations with explicit timeouts. Instead of asserting immediately, the test waits up to N seconds for an element to appear.

This helps, but it introduces arbitrary delays. Set the timeout too low, and the test still flakes on slow machines. Set it too high, and your test suite takes forever. Either way, you're papering over the timing problem rather than solving it. The test still depends on the app reaching a specific state within a time window. You've just made the window wider.
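A typical use of the pattern looks like this (the element identifiers and flow are hypothetical; `waitForExistence(timeout:)` is the real XCTest API):

```swift
import XCTest

final class FeedTests: XCTestCase {
    func testFeedAppearsAfterLogin() {
        let app = XCUIApplication()
        app.launch()

        app.buttons["login_button"].tap()  // hypothetical identifier

        // Wait up to 5 seconds for the feed to appear instead of asserting
        // immediately. This widens the timing window but does not remove it:
        // a slow CI machine can still blow past the timeout.
        let feed = app.tables["feed_table"]
        XCTAssertTrue(feed.waitForExistence(timeout: 5))
    }
}
```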

Mock Servers and Stub Responses

To eliminate network variability, teams set up local mock servers or inject stub responses into the networking layer. This is effective. Fixed responses mean fixed behavior.

The cost is setup and maintenance. Every API endpoint your tests touch needs a corresponding mock. When the API changes, mocks need updating. For apps with dozens of endpoints, the mock layer can become a significant codebase of its own. Teams also need to decide what to mock (just the network? or also things like push notifications, location services, and in-app purchases?), and each decision adds complexity.
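One common way to inject stubs without running a separate server is a custom URLProtocol registered on the app's URLSession. A minimal sketch, with a hypothetical endpoint path and fixture:

```swift
import Foundation
#if canImport(FoundationNetworking)
import FoundationNetworking
#endif

// Intercepts requests whose path has a registered fixture and serves the
// canned body instead of hitting the network.
final class StubURLProtocol: URLProtocol {
    // Path -> canned response body (hypothetical fixtures).
    static var stubs: [String: Data] = [:]

    override class func canInit(with request: URLRequest) -> Bool {
        guard let path = request.url?.path else { return false }
        return stubs[path] != nil
    }

    override class func canonicalRequest(for request: URLRequest) -> URLRequest {
        request
    }

    override func startLoading() {
        guard let url = request.url, let body = Self.stubs[url.path] else { return }
        let response = HTTPURLResponse(url: url, statusCode: 200,
                                       httpVersion: "HTTP/1.1",
                                       headerFields: ["Content-Type": "application/json"])!
        client?.urlProtocol(self, didReceive: response, cacheStoragePolicy: .notAllowed)
        client?.urlProtocol(self, didLoad: body)
        client?.urlProtocolDidFinishLoading(self)
    }

    override func stopLoading() {}
}

// Wire the stub into a session configuration the networking layer uses.
StubURLProtocol.stubs["/v1/feed"] = Data(#"{"items":[]}"#.utf8)
let config = URLSessionConfiguration.ephemeral
config.protocolClasses = [StubURLProtocol.self]
let session = URLSession(configuration: config)
```

Every stubbed endpoint needs an entry in that table, which is exactly the maintenance cost described above.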

Resetting the Simulator Between Runs

To fix state leakage, teams reset the simulator between tests. This works, but it's slow. A full simulator reset adds 10-30 seconds per test. For a suite of 50 tests, that's an extra 8-25 minutes of CI time. Even a partial reset (clearing user defaults and keychain) adds overhead and still doesn't address timing or network issues.
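The partial-reset variant is often implemented as a launch-argument convention between the test and the app. This is a sketch of that convention, not a framework feature: the `--reset-state` flag is a hypothetical name your app code would have to check for and honor:

```swift
import XCTest

final class CleanStateTestCase: XCTestCase {
    override func setUpWithError() throws {
        continueAfterFailure = false
        let app = XCUIApplication()
        // Hypothetical convention: at launch, the app checks
        // ProcessInfo.processInfo.arguments for "--reset-state" and, if
        // present, clears UserDefaults, keychain entries, and cached files
        // before showing UI. Much cheaper than erasing the simulator.
        app.launchArguments += ["--reset-state"]
        app.launch()
    }
}
```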

Retry on Failure

The bluntest instrument: if a test fails, run it again. If it passes the second time, call it a success. Many CI configurations do this automatically, and XCTest even has built-in support for test repetition modes.

Retries mask the problem instead of fixing it. A test that requires two attempts to pass is still flaky. You're spending double the CI time, and the retry policy gives teams an excuse not to investigate the root cause. Over time, more tests start requiring retries, CI gets slower, and the suite becomes a maintenance burden rather than a safety net.

Eliminating Flakiness at the Root

All of the standard fixes share a common limitation: they treat symptoms rather than the underlying cause. The fundamental reason iOS UI tests are flaky is that they execute live against a non-deterministic environment. The app talks to real (or semi-mocked) servers. Animations play at real speed. State carries over from previous runs. Each of these moving parts introduces variability, and variability produces flakiness.

A fundamentally different approach is to remove the non-determinism entirely. Instead of running the app live and hoping everything lines up, capture the entire session (network traffic, timing, interaction events, app state) during a known-good run, then replay that session identically every time.

Record and Replay

The record-and-replay approach works by intercepting everything that makes a test run non-deterministic and replacing it with recorded data. Network responses are captured during recording and served from the recording during replay. This eliminates API variability entirely. There's no server to be slow, no payload to change, no rate limiter to trigger.

Non-deterministic data like timestamps and UUIDs are seeded during replay, so the app sees the same values every time. This prevents date-dependent UI from shifting between runs and eliminates a whole class of comparison failures that have nothing to do with actual regressions.
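The idea behind seeding can be sketched in application code by routing every source of non-determinism through an injectable value. The names here are illustrative, not qckfx's API:

```swift
import Foundation

// One injection point for everything non-deterministic the UI reads.
struct AppEnvironment {
    var now: () -> Date
    var makeUUID: () -> UUID
}

// The live app reads the real clock and generates real UUIDs.
let live = AppEnvironment(now: Date.init, makeUUID: UUID.init)

// Under replay, both are pinned to recorded values, so date-dependent UI
// and UUID-keyed views render identically on every run.
let replay = AppEnvironment(
    now: { Date(timeIntervalSince1970: 1_737_600_000) },
    makeUUID: { UUID(uuidString: "00000000-0000-0000-0000-000000000001")! }
)
```

A replay system applies this substitution for you at the framework level; the point is that once time and identifiers are pinned, two runs of the same flow can be compared byte for byte.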

For verification, rather than relying on brittle selectors to find and assert on individual elements, a replay-based system can compare the entire screen visually. Accessibility tree matching handles minor layout shifts, font rendering differences, and other pixel-level noise without breaking the test. When something does change, you get a visual diff showing exactly what's different, not a cryptic “element not found” error.

The result is a test that produces the same output every time it runs. There are no timing windows, no network variability, no state leakage, and no selector maintenance. The test either matches the baseline or it doesn't, and when it doesn't, you can see exactly why.

How qckfx Makes iOS Tests Deterministic

qckfx implements the record-and-replay approach for iOS. The workflow starts with recording: you use your app in the simulator normally, tapping through flows, scrolling, entering text. No test code is needed. qckfx captures every interaction event and every network response behind the scenes.

After recording, you or your AI coding agent can replay the test at any time and see the results. Every replay uses the exact same network responses, the exact same timing, and the exact same app state as the original recording. The replay is deterministic by construction, not by accident.

Visual diffing is how qckfx verifies correctness. Instead of writing assertions like “the label should say X” or “the button should be at position Y,” qckfx compares the screen at each step of the replay against the recorded baseline. If a real regression is introduced (a missing button, a broken layout, incorrect text), the diff highlights it. If nothing meaningful changed, the test passes. This catches regressions that selector-based tests would miss, like a view being rendered off screen or an image failing to load.

For AI coding agents, qckfx provides an MCP server that Claude Code, Cursor, or Codex can call directly. The agent runs a test and gets back a pass/fail result along with screenshots of any diffs, logs from the app during the test run, and a timeline of network requests highlighting anything anomalous. This gives the agent enough context to diagnose and fix failures without human intervention, and without burning tokens on screenshot interpretation loops.

Because every source of non-determinism is controlled, qckfx tests don't flake. They produce the same result whether you run them locally, in CI, on a fast machine, or on a slow one. The test is a comparison against a fixed baseline, and comparisons are deterministic.

Getting Started

Install qckfx via Homebrew:

brew install --cask qckfx

Alternatively, you can install from the tap directly:

brew install qckfx/tap/qckfx

Once installed, launch qckfx and open your app in the iOS Simulator. Start a recording session, then use your app normally. Tap through the flow you want to test, whether that's a login sequence, a checkout process, or a settings screen. When you're done, stop the recording. qckfx saves the entire session as a replayable test, including all network traffic.

To verify your app still works after a code change, replay the test. qckfx will run through the same flow with the same network responses and compare every screen against the baseline. If something changed, you'll see exactly what. If everything matches, the test passes.

If you're using an AI coding agent, install the qckfx MCP server from the menu bar icon and your agent can run tests directly. The agent makes a change, replays the tests, sees the results, and iterates. No manual verification needed.

Learn more at qckfx.com.