Debugging Flaky Tests: A Field Guide to Finding the Real Cause
Flaky tests are not random. Three named classes — async-wait timing, concurrency, and test-order dependency — cover roughly 77% of them. Categorize the failure, replay the run three ways, and the fix stops being a guess.

A test that fails on Tuesday and passes on Wednesday with no code change is not telling you the code is broken. It is telling you the test made an assumption that does not always hold. The instinct is to hit re-run until it goes green. That works once and rots the suite forever: the flake stays, more pile up behind it, and eventually a real regression hides inside the noise.
The faster path is diagnosis, and flakiness is far more diagnosable than its reputation suggests. The failures fall into a small number of named classes, and each class has a tell you can force out with a deliberate re-run. This guide walks the categories, shows how to replay a failing run to identify the class, and ends on feeding that rebuilt run to an AI agent so it can triage the class for you.
What actually causes flaky tests?
Flaky tests are dominated by three named classes: async-wait timing problems at about 45%, concurrency and race conditions at about 20%, and test-order dependency at about 12%. Together they cover roughly 77% of cases in the canonical ten-category taxonomy. Categorizing the failure first is the single fastest route to a real fix instead of an endless re-run.
The reference for this is still Luo et al., An Empirical Analysis of Flaky Tests (FSE 2014), which classified 201 flaky-fix commits across Apache projects into ten root-cause categories. The long tail matters, but the head is what you triage first. An async-wait flake checks a result before an asynchronous operation has finished. A concurrency flake is a race condition, deadlock, or shared mutable state between threads. A test-order-dependency flake passes alone but fails when an earlier test leaves dirty global or database state behind.
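The async-wait class is the easiest to picture in code. The following is a minimal sketch, not taken from the paper: `saveOrder`, `prematureCheck`, `waitFor`, and `settledCheck` are hypothetical names, and the 20 ms delay stands in for any network call or background job. It contrasts an assertion that runs before the work finishes with one that polls until the result is there.

```typescript
// Hypothetical stand-in for any asynchronous operation (network call, queue, render).
function saveOrder(store: Map<string, string>, id: string, delayMs: number): Promise<void> {
  return new Promise<void>((resolve) => {
    setTimeout(() => {
      store.set(id, 'saved');
      resolve();
    }, delayMs);
  });
}

// Flaky shape: kick off the work, then assert without waiting for it.
function prematureCheck(store: Map<string, string>): boolean {
  void saveOrder(store, 'order-1', 20);
  return store.get('order-1') === 'saved'; // reads before the write has landed
}

// Stable shape: poll for the condition up to a deadline instead of guessing a sleep.
async function waitFor(
  condition: () => boolean,
  timeoutMs = 1000,
  intervalMs = 5,
): Promise<boolean> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (condition()) return true;
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
  return condition();
}

async function settledCheck(store: Map<string, string>): Promise<boolean> {
  void saveOrder(store, 'order-1', 20);
  return waitFor(() => store.get('order-1') === 'saved');
}
```

In this sketch the premature check fails every time because the timer cannot fire before the read. In a real suite the operation sometimes finishes first, which is what makes the failure intermittent.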
Two things follow from that distribution. First, most flakes are timing or ordering problems, not deep mysteries — they are fixable once named. Second, the categories map cleanly onto three ways of re-running a failure, which is the trick the next section turns into a procedure.
Replay the failing run three ways
You do not classify a flake by staring at the log. You force the class out by re-running the same failing commit under three deliberately different conditions and seeing which one makes it pass. This is precisely the method GitHub used to cut flaky builds 18x — they reran each failure in the same process, in the same process shifted into the future, and on a different host. Those three scenarios identified 90% of flaky failures automatically and dropped flaky-build commits from about 9% (1 in 11) to under 0.5% (1 in 200).
| Rerun scenario | How to run it | What it catches | Flake class |
|---|---|---|---|
| Same process, immediately | Identical env, re-execute now | Code-level randomness, a race that resolves on a second pass | Concurrency / race |
| Same process, time-shifted | Move the clock forward or inject a delay | Bad time assumptions, premature assertions before async work settles | Async wait (timing) |
| Different host | Re-run on a clean, separate machine | Shared state, leftover fixtures, env / resource / locale differences | Order or environment |
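The time-shifted row only works if the code under test reads time through something the harness can move. A minimal sketch of that idea, assuming a hypothetical `Clock` function type and a hypothetical `isExpired` check (neither comes from GitHub's write-up):

```typescript
// A clock is just a function returning epoch milliseconds.
type Clock = () => number;

interface Session {
  issuedAtMs: number;
  ttlMs: number;
}

// Code under test takes the clock as a parameter instead of calling Date.now() directly.
function isExpired(session: Session, now: Clock): boolean {
  return now() - session.issuedAtMs >= session.ttlMs;
}

// Wrap any clock so that it reports a time offsetMs in the future.
function shiftedClock(base: Clock, offsetMs: number): Clock {
  return () => base() + offsetMs;
}

// Fixed base time keeps the example deterministic.
const fixedNow: Clock = () => 1_700_000_000_000;
const session: Session = { issuedAtMs: fixedNow(), ttlMs: 30_000 };

const expiredAtStart = isExpired(session, fixedNow);
const expiredAfterShift = isExpired(session, shiftedClock(fixedNow, 60_000));
```

A test that assumes the session is still valid gives a different answer under the shifted clock, which is the signal the time-shifted rerun is looking for.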
The logic is subtractive. Passes only when re-run on a different host? Something on the original machine was dirty — leftover state from a prior test (order dependency) or an environment difference. Passes only after a delay or clock shift? The test asserted before the async operation finished (async-wait). Passes intermittently in the same process with nothing else changed? A race condition. You can wire this into CI as a small retry harness rather than running it by hand.
```typescript
// Re-run a single failing test three ways to classify the flake.
// Mirrors GitHub's same-process / time-shifted / different-host scenarios.
type Result = 'pass' | 'fail';

async function runOnce(testId: string, opts: {
  clockOffsetMs?: number;   // shift time to expose async-wait bugs
  host?: 'same' | 'fresh';  // fresh host exposes order/env state
}): Promise<Result> {
  // ...invoke your runner (jest/vitest/playwright) with these conditions
  return 'pass'; // placeholder result until a real runner is wired in
}

export async function classify(testId: string) {
  const sameProcess = await runOnce(testId, { host: 'same' });
  const timeShifted = await runOnce(testId, { host: 'same', clockOffsetMs: 60_000 });
  const freshHost = await runOnce(testId, { host: 'fresh' });

  // Fails in place but passes on a clean machine: dirty state or an env difference.
  if (freshHost === 'pass' && sameProcess === 'fail') return 'order_or_environment';
  // Fails in place but passes once the clock moves: the assertion ran too early.
  if (timeShifted === 'pass' && sameProcess === 'fail') return 'async_wait_timing';
  // Still failing with host and time ruled out: by elimination, suspect a race.
  if (sameProcess === 'fail') return 'concurrency_race';
  return 'not_reproduced'; // could not trigger the flake this run
}
```

Where to look first: big, shared, networked tests
Not all tests are equally suspect. Google's analysis of 4.2 million tests found a near-linear relationship between a test's size and its flakiness rate: small unit tests are rarely flaky, while large integration and emulator tests are the worst offenders. The blunt heuristic, "the larger the test, the more likely it is to be flaky," is a genuinely useful triage filter. Sort your flaky failures by test size and dependency count, and start at the top.
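The sort itself is a few lines. This is a sketch under assumptions: the `FlakyFailure` shape, the three size buckets, and `dependencyCount` are hypothetical fields you would fill from your own test metadata, not something Google's post defines.

```typescript
type TestSize = 'small' | 'medium' | 'large';

interface FlakyFailure {
  testId: string;
  size: TestSize;          // hypothetical bucket, e.g. unit / integration / end-to-end
  dependencyCount: number; // hypothetical count of external services the test touches
}

const sizeRank: Record<TestSize, number> = { small: 0, medium: 1, large: 2 };

// Largest tests first; ties broken by the number of dependencies, highest first.
function triageOrder(failures: FlakyFailure[]): FlakyFailure[] {
  return [...failures].sort(
    (a, b) =>
      sizeRank[b.size] - sizeRank[a.size] || b.dependencyCount - a.dependencyCount,
  );
}

const queue = triageOrder([
  { testId: 'parses-date', size: 'small', dependencyCount: 0 },
  { testId: 'checkout-e2e', size: 'large', dependencyCount: 5 },
  { testId: 'cart-api', size: 'medium', dependencyCount: 2 },
  { testId: 'login-e2e', size: 'large', dependencyCount: 3 },
]);
```

The function copies the input before sorting, so the original failure list stays in arrival order for anything else that reads it.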
There is a second filter: flakes cluster. Parry et al. (EASE 2025) re-ran 10,000 test-suite executions across 24 Java projects and found that 75% of the 810 flaky tests belonged to a co-occurring failure cluster with a mean size of 13.5 tests. Stack-trace inspection pinned intermittent networking and unstable external dependencies as the predominant shared cause. The implication is leverage: fix one shared resource — a flaky test container, an unmocked third-party call — and a dozen flakes go green at once.
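A rough way to look for such clusters in your own CI history is to group tests that fail in exactly the same runs. This is a deliberate simplification and not the clustering method used in the paper; `RunRecord` and `clusterByCoFailure` are hypothetical names.

```typescript
interface RunRecord {
  runId: string;
  failedTests: string[];
}

// Group tests whose sets of failing runs are identical; keep groups of two or more.
function clusterByCoFailure(runs: RunRecord[]): string[][] {
  // Map each test to the list of runs in which it failed.
  const runsByTest = new Map<string, string[]>();
  runs.forEach((run) => {
    run.failedTests.forEach((testId) => {
      const seen = runsByTest.get(testId) ?? [];
      seen.push(run.runId);
      runsByTest.set(testId, seen);
    });
  });

  // Tests with the same sorted run list share a signature and land in one cluster.
  const clusters = new Map<string, string[]>();
  runsByTest.forEach((runIds, testId) => {
    const signature = [...runIds].sort().join('|');
    const members = clusters.get(signature) ?? [];
    members.push(testId);
    clusters.set(signature, members);
  });

  return Array.from(clusters.values())
    .filter((members) => members.length > 1)
    .map((members) => [...members].sort());
}

const sampleRuns: RunRecord[] = [
  { runId: 'r1', failedTests: ['a', 'b', 'c'] },
  { runId: 'r2', failedTests: ['a', 'b'] },
  { runId: 'r3', failedTests: ['d'] },
];
const coFailing = clusterByCoFailure(sampleRuns);
```

Exact-match grouping misses tests that usually, but not always, fail together. It is still enough to surface the obvious shared-resource suspects before reaching for anything statistical.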
Retry, quarantine, then fix — in that order
Auto-retry is a confirmation tool, not a cure. Used alone it hides the symptom and lets flakiness compound. The durable pattern — the one GitHub and Atlassian both run — is a pipeline: retry to confirm the failure is flaky, quarantine the test so it stops blocking merges, then fix the root cause and return it to the suite. Atlassian's Flakinator quarantines flakes across 12+ products and processes more than 350 million test executions per day; they trace roughly 15% of Jira backend build failures to flakiness, wasting over 150,000 developer hours per year. That is the cost of skipping the fix step.
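The ordering can be written down as a small decision function. This is a sketch of the pattern only, not how GitHub's or Atlassian's tooling is built: `triage`, `quarantine`, the verdict names, and the `fixTicket` field are all hypothetical.

```typescript
type Result = 'pass' | 'fail';
type Verdict = 'merge' | 'block_real_failure' | 'quarantine_and_fix';

// Retry confirms flakiness: a failure that passes on any retry is flaky,
// while a failure that never passes is treated as a real regression.
function triage(firstRun: Result, retries: Result[]): Verdict {
  if (firstRun === 'pass') return 'merge';
  const anyRetryPassed = retries.some((r) => r === 'pass');
  return anyRetryPassed ? 'quarantine_and_fix' : 'block_real_failure';
}

interface QuarantineEntry {
  testId: string;
  quarantinedAt: string; // ISO timestamp, so stale entries are easy to spot
  fixTicket: string;     // quarantine without a ticket is how the debt compounds
}

// Quarantining always records where the fix is tracked.
function quarantine(testId: string, fixTicket: string, now: Date): QuarantineEntry {
  return { testId, quarantinedAt: now.toISOString(), fixTicket };
}
```

Requiring a ticket at quarantine time is one way to keep the third step from being skipped: a quarantined test with no owner is visible in the data rather than silently parked.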
And the problem is growing, not shrinking. The TestDino 2026 benchmark — drawing on Bitrise Mobile Insights across 10M+ builds from January 2022 to June 2025 — found the share of teams hitting test flakiness rose from 10% in 2022 to 26% in 2025. A 2026 GitHub Actions study of 1,960 Java projects independently found 3.2% of builds get rerun, 67.7% of those reruns are flaky, and flakiness affects 51% of projects — with network and dependency-resolution issues among the top causes, echoing the clustering finding. Quarantine-without-fix is a debt that compounds.
Let an AI agent triage the class from the rebuilt run
Here is the part the standard playbook stops short of. Categorize, replay, quarantine, fix — every guide teaches that loop as human work. But the three-way replay produces something an agent can read: a rebuilt failing run with DOM state, console errors, network requests, and timestamps lined up on one timeline. Expose that as structured context over MCP, and an AI coding agent (Claude Code, Cursor) can inspect the timeline directly instead of parsing a wall of log text.
This is BugMojo's wedge. The browser extension captures the failing run — session replay, console, and network — and the MCP server hands it to the agent as evidence, not prose. The agent reads whether the failure looks like an async-wait (an assertion firing before a network call settles), an order flake (state bleeding from a prior test), or an environment flake (a timing or locale difference), and drafts a targeted fix: await the settled state, isolate the shared setup, mock the unstable dependency. The same three-scenario logic GitHub ran by hand becomes evidence the agent classifies for you.
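To make "evidence, not prose" concrete, here is a sketch of the kind of check a timeline makes possible. The `TimelineEvent` shape and `looksLikeAsyncWait` are hypothetical and are not BugMojo's schema or any MCP-defined format. The heuristic flags a run where an assertion failed while a network request was still in flight.

```typescript
// Hypothetical event shape for a rebuilt run; real captures will carry more fields.
type TimelineEvent =
  | { kind: 'network_start'; atMs: number; requestId: string }
  | { kind: 'network_end'; atMs: number; requestId: string }
  | { kind: 'assertion_failed'; atMs: number; message: string };

// True when some request had started but not finished at the moment the assertion failed.
function looksLikeAsyncWait(events: TimelineEvent[]): boolean {
  const failure = events.find((e) => e.kind === 'assertion_failed');
  if (failure === undefined) return false;
  const failedAt = failure.atMs;

  // Requests that finished at or before the failure are not suspects.
  const settled = new Set<string>();
  events.forEach((e) => {
    if (e.kind === 'network_end' && e.atMs <= failedAt) settled.add(e.requestId);
  });

  return events.some(
    (e) => e.kind === 'network_start' && e.atMs <= failedAt && !settled.has(e.requestId),
  );
}
```

A positive result is a hint rather than a verdict: it tells the agent where to look first, and the drafted fix (awaiting the settled state) still has to be checked against the test's intent.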
| Feature | CI flaky-test analytics | Session-replay tools | BugMojo |
|---|---|---|---|
| Historical flaky-rate detection & dashboards | ✓ | — | early |
| Auto-quarantine across many repos | ✓ | — | partial |
| Rebuilt run: DOM + console + network on one timeline | logs only | ✓ | ✓ |
| Captures the failing run with one click | — | ✓ | ✓ |
| Failing run exposed to an AI agent over MCP | — | — | ✓ |
| Agent classifies flake class & drafts the fix | — | — | ✓ |
"If you can replay the failing run, you can name the class. And once you can name the class, the flake is already half fixed."
BugMojo engineering
Install the free BugMojo extension to capture the failing run — replay, console, and network — and let an AI agent triage the flake class over MCP. No project setup required.
Sources
- Luo, Hariri, Eloussi, Marinov — "An Empirical Analysis of Flaky Tests" (the canonical 10-category taxonomy) — ACM SIGSOFT FSE 2014, University of Illinois (2014)
- Parry, Kapfhammer, Hilton, McMinn — "Systemic Flakiness: An Empirical Analysis of Co-Occurring Flaky Test Failures" — arXiv / EASE 2025 (2025-04-23)
- "Reducing flaky builds by 18x" — GitHub's three rerun scenarios and before/after flaky-commit rate — The GitHub Blog (2020-12-16)
- "Where do our flaky tests come from?" — 4.2M tests analyzed; flakiness scales with test size — Google Testing Blog (2017-04-17)
- "Taming Test Flakiness: How We Built a Scalable Tool to Detect and Manage Flaky Tests" (Flakinator) — Inside Atlassian (Engineering) (2025)
- Ge & Zhang — "Understanding and Detecting Flaky Builds in GitHub Actions" (1,960 Java projects) — arXiv, Soochow University (2026-02-02)
- "Flaky Test Benchmark Report 2026: Rates, Root Causes, and Cost Implications" — TestDino (2026)

