Frequently asked questions

How do I tell if a failing test is flaky or a real bug?

Re-run the exact same commit without changing code. If it passes on retry, it is flaky by definition: a failure that is not caused by a code change. To find the class, re-run it three ways — immediately in the same process, again after shifting the clock or with a delay, and once on a different machine. Passing only on a different host points to an environment problem; passing only after a delay points to a timing or async-wait problem; failing only after other tests run points to test-order dependency.
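When a test fails only after other tests have run, one common follow-up is to bisect the tests that ran before it until the one leaving state behind is isolated. The sketch below assumes a single polluting test and a deterministic failure once that test has run; `victimFails` is a hypothetical callback you would implement on top of your own runner.

```typescript
// Hypothetical callback: runs `before` in order, then the failing test,
// in a fresh process, and resolves to true if the failing test fails.
type VictimFails = (before: string[]) => Promise<boolean>;

// Bisect the tests that ran before the victim to find a single polluter.
// Assumes exactly one earlier test is responsible and that the failure is
// deterministic once that test has run.
async function findPolluter(
  earlierTests: string[],
  victimFails: VictimFails,
): Promise<string | null> {
  // If the victim passes even after every earlier test, nothing here pollutes it.
  if (earlierTests.length === 0 || !(await victimFails(earlierTests))) return null;

  let candidates = earlierTests;
  while (candidates.length > 1) {
    const mid = Math.floor(candidates.length / 2);
    const firstHalf = candidates.slice(0, mid);
    // Keep whichever half still makes the victim fail.
    candidates = (await victimFails(firstHalf)) ? firstHalf : candidates.slice(mid);
  }
  return candidates[0] ?? null;
}
```

Each step halves the candidate list, so a few hundred earlier tests need roughly eight to ten reruns rather than one per test.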

What are the main categories of flaky tests?

The widely cited Luo et al. taxonomy lists ten, but three dominate: async/wait timing problems (a test checks a result before an async operation finishes), concurrency issues (race conditions, deadlocks, shared mutable state), and test-order dependency (a test passes alone but fails when an earlier test leaves dirty global or database state). Environmental causes — CI runner differences, resource limits, network latency, timezone and locale — are the other big bucket. Naming the category first is what makes a flake fixable instead of just re-runnable.

Why do my tests pass locally but fail in CI?

That gap is almost always environmental or order-related. CI runs on different hardware, with tighter CPU and memory limits, a different timezone or locale, slower or unreliable network, and a different test execution order or parallelism than your laptop. Reproduce it by running the suite in the same order CI uses (many runners randomize or shard), under the same resource constraints, and with the same environment variables. Capturing the failing run — its console output, network calls, and timing — lets you compare CI against local instead of guessing.
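Comparing CI against local is easier when both environments are written down in the same shape. The following is a minimal sketch, assuming you can dump the relevant settings on each side into a flat key/value snapshot; the keys used here (`TZ`, `LANG`, `WORKERS`, `NODE_ENV`) are illustrative choices, not a standard format.

```typescript
// A flat snapshot of settings that often differ between CI and a laptop.
// The keys are illustrative, not a standard format.
type EnvSnapshot = Record<string, string>;

interface EnvDifference {
  key: string;
  ci: string | undefined;
  local: string | undefined;
}

// Return every key whose value differs, including keys present on one side only.
function diffEnvironments(ci: EnvSnapshot, local: EnvSnapshot): EnvDifference[] {
  const keys = Object.keys({ ...ci, ...local }).sort();
  return keys
    .filter((key) => ci[key] !== local[key])
    .map((key) => ({ key, ci: ci[key], local: local[key] }));
}

const differences = diffEnvironments(
  { TZ: 'UTC', LANG: 'C', WORKERS: '2', NODE_ENV: 'test' },
  { TZ: 'Europe/Berlin', LANG: 'en_US.UTF-8', WORKERS: '8', NODE_ENV: 'test' },
);
// differences contains LANG, TZ, and WORKERS; NODE_ENV matches and is omitted.
```

A short list like this gives you concrete variables to pin locally one at a time until the CI failure reproduces.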

Can an AI agent help debug flaky tests?

Yes, if it can read the failing run as structured context rather than a wall of log text. When the rebuilt run — DOM state, console errors, network requests, and timestamps — is exposed over a protocol like MCP, an AI coding agent such as Claude Code or Cursor can inspect the timeline, recognize whether the failure looks like an async-wait, order, or environment flake, and propose a targeted fix such as awaiting a settled state or isolating shared setup. The agent triages the class from evidence instead of a human re-tracing the run.

How long should I spend before quarantining a flaky test?

Set a hard time-box, usually one focused debugging session, then quarantine if you have not found the root cause. The goal of quarantine is to unblock the pipeline, not to forget the test. Move the flake to a non-blocking lane, file a ticket with the captured failing run attached, and schedule the fix. Leaving a flake red and blocking trains the team to ignore failures, which is how a real regression eventually ships unnoticed.

Debugging Flaky Tests: A Field Guide to Finding the Real Cause

Should I just retry flaky tests automatically in CI?

Auto-retry hides the symptom and lets flakiness accumulate; it should be a stopgap, not the cure. The sustainable pattern, used by teams like GitHub and Atlassian, is to retry to confirm a failure is flaky, then quarantine the test so it stops blocking merges, then fix the root cause and return it to the suite. Atlassian's tooling quarantines flakes and tracks them; blanket retries without quarantine-and-fix just grow the problem, and flaky-test prevalence has been rising year over year.

Flaky tests are not random. Three named classes — async-wait timing, concurrency, and test-order dependency — cover roughly 77% of them. Categorize the failure, replay the run three ways, and the fix stops being a guess.

Manvi · Jun 5, 2026 · 8 min read

[Image: isometric line-art of one failing test run replayed across timing, order, and environment tracks converging on a lime replay node]

A test that fails on Tuesday and passes on Wednesday with no code change is not telling you the code is broken. It is telling you the test made an assumption that does not always hold. The instinct is to hit re-run until it goes green. That works once and rots the suite forever: the flake stays, more pile up behind it, and eventually a real regression hides inside the noise.

The faster path is diagnosis, and flakiness is far more diagnosable than its reputation suggests. The failures fall into a small number of named classes, and each class has a tell you can force out with a deliberate re-run. This guide walks the categories, shows how to replay a failing run to identify the class, and ends on feeding that rebuilt run to an AI agent so it can triage the class for you.

What actually causes flaky tests?

Flaky tests are dominated by three named classes: async-wait timing problems at about 45%, concurrency and race conditions at about 20%, and test-order dependency at about 12%. Together they cover roughly 77% of cases in the canonical ten-category taxonomy. Categorizing the failure first is the single fastest route to a real fix instead of an endless re-run.

The reference for this is still Luo et al., An Empirical Analysis of Flaky Tests (FSE 2014), which classified 201 flaky-fix commits across Apache projects into ten root-cause categories. The long tail matters, but the head is what you triage first. An async-wait flake checks a result before an asynchronous operation has finished. A concurrency flake is a race condition, deadlock, or shared mutable state between threads. A test-order-dependency flake passes alone but fails when an earlier test leaves dirty global or database state behind.
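To make the async-wait class concrete, here is a minimal sketch of the usual repair: poll for the settled state instead of asserting immediately or sleeping for a guessed duration. `waitFor`, `saveRecord`, and `testSaveRecord` are hypothetical names written for this example; most test frameworks ship their own waiting helpers, which you would normally prefer.

```typescript
// Poll a condition until it holds or a deadline passes, instead of asserting
// immediately or sleeping for a fixed, guessed duration.
async function waitFor(
  condition: () => boolean,
  timeoutMs = 1000,
  intervalMs = 10,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (!condition()) {
    if (Date.now() > deadline) throw new Error(`condition not met within ${timeoutMs} ms`);
    await new Promise<void>((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Hypothetical system under test: the save completes after a delay.
function saveRecord(store: { saved: boolean }, delayMs: number): void {
  setTimeout(() => { store.saved = true; }, delayMs);
}

async function testSaveRecord(): Promise<boolean> {
  const store = { saved: false };
  saveRecord(store, 30);
  // Asserting store.saved on this line would run before the timer fires.
  await waitFor(() => store.saved); // wait for the settled state instead
  return store.saved;
}
```

The timeout still bounds the test, but the assertion no longer depends on how fast the machine happens to be.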

Top flaky-test root causes (share of the Luo et al. taxonomy)

Root cause	Share
Async wait (timing)	45%
Concurrency / race	20%
Test-order dependency	12%
Other (resource, network, randomness, etc.)	23%

Source: Luo et al., "An Empirical Analysis of Flaky Tests," FSE 2014 (201 flaky-fix commits)

Two things follow from that distribution. First, most flakes are timing or ordering problems, not deep mysteries — they are fixable once named. Second, the categories map cleanly onto three ways of re-running a failure, which is the trick the next section turns into a procedure.

Replay the failing run three ways

You do not classify a flake by staring at the log. You force the class out by re-running the same failing commit under three deliberately different conditions and seeing which one makes it pass. This is precisely the method GitHub used to cut flaky builds 18x — they reran each failure in the same process, in the same process shifted into the future, and on a different host. Those three scenarios identified 90% of flaky failures automatically and dropped flaky-build commits from about 9% (1 in 11) to under 0.5% (1 in 200).

Rerun scenario	How to run it	What it catches	Flake class
Same process, immediately	Identical env, re-execute now	Code-level randomness, a race that resolves on a second pass	Concurrency / race
Same process, time-shifted	Move the clock forward or inject a delay	Bad time assumptions, premature assertions before async work settles	Async wait (timing)
Different host	Re-run on a clean, separate machine	Shared state, leftover fixtures, env / resource / locale differences	Order or environment

GitHub's three rerun scenarios map onto the dominant flake classes. Run all three; the one that flips the result names the class.

The logic is subtractive. Passes only when re-run on a different host? Something on the original machine was dirty — leftover state from a prior test (order dependency) or an environment difference. Passes only after a delay or clock shift? The test asserted before the async operation finished (async-wait). Passes intermittently in the same process with nothing else changed? A race condition. You can wire this into CI as a small retry harness rather than running it by hand.

flake-classifier.ts

// Re-run a single failing test three ways to classify the flake.
// Mirrors GitHub's same-process / time-shifted / different-host scenarios.

type Result = 'pass' | 'fail';
type FlakeClass =
  | 'concurrency_race'
  | 'async_wait_timing'
  | 'order_or_environment'
  | 'consistent_failure';

async function runOnce(testId: string, opts: {
  clockOffsetMs?: number; // shift time to expose async-wait bugs
  host?: 'same' | 'fresh'; // fresh host exposes order/env state
}): Promise<Result> {
  // ...invoke your runner (jest/vitest/playwright) with these conditions.
  // Placeholder so the sketch compiles; return the real outcome here.
  return 'pass';
}

export async function classify(testId: string): Promise<FlakeClass> {
  // The test has already failed once; each rerun changes one condition.
  const sameProcess = await runOnce(testId, { host: 'same' });
  const timeShifted = await runOnce(testId, { host: 'same', clockOffsetMs: 60_000 });
  const freshHost   = await runOnce(testId, { host: 'fresh' });

  // The scenario that flips the result to a pass names the class.
  if (sameProcess === 'pass') return 'concurrency_race';     // passed with nothing changed
  if (freshHost === 'pass')   return 'order_or_environment'; // passed only on a clean host
  if (timeShifted === 'pass') return 'async_wait_timing';    // passed only after a time shift
  return 'consistent_failure'; // fails everywhere: treat it as a real bug, not a flake
}
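Because a flake is intermittent, a single rerun per scenario can miss it. One hedge is to repeat each scenario several times and look at the failure rate rather than one outcome. This is a sketch under that assumption; `run` is a hypothetical callback that executes the test once under one scenario, and the number of repetitions is yours to choose.

```typescript
type Outcome = 'pass' | 'fail';

interface ScenarioStats {
  runs: number;
  failures: number;
  failureRate: number; // failures / runs, between 0 and 1
}

// Repeat one rerun scenario several times and summarize the outcomes.
// `run` is a hypothetical callback that executes the test once.
async function repeatScenario(
  run: () => Promise<Outcome>,
  runs: number,
): Promise<ScenarioStats> {
  if (runs <= 0) throw new Error('runs must be positive');
  let failures = 0;
  for (let i = 0; i < runs; i++) {
    // Run sequentially so repetitions do not interfere with each other.
    if ((await run()) === 'fail') failures++;
  }
  return { runs, failures, failureRate: failures / runs };
}
```

A scenario whose failure rate drops to zero while the others stay above zero is a stronger signal than a single flipped result.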

Where to look first: big, shared, networked tests

Not all tests are equally suspect. Google's analysis of 4.2 million tests found a near-linear relationship between a test's size and its flakiness rate: small unit tests are rarely flaky, while large integration and emulator tests are the worst offenders. The blunt heuristic, "the larger the test, the more likely it is to be flaky," is a genuinely useful triage filter. Sort your flaky failures by test size and dependency count, and start at the top.
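The triage sort described above can be sketched in a few lines. The `size` label and `dependencyCount` field are assumptions about what your runner or test metadata can report; substitute whatever proxy for test size you actually have.

```typescript
interface FlakyTest {
  name: string;
  size: 'small' | 'medium' | 'large'; // hypothetical size label from your runner
  dependencyCount: number;            // hypothetical count of external services touched
}

const SIZE_RANK = { small: 0, medium: 1, large: 2 };

// Largest tests first; ties broken by dependency count, descending.
function triageOrder(tests: FlakyTest[]): FlakyTest[] {
  return [...tests].sort(
    (a, b) =>
      SIZE_RANK[b.size] - SIZE_RANK[a.size] || b.dependencyCount - a.dependencyCount,
  );
}

const queue = triageOrder([
  { name: 'unit-parse', size: 'small', dependencyCount: 0 },
  { name: 'api-checkout', size: 'medium', dependencyCount: 3 },
  { name: 'e2e-signup', size: 'large', dependencyCount: 2 },
  { name: 'e2e-payment', size: 'large', dependencyCount: 5 },
]);
// queue starts with e2e-payment, then e2e-signup, api-checkout, unit-parse.
```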

There is a second filter: flakes cluster. Parry et al. (EASE 2025) re-ran 10,000 test-suite executions across 24 Java projects and found that 75% of the 810 flaky tests belonged to a co-occurring failure cluster with a mean size of 13.5 tests. Stack-trace inspection pinned intermittent networking and unstable external dependencies as the predominant shared cause. The implication is leverage: fix one shared resource — a flaky test container, an unmocked third-party call — and a dozen flakes go green at once.
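To look for clusters in your own history, one simple approach is to link tests whose failures land in mostly the same suite runs and then read off the connected groups. The sketch below is a simplified stand-in, not the clustering procedure from the paper; the `FailureLog` shape and the 0.5 threshold are assumptions.

```typescript
// Map of test name -> IDs of the suite runs in which that test failed.
type FailureLog = Record<string, number[]>;

// Jaccard similarity of two sets of run IDs: |A ∩ B| / |A ∪ B|.
function jaccard(a: number[], b: number[]): number {
  const setA = new Set(a);
  const setB = new Set(b);
  let shared = 0;
  setA.forEach((run) => { if (setB.has(run)) shared++; });
  const union = setA.size + setB.size - shared;
  return union === 0 ? 0 : shared / union;
}

// Link tests whose failing runs overlap at or above `threshold`, then return
// the connected groups. The threshold is an assumption to tune.
function clusterFlakyTests(log: FailureLog, threshold = 0.5): string[][] {
  const names = Object.keys(log);
  const parent = names.map((_, i) => i); // union-find forest over test indices
  const find = (i: number): number =>
    parent[i] === i ? i : (parent[i] = find(parent[i]));

  for (let i = 0; i < names.length; i++) {
    for (let j = i + 1; j < names.length; j++) {
      if (jaccard(log[names[i]], log[names[j]]) >= threshold) parent[find(i)] = find(j);
    }
  }

  const groups = new Map<number, string[]>();
  names.forEach((name, i) => {
    const root = find(i);
    groups.set(root, [...(groups.get(root) ?? []), name]);
  });
  return Array.from(groups.values());
}
```

Groups with more than one member are the places to look for a shared cause such as an unstable dependency.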

Retry, quarantine, then fix — in that order

Auto-retry is a confirmation tool, not a cure. Used alone it hides the symptom and lets flakiness compound. The durable pattern — the one GitHub and Atlassian both run — is a pipeline: retry to confirm the failure is flaky, quarantine the test so it stops blocking merges, then fix the root cause and return it to the suite. Atlassian's Flakinator quarantines flakes across 12+ products and processes more than 350 million test executions per day; they trace roughly 15% of Jira backend build failures to flakiness, wasting over 150,000 developer hours per year. That is the cost of skipping the fix step.
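The retry, quarantine, fix order can be written down as a small decision function. This is an illustrative sketch only: the status names and report fields are assumptions for the example and do not describe how GitHub's or Atlassian's tooling is implemented.

```typescript
type TestStatus = 'blocking' | 'quarantined';

interface FailureReport {
  passedOnRetry: boolean;  // did an unchanged rerun of the same commit pass?
  rootCauseFixed: boolean; // has a fix for the root cause been merged?
}

// Decide the next status for a test after a failure, following
// retry -> quarantine -> fix. Names and fields are illustrative.
function nextStatus(current: TestStatus, report: FailureReport): TestStatus {
  if (current === 'quarantined') {
    // A quarantined test returns to the blocking suite only once it is fixed.
    return report.rootCauseFixed ? 'blocking' : 'quarantined';
  }
  // A failure that does not pass on retry is treated as real and stays blocking.
  return report.passedOnRetry ? 'quarantined' : 'blocking';
}
```

Note that a retry alone never returns a quarantined test to the suite; only the fix does, which is the point of the ordering.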

And the problem is growing, not shrinking. The TestDino 2026 benchmark — drawing on Bitrise Mobile Insights across 10M+ builds from January 2022 to June 2025 — found the share of teams hitting test flakiness rose from 10% in 2022 to 26% in 2025. A 2026 GitHub Actions study of 1,960 Java projects independently found 3.2% of builds get rerun, 67.7% of those reruns are flaky, and flakiness affects 51% of projects — with network and dependency-resolution issues among the top causes, echoing the clustering finding. Quarantine-without-fix is a debt that compounds.

Share of teams experiencing test flakiness, 2022 vs 2025

Year	Share of teams
2022	10%
2025	26%

Source: TestDino 2026 benchmark, citing Bitrise Mobile Insights (10M+ builds, Jan 2022–Jun 2025)

Let an AI agent triage the class from the rebuilt run

Here is the part the standard playbook stops short of. Categorize, replay, quarantine, fix — every guide teaches that loop as human work. But the three-way replay produces something an agent can read: a rebuilt failing run with DOM state, console errors, network requests, and timestamps lined up on one timeline. Expose that as structured context over MCP, and an AI coding agent (Claude Code, Cursor) can inspect the timeline directly instead of parsing a wall of log text.

This is BugMojo's wedge. The browser extension captures the failing run — session replay, console, and network — and the MCP server hands it to the agent as evidence, not prose. The agent reads whether the failure looks like an async-wait (an assertion firing before a network call settles), an order flake (state bleeding from a prior test), or an environment flake (a timing or locale difference), and drafts a targeted fix: await the settled state, isolate the shared setup, mock the unstable dependency. The same three-scenario logic GitHub ran by hand becomes evidence the agent classifies for you.
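To illustrate what "evidence, not prose" might look like, here is a sketch of one check an agent could run against a structured run. The `RebuiltRun` shape is hypothetical: it is not BugMojo's schema or any MCP server's real output, and the check is a hint toward the async-wait class, not proof.

```typescript
// Hypothetical shape of a rebuilt failing run. This is not BugMojo's schema
// or any MCP server's real output; it only illustrates the idea.
interface NetworkRequest {
  url: string;
  startedAtMs: number;
  settledAtMs: number | null; // null if the request never completed
}

interface RebuiltRun {
  assertionFailedAtMs: number;
  requests: NetworkRequest[];
}

// Requests that were still in flight when the assertion failed. A non-empty
// result hints (but does not prove) that the test asserted before async work settled.
function requestsInFlightAtFailure(run: RebuiltRun): NetworkRequest[] {
  return run.requests.filter(
    (req) =>
      req.startedAtMs <= run.assertionFailedAtMs &&
      (req.settledAtMs === null || req.settledAtMs > run.assertionFailedAtMs),
  );
}

const pending = requestsInFlightAtFailure({
  assertionFailedAtMs: 1200,
  requests: [
    { url: '/api/session', startedAtMs: 100, settledAtMs: 300 },
    { url: '/api/cart', startedAtMs: 1000, settledAtMs: 1450 },
  ],
});
// pending contains only the /api/cart request, which settled after the assertion.
```

With timestamps on one timeline, this kind of question becomes a filter over data rather than a search through log text.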

Feature	CI flaky-test analytics	Session-replay tools	BugMojo
Historical flaky-rate detection & dashboards	✓	—	early
Auto-quarantine across many repos	✓	—	partial
Rebuilt run: DOM + console + network on one timeline	logs only	✓	✓
Captures the failing run with one click	—	✓	✓
Failing run exposed to an AI agent over MCP	—	—	✓
Agent classifies flake class & drafts the fix	—	—	✓

Honest two-sided view. Mature flaky-test analytics platforms beat BugMojo on historical detection and dashboards today; the agent-readable replay is the row none of them have.

If you can replay the failing run, you can name the class. And once you can name the class, the flake is already half fixed.
BugMojo engineering

Sources

Luo, Hariri, Eloussi, Marinov — "An Empirical Analysis of Flaky Tests" (the canonical 10-category taxonomy) — ACM SIGSOFT FSE 2014, University of Illinois (2014)
Parry, Kapfhammer, Hilton, McMinn — "Systemic Flakiness: An Empirical Analysis of Co-Occurring Flaky Test Failures" — arXiv / EASE 2025 (2025-04-23)
"Reducing flaky builds by 18x" — GitHub's three rerun scenarios and before/after flaky-commit rate — The GitHub Blog (2020-12-16)
"Where do our flaky tests come from?" — 4.2M tests analyzed; flakiness scales with test size — Google Testing Blog (2017-04-17)
"Taming Test Flakiness: How We Built a Scalable Tool to Detect and Manage Flaky Tests" (Flakinator) — Inside Atlassian (Engineering) (2025)
Ge & Zhang — "Understanding and Detecting Flaky Builds in GitHub Actions" (1,960 Java projects) — arXiv, Soochow University (2026-02-02)
"Flaky Test Benchmark Report 2026: Rates, Root Causes, and Cost Implications" — TestDino (2026)