Why can't a developer reproduce my bug even though the steps look complete?

Usually because the steps capture what you did but not the state you were in. A large empirical study of 576 non-reproducible bug reports found the most common causes are insufficient information in the report and inter-environmental differences: browser version, OS, viewport, feature flags, cached data, account permissions, or a race between two network calls that only loses on a slow connection. Written steps rarely encode any of that. A session replay that carries the DOM, console, and network of the actual failing session removes the guesswork because the environment travels with the report.

What is the difference between steps to reproduce and a session replay?

Steps to reproduce are your reconstruction of what happened, written from memory after the fact. A session replay is the recording of what actually happened: every DOM mutation, console error, and network request, lined up on one timeline. Steps tell the developer where to click; a replay shows them the click, the failed render, and the stack trace at the same timestamp. They are complementary. The strongest reports lead with a short numbered repro for human scanning and attach the replay as the ground-truth backup when the steps alone don't reproduce.

Can an AI coding agent use steps to reproduce to fix a bug automatically?

Increasingly, yes, but prose steps are a weak interface for a machine. Tools like Chrome DevTools MCP (released September 2025) already let an agent navigate a page, fill forms, click, and inspect console and network to reproduce an issue itself. The bigger unlock is exposing the captured reproduction (replay, console, network, environment) as structured context over the Model Context Protocol, so an agent like Claude Code or Cursor reads the exact failing session and drafts a fix without a human re-tracing the steps. Plain-English steps still help, but machine-readable repro context is what agents act on reliably.

What should I do when a bug only happens sometimes and I can't reproduce it?

Stop trying to write deterministic steps for a non-deterministic bug, and capture instead of reconstruct. Intermittent failures (races, timing, cache state, flaky network) are exactly the class that 'steps to reproduce' fails to convey, which is why they dominate non-reproducible bug studies. Keep a session recorder running so that when the bug next fires you already have the DOM, console, and network for that specific occurrence. File the report the moment it recurs with that captured artifact attached, and note the conditions you suspect (slow connection, second tab open, just after login) so the triager can narrow the environment.


Steps to Reproduce: The Skill That Separates Good Bug Reports From Ignored Ones

How detailed should steps to reproduce be?

Detailed enough that someone with zero context reproduces the bug on the first try. In practice that means 4-8 numbered steps that start from an explicit known state (logged out, fresh tab, a specific URL), name unambiguous click targets, and state expected-versus-actual at the exact step that fails. If a step needs data setup, link the seed or test account rather than describing it vaguely. Over-detail ('I was also playing music') is as useless as under-detail ('checkout broke'). The test is reproducibility, not word count.

Steps to reproduce is the one bug-report skill worth drilling: the right altitude, an explicit known state, expected-vs-actual at the failing step, and, for intermittent bugs, a captured replay that agents can act on.

Manvi · Jun 5, 2026 · 10 min read



Every other field on a bug report is negotiable. A vague title gets renamed in triage; a missing screenshot gets requested in a comment. The steps to reproduce are different — when they are wrong, the bug does not get fixed, it gets closed. A developer who cannot make the bug happen on their machine has two moves: ask you for more information, or mark it 'works for me' and move on. Both cost days.

This guide is about that one skill in isolation. Not the whole report template (that lives in the six-field bug report companion), and not how replay serialization works under the hood (covered in session replay for debugging). Just the writing: how to pitch the altitude, anchor the starting state, mark the exact failure, and — the part most guides skip — what to do for the bugs that refuse to reproduce on command.

What makes steps to reproduce good?

Good steps to reproduce let someone with zero context reproduce the bug on the first try. They start from an explicit known state, use 4-8 numbered steps with unambiguous click targets, and state expected-versus-actual at the exact step that fails. The measure is reproducibility, not word count. When steps alone cannot encode the environment, a captured replay carries it instead.

That definition is doing more work than it looks. 'Zero context' rules out shorthand only you understand. 'First try' rules out steps that work two times in five. 'Explicit known state' is the rule everyone forgets, and it is the one that breaks reproduction most often. We will take each apart below — but first, the evidence that the repro gap is real and expensive.

The repro gap is the most common reason bugs die

The strongest data on this is academic, not vendor marketing. A 2022 empirical study in Empirical Software Engineering examined 576 non-reproducible bug reports — 250 from Mozilla Firefox Core and 326 from Eclipse JDT — and identified 11 distinct factors behind non-reproducibility. The two most common were insufficient information in the report and inter-environmental differences. When a bug landed as non-reproducible, developers responded one of two ways: close it outright, or solicit more information through long, counter-productive manual searches.

Read that back. The single most common thing missing when a bug dies as 'works for me' is the thing this page is about — a usable reproduction. Not a cleverer fix, not a better stack trace. The steps.

One documented field case, a bug that took four days to resolve although the fix itself took thirty minutes (QA meets AI, March 2026), is one team's anecdote, not a survey; treat it as illustrative. But it names the failure mode precisely: the fix was cheap; the round-trips to reach a reproducible state were not. Tricentis's 2025 Quality Transformation Report (2,700+ software-delivery practitioners) frames the same loop at industry scale: 33% of organizations point specifically to poor communication and weak feedback loops between developers and testers as a top quality problem, while 40% say poor software quality costs them $1M or more annually. The dev/tester feedback loop is exactly where steps to reproduce live.

Where the four-day bug actually went (one documented case)

Phase	Minutes-equivalent
Clarification + back-and-forth	230
Waiting / reassignment	100
Actual fix	30

Values are minutes-equivalent, derived from the documented case (2.3 average clarification rounds, a thirty-minute fix). The shape is the point: the fix is a sliver; the repro gap is the bar. Source: QA meets AI, March 2026.

The four rules of a reproducible repro

1. Get the altitude right

Altitude is the level of detail. Fly too high ('checkout broke') and the developer has no path to start. Fly too low ('I was on the 14:32 train, also playing music, and I think I'd had this tab open since Monday') and you have buried the one relevant detail in ten irrelevant ones. The non-reproducibility study names insufficient information as a top cause, and an over-described report fails for the same reason an under-described one does: neither lets a stranger retrace the path.

The right altitude is the level at which each step is an action another person can perform identically. 'Add an item to the cart' is the right altitude. 'Broke' is too high. 'Move the mouse to coordinate 412, 290' is too low.

2. Anchor an explicit known state

This is the rule that separates steps that reproduce from steps that look complete and do not. Most repros silently assume the state the reporter happened to be in: logged in as an admin, a cart with two items already in it, a feature flag flipped, a stale cache. The reader starts from a different state and the bug never fires.

Begin from a state anyone can recreate. Logged out, or logged in as a named test role. A fresh tab. A specific URL, not 'the dashboard'. If the bug needs data, link the seed script or the test account rather than describing it. The empirical record is blunt here: inter-environmental differences were among the top causes of non-reproducibility, and an unstated starting state is the cheapest one to eliminate.

good-repro.txt

Environment: Chrome 137, macOS 14.5, viewport 1440x900, build a1b2c3d

1. Sign in as the seeded free-tier user (login: qa+free@example.com / see 1Password)
2. Open https://app.example.com/projects/new   ← fresh tab, not from the dashboard
3. Type "Test" in the Name field
4. Click Save
   → Expected: project is created, redirect to /projects/<id>
   → Actual:   Save button spins indefinitely; no network request fires

Notes: only reproduces on free-tier; paid accounts redirect correctly.

Notice what the block encodes that prose usually drops: the explicit account and tier, the fresh-tab caveat inline at step 2, and a boundary condition (free-tier only) that tells the triager where not to look. That last line alone can save a clarification round.

3. State expected vs actual at the failing step

Put the expectation where the failure happens, not in a separate paragraph at the bottom. The two-line → Expected / → Actual pattern, attached to the exact step that breaks, does two jobs. It pins the failure to one action instead of a vague 'somewhere in this flow'. And it forces you to articulate the feature's intent, which is how 'this is actually working as designed' disagreements get caught before they become ticket ping-pong.

4. Use 4-8 steps, no more

Granularity is a forcing function. If your repro runs to fifteen steps, either it is at too low an altitude (collapse them) or it is actually two bugs (split them). Four to eight numbered steps is the band where a repro stays scannable and still starts from a known state. One ticket, one bug, one lifecycle.

When steps to reproduce are not enough: capture, don't reconstruct

Here is the limit of the skill. Written steps record what you did. They cannot record the state you were in: the browser build, the viewport, the feature-flag matrix, the cache, the account permissions, or a race between two requests that only loses on a slow connection. That is the class the non-reproducibility study files under inter-environmental differences, one of the two most common causes it found, and no amount of careful prose fixes it, because the missing information was never in your fingers to begin with.

For that class, stop reconstructing and start capturing. A session replay records the DOM, console, and network of the actual failing session and lines them up on one timeline, so the environment travels with the report. Steps tell the developer where to click; the replay shows them the click, the failed render, and the stack trace at the same timestamp. They are complementary — lead with a short numbered repro for human scanning, attach the replay as the ground truth for when the steps alone fall short.

Feature	Written steps	Captured replay
Scannable in five seconds	✓	partial
Works for the human triager	✓	✓
Encodes the exact environment	—	✓
Survives an intermittent / race bug	—	✓
Carries console + network at the failure	—	✓
Readable by an AI agent over MCP	text only	✓

Neither column wins outright, which is the honest takeaway. Written steps are unbeatable for a human scanning a queue — a developer reads four numbered lines faster than they scrub a recording. The replay wins everywhere the environment matters and everywhere a machine is the reader. The strongest reports carry both.
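To make "capture, don't reconstruct" concrete, here is a minimal Python sketch of the idea behind an always-on recorder: keep a bounded buffer of recent events and copy it out at the moment the bug is reported. It is an illustration only, not how rrweb or any particular replay tool is implemented; `Event`, `RollingRecorder`, and the event labels are hypothetical names invented for this example.

```python
from collections import deque
from dataclasses import dataclass, field
import time


@dataclass
class Event:
    # Hypothetical event shape; "kind" labels are assumptions for this sketch.
    kind: str          # e.g. "dom", "console", "network"
    detail: str
    at: float = field(default_factory=time.time)


class RollingRecorder:
    """Keeps only the most recent events so recording can stay on all the time."""

    def __init__(self, max_events: int = 500) -> None:
        # deque(maxlen=...) silently drops the oldest entry once the buffer is full.
        self._events = deque(maxlen=max_events)

    def record(self, kind: str, detail: str) -> None:
        self._events.append(Event(kind, detail))

    def snapshot(self) -> list:
        """Copy the buffer at the moment the bug is reported."""
        return list(self._events)


# Tiny buffer so the eviction is visible: four events go in, three survive.
recorder = RollingRecorder(max_events=3)
recorder.record("dom", "click Save")
recorder.record("network", "no request fired after click")
recorder.record("console", "warning logged by save handler")
recorder.record("dom", "Save button still spinning")

captured = recorder.snapshot()
```

Because the buffer is bounded, recording can stay on without growing memory; the trade-off is that anything older than the last `max_events` entries is gone by the time you file.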

Steps to reproduce are becoming an interface an agent executes

Until recently, 'steps to reproduce' had exactly one consumer: a human reading prose. That assumption is breaking. Chrome DevTools MCP, released September 23, 2025, lets an AI coding agent 'navigate, fill out forms, and click buttons to reproduce bugs and test complex user flows — all while inspecting the runtime environment,' reading console logs and network requests on a live page. The repro stopped being only prose a human reads; it became an action an agent performs.

Prose, though, is a weak interface for a machine. An agent can fumble through ambiguous English steps, but it acts reliably on structured context: the replay, the console, the network, and the environment exposed over the Model Context Protocol. Anthropic introduced MCP in November 2024 as an open standard precisely so agents could read external tools and data this way. Feed an agent like Claude Code or Cursor the exact failing session as structured context and it can reproduce and draft a fix without a human re-tracing a single step.
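As a rough illustration of what "structured context" means in contrast to prose, the sketch below packs the example repro from earlier in this guide into a typed record and serializes it to JSON. Every field name here is an assumption made for the example; it is not the MCP wire format, nor the schema of any specific tool.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class ReproContext:
    # All field names are hypothetical; no real tool's schema is implied.
    url: str
    environment: dict
    steps: list
    failing_step: int            # 1-based index into steps
    expected: str
    actual: str
    console_errors: list = field(default_factory=list)
    failed_requests: list = field(default_factory=list)


context = ReproContext(
    url="https://app.example.com/projects/new",
    environment={
        "browser": "Chrome 137",
        "os": "macOS 14.5",
        "viewport": "1440x900",
        "build": "a1b2c3d",
    },
    steps=[
        "Sign in as the seeded free-tier user",
        "Open https://app.example.com/projects/new in a fresh tab",
        'Type "Test" in the Name field',
        "Click Save",
    ],
    failing_step=4,
    expected="project is created, redirect to /projects/<id>",
    actual="Save button spins indefinitely; no network request fires",
)

# A machine reader gets named fields instead of having to parse English.
payload = json.dumps(asdict(context), indent=2)
```

The point of the shape is that an agent can read `failing_step`, `expected`, and `actual` directly, rather than inferring them from a paragraph.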

This is the BugMojo wedge, and it is worth stating plainly because no prose-only repro guide can claim it. BugMojo's browser extension captures the rrweb session replay, console logs, network requests, and screenshot at the moment a bug is reported, and its MCP server exposes that captured reproduction to AI agents as structured context. The steps you write stay useful for the human in the queue; the captured artifact is what an agent reads to act. To be straight about the boundary: BugMojo is a capture-and-repro layer, not a mature production error-monitoring suite — if you need deep release-health dashboards or aggregate crash analytics, that is a Sentry-class job, not this one.

Write the steps for the human in the queue. Capture the repro for the machine that fixes it. The first costs you two minutes; the second costs you a browser click.
BugMojo engineering

A repro checklist you can paste into your template

Before you submit, run the list. If any line is a 'no', the bug is a clarification round away from stalling.

Known state — does step 1 start from a state a stranger can recreate (logged out, or a named seeded account)?
Altitude — is each step an action another person performs identically, no higher, no lower?
4-8 steps — if longer, is this secretly two bugs?
Expected vs actual — stated inline at the exact failing step?
Environment — browser, OS, viewport, build (or autocaptured)?
Boundary — any 'only on X' condition noted, to tell the triager where not to look?
Capture — for anything intermittent or visual, is a replay attached as the ground-truth backup?
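
The mechanical parts of this checklist can be linted before a report is filed. The sketch below is one possible shape, assuming a hypothetical `Repro` record and `check_repro` function; altitude and boundary conditions need human judgment and are deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Repro:
    # Hypothetical record for illustration; field names are assumptions.
    known_state: str              # e.g. "logged out" or a named seeded account
    steps: list
    failing_step: Optional[int]   # 1-based; None if no step is marked as failing
    expected: str
    actual: str
    environment: str              # browser, OS, viewport, build
    intermittent: bool = False
    replay_attached: bool = False


def check_repro(repro: Repro) -> list:
    """Return the checklist items this repro fails (empty list means it passes)."""
    problems = []
    if not repro.known_state.strip():
        problems.append("known state")
    if not 4 <= len(repro.steps) <= 8:
        problems.append("4-8 steps")
    marked = (
        repro.failing_step is not None
        and 1 <= repro.failing_step <= len(repro.steps)
    )
    if not (marked and repro.expected.strip() and repro.actual.strip()):
        problems.append("expected vs actual")
    if not repro.environment.strip():
        problems.append("environment")
    if repro.intermittent and not repro.replay_attached:
        problems.append("capture")
    return problems


good = Repro(
    known_state="signed in as the seeded free-tier user",
    steps=["Sign in", "Open /projects/new in a fresh tab", 'Type "Test"', "Click Save"],
    failing_step=4,
    expected="project is created, redirect to /projects/<id>",
    actual="Save button spins indefinitely; no network request fires",
    environment="Chrome 137, macOS 14.5, viewport 1440x900, build a1b2c3d",
)
vague = Repro(
    known_state="",
    steps=["Go to checkout", "It broke"],
    failing_step=None,
    expected="",
    actual="checkout broke",
    environment="",
    intermittent=True,
)

good_problems = check_repro(good)
vague_problems = check_repro(vague)
```

Here `good_problems` comes back empty, while `vague_problems` lists every mechanical item the report misses, which is the clarification round caught before it happens.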


Sources

Works for Me! Cannot Reproduce — A Large Scale Empirical Study of Non-reproducible Bugs — Empirical Software Engineering (Rahman, Khomh, Castelluccio) (2022)
Chrome DevTools (MCP) for your AI agent — reproduce bugs, inspect console + network — Chrome for Developers (Google) (2025-09-23)
2025 Quality Transformation Report — key findings — Tricentis (2025-05)
Google DORA 2024 — AI impact summary (throughput vs rework/instability) — GitClear (Google DORA data) (2024-2025)
The bug report that took four days to fix — Ali El-Shayeb, QA meets AI (Medium) (2026-03-12)
Introducing the Model Context Protocol — Anthropic (2024-11)
