Frequently asked questions

Q: When should you write an incident postmortem?

Write one whenever an incident crosses a defined threshold. Google SRE's common triggers are user-visible downtime or degradation past a set bar, data loss of any kind, on-call intervention such as a rollback or traffic reroute, resolution time over a threshold, or a monitoring failure that forced manual discovery. PagerDuty makes postmortems mandatory for every SEV-1 and SEV-2. Teams tune the exact thresholds, and any stakeholder can request a postmortem for an event they consider significant.


Blameless Incident Postmortem: Template + Worked Example

Q: What sections should a postmortem template include?

A complete template covers a summary, customer and business impact, a detailed timeline, the trigger, detection and resolution, contributing factors (not a single root cause), a 'what went well / what didn't / where we got lucky' retrospective, and tracked action items with owners. PagerDuty's template adds an explicit Responders section and pre-drafted internal and external messaging. Each action item should map to a ticket so prevention work is visible and does not quietly disappear after the incident closes.

Q: What is the difference between a root cause and contributing factors?

A single 'root cause' implies one fault you can fix to prevent recurrence. Most real outages have several interacting causes — a config change, a missing alert, an ambiguous runbook, time pressure — so listing only one misses the changes that would help most. 'Contributing factors' captures the full set across technical, procedural, and cultural dimensions. PagerDuty's template uses Contributing Factors deliberately, and the Five Whys is a useful prompt as long as it branches into multiple causes rather than one chain.

Q: How long after an incident should the postmortem be done?

Soon, while memory and logs are fresh. PagerDuty schedules the postmortem review meeting within 3 calendar days for a SEV-1 and within 5 business days for a SEV-2. The Incident Commander names a postmortem owner at or just after the incident call; that owner reconstructs the timeline, gathers evidence, and drafts the document before the meeting so the group reviews a complete draft rather than starting from a blank page.

Q: How can AI agents help with incident postmortems?

AI coding agents can draft the first version. With BugMojo's MCP server, an agent like Claude Code or Cursor reads the captured rrweb session replay, console errors, and network requests for an incident, reconstructs the timeline, and proposes contributing factors and candidate action items directly in your editor. A human still owns the final blameless write-up and the judgment calls, but the agent removes the blank-page cost and grounds the draft in real captured evidence instead of memory.

A copy-paste blameless postmortem template, a worked example, and the difference between a single root cause and contributing factors — grounded in Google SRE and PagerDuty practice.

Hrishikesh Baidya · Jun 5, 2026 · 11 min read


Figure: isometric line-art of a blameless postmortem worksheet with branching contributing-factor nodes, fed by session-replay and log evidence into an MCP AI-agent chip.

A postmortem is the cheapest reliability investment you will ever make, and the easiest to do badly. Done well, it converts one painful hour into a durable set of fixes the whole org can see. Done badly, it becomes a blame ritual that teaches engineers to hide problems — which guarantees the next outage is worse.

This is the operational version: a copy-paste template, a fully worked example with real numbers, and a clear line between a single root cause and the contributing factors that actually explain most failures. It is built on two of the most-cited public sources on the topic, Google's SRE book and PagerDuty's incident-response docs, and it ends with how an AI agent can draft the first version from captured evidence.

What is a blameless postmortem?

A blameless postmortem is a written record of an incident that documents impact, timeline, contributing factors, and follow-up actions without blaming any person. It assumes everyone acted reasonably on the information they had, and asks what systemic conditions allowed the failure — not who made the mistake. It is a learning artifact, not a disciplinary one.

PagerDuty's blameless guidance is blunt about why this matters. It explicitly rejects the 'old view' that people's mistakes cause failure, and reframes human error as a symptom of a systemic problem. The practical move is in the grammar: steer the write-up toward 'what' and 'how' questions and away from blame-attributing 'why did you' questions. PagerDuty cites Etsy as a blameless pioneer, and the logic is self-interested rather than soft — an engineer who fears punishment will give you a vague, defensive account, and you cannot fix a system you cannot see honestly.

The economic case: why fast, repeatable postmortems pay

Downtime is expensive enough to justify a repeatable process. ITIC's 2024 survey of more than 1,000 firms found a single hour of downtime now exceeds $300,000 for over 90% of mid-size and large enterprises, and 41% of enterprises put the hourly cost between $1 million and $5 million or more. At those rates, a postmortem that prevents even one recurrence covers the cost of writing many more.

ITIC 2024: hourly cost of downtime (share of mid-size and large enterprises)

Hourly cost of downtime	Share of enterprises
Over $300K / hour	90%
$1M–$5M+ / hour	41%

Source: ITIC 2024 Hourly Cost of Downtime Report (survey of 1,000+ firms, Nov 2023–Mar 2024)

The number is not an argument for ceremony. It is an argument for a process so light that doing one is never the bottleneck. If a postmortem takes a week of meetings, teams skip them under pressure — exactly when the lessons are most valuable. The template below is meant to be filled in by one owner in an afternoon, then reviewed by the group, not authored by committee from a blank page.

When to write one: the five SRE triggers

Google SRE lists five common triggers that should automatically require a postmortem: user-visible downtime or degradation beyond a threshold, data loss of any kind, on-call engineer intervention such as a rollback or traffic reroute, resolution time above a threshold, and a monitoring failure that implies a problem was found manually. Any stakeholder can also request one.

Codifying triggers removes the worst conversation in incident response: arguing, after the fact, about whether an event 'deserves' a postmortem. If it tripped a trigger, it gets one. PagerDuty layers a severity rule on top — postmortems are mandatory for every SEV-1 and SEV-2, no exceptions — and the Incident Commander names a single postmortem owner so accountability for the document is never diffuse.

User-visible downtime or degradation past your agreed bar (error rate, latency, availability).
Data loss of any kind. No threshold — any is enough.
On-call intervention: a human had to roll back, reroute traffic, or fail over.
Resolution time over a threshold — even a contained incident that took too long to resolve.
Monitoring failure: you found out from a customer or by chance, not from an alert.
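
The five triggers, plus PagerDuty's severity rule and the stakeholder request, are simple enough to encode as a check. The following is a minimal sketch, not anyone's official implementation: the `Incident` fields and both threshold constants are hypothetical names and values, and you would tune them to your own service.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    severity: int                 # 1 = SEV-1, 2 = SEV-2, and so on
    user_visible_minutes: float   # minutes of user-visible downtime or degradation
    data_lost: bool
    oncall_intervened: bool       # rollback, traffic reroute, or failover
    minutes_to_resolve: float
    found_by_alert: bool          # False means a customer or chance discovery
    stakeholder_requested: bool = False


# Hypothetical thresholds: agree on your own numbers before the next incident.
USER_VISIBLE_BAR_MIN = 5.0
RESOLUTION_BAR_MIN = 60.0


def postmortem_reasons(inc: Incident) -> list[str]:
    """Return every trigger the incident tripped; an empty list means none."""
    reasons = []
    if inc.user_visible_minutes > USER_VISIBLE_BAR_MIN:
        reasons.append("user-visible downtime or degradation")
    if inc.data_lost:  # no threshold: any data loss is enough
        reasons.append("data loss")
    if inc.oncall_intervened:
        reasons.append("on-call intervention")
    if inc.minutes_to_resolve > RESOLUTION_BAR_MIN:
        reasons.append("resolution time over threshold")
    if not inc.found_by_alert:
        reasons.append("monitoring failure")
    if inc.severity in (1, 2):
        reasons.append("SEV-1/SEV-2 severity rule")
    if inc.stakeholder_requested:
        reasons.append("stakeholder request")
    return reasons


# A low-severity event nobody was paged for still trips the monitoring trigger.
quiet_miss = Incident(severity=3, user_visible_minutes=0, data_lost=False,
                      oncall_intervened=False, minutes_to_resolve=20,
                      found_by_alert=False)
print(postmortem_reasons(quiet_miss))  # ['monitoring failure']
```

Returning the list of reasons rather than a bare boolean means the postmortem can open with why it was required, which shortens the "does this deserve one?" argument further.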

Root cause vs contributing factors

A single root cause implies one fault you can fix to stop recurrence. Most real outages have several interacting causes: a config change, a missing alert, an ambiguous runbook, time pressure. Listing only one misses the changes that would help most. Contributing factors capture the full set across technical, procedural, and cultural dimensions — which is why PagerDuty's template uses that heading deliberately.

PagerDuty's incident-response template does not have a 'root cause' box. It has Contributing Factors, plural, on purpose. The Five Whys is still a useful prompt — as long as it branches into multiple causes rather than collapsing into a single tidy chain. A good test: if your postmortem names exactly one cause and one fix, you probably stopped asking too early.

Sort contributing factors into three buckets so the action items distribute across the system, not just the code:

Technical — the bug, the config, the missing index, the unhandled error path.
Procedural — the runbook gap, the missing alert, the deploy that skipped staging, the unclear ownership.
Cultural / organizational — the time pressure, the alert fatigue, the tribal knowledge that wasn't written down.
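
The three buckets and the "one cause, one fix" test can be turned into a quick review prompt for a draft. This is a rough sketch under stated assumptions: the function name, the `(bucket, description)` tuple shape, and the warning wording are all made up for illustration, and an empty bucket is a nudge to look again, not proof that something is missing.

```python
from collections import Counter

BUCKETS = ("technical", "procedural", "cultural")


def review_factors(factors: list[tuple[str, str]], fixes: int) -> list[str]:
    """factors holds (bucket, description) pairs; fixes is the action item count."""
    warnings = []
    unknown = sorted({bucket for bucket, _ in factors if bucket not in BUCKETS})
    if unknown:
        warnings.append(f"unknown bucket(s): {', '.join(unknown)}")
    if len(factors) == 1 and fixes == 1:
        warnings.append("one cause and one fix: you probably stopped asking too early")
    counts = Counter(bucket for bucket, _ in factors)
    for bucket in BUCKETS:
        if counts[bucket] == 0:
            warnings.append(f"no {bucket} factor listed")
    return warnings


draft = [("technical", "unhandled error path in the retry loop")]
for warning in review_factors(draft, fixes=1):
    print(warning)
```

For the single-factor draft above, the check flags the one-cause pattern and the two empty buckets, which is the cue to ask what in the process and the organization let the bug reach production.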

Reframe human error as a symptom of a systemic problem, and ask 'what' and 'how' instead of 'why did you'.
PagerDuty — The Blameless Postmortem

The copy-paste template

A complete template covers summary, impact, timeline, trigger, detection, resolution, contributing factors (plural), a what-went-well retrospective, and tracked action items with owners. PagerDuty adds explicit Responders and pre-drafted messaging sections. Every action item maps to a ticket so prevention work stays visible and does not quietly disappear once the incident closes.

This merges the section lists from Google SRE's example postmortem and PagerDuty's template. Paste it into your incident tool, fill it in, and keep the headings even when a section is short: a deliberately empty 'Where we got lucky' is itself a signal.

postmortem-template.md (Markdown)

# Postmortem: [short incident name]

**Status:** Draft | In review | Final
**Severity:** SEV-1 | SEV-2 | SEV-3
**Date:** [incident date]   **Authors / owner:** [postmortem owner]
**Reviewers:** [names]   **Incident Commander:** [name]

## Summary
[2-3 sentences: what broke, who it affected, how long. Readable by someone
outside the team.]

## Impact
- Customer impact: [who, how many, what they couldn't do]
- Business impact: [revenue / SLA / data — quantify if you can]
- Duration: [start -> mitigated -> fully resolved]

## Trigger
[The single event that kicked it off — a deploy, a traffic spike, a config
change. Distinct from the contributing factors.]

## Detection
[How did we find out? Alert / customer / manual? If manual, that's a
monitoring-failure action item.]

## Resolution
[What actually mitigated it, in order. Note the on-call interventions —
rollback, reroute, failover.]

## Contributing Factors
(Plural on purpose — not a single root cause.)
- Technical: [bug / config / missing guardrail]
- Procedural: [runbook gap / missing alert / skipped staging]
- Cultural: [time pressure / alert fatigue / tribal knowledge]

## Timeline (all times <TZ>)
| Time  | Event             |
|-------|-------------------|
| 00:00 | [trigger]         |
| 00:0X | [detection]       |
| 00:XX | [mitigation]      |
| 00:YY | [full resolution] |

## How did we do? (Retrospective)
- What went well:
- What went wrong:
- Where we got lucky:

## Action Items
| # | Action item            | Type     | Owner | Ticket    | Due  |
|---|------------------------|----------|-------|-----------|------|
| 1 | [fix]                  | prevent  | @x    | SEV2-1234 | [d]  |
| 2 | [add alert]            | mitigate | @y    | SEV2-1235 | [d]  |
| 3 | [update runbook]       | process  | @z    | SEV2-1236 | [d]  |

(Type = mitigate | prevent | process | other — Google SRE's tags.)

## Messaging
- Internal:
- External / customer-facing:

Two details in that template do real work. First, PagerDuty requires every action item to be filed as a ticket tagged with both sevN and a dated sevN_YYYYMMDD label, so follow-ups are trackable and a stale incident's open items surface in a single query. Second, Google's example tags each action item by type — mitigate, prevent, process, or other — which makes it obvious at a glance whether you actually invested in prevention or only patched the symptom.

Action items that don't rot

The single most common postmortem failure mode is that the document is excellent and nothing changes. Action items are where that is won or lost. Three rules:

One ticket per item, with an owner and a due date. An action item that lives only in the postmortem prose is a wish, not a task.
Tag by type (mitigate / prevent / process / other). If every item is 'mitigate', you have not prevented the next occurrence.
Tag with the incident (sevN_YYYYMMDD) so you can audit completion weeks later and so the prevention backlog is visible to leadership.
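
The three rules above are mechanical enough to lint before the review meeting. Here is a minimal sketch, assuming action items have been exported from your tracker into a simple structure: the `ActionItem` fields, the label regex, and the sample tickets and dates are illustrative assumptions, not any tracker's real schema.

```python
import re
from dataclasses import dataclass, field

ALLOWED_KINDS = {"mitigate", "prevent", "process", "other"}
INCIDENT_LABEL = re.compile(r"^sev\d_\d{8}$")  # for example sev2_20260605


@dataclass
class ActionItem:
    title: str
    kind: str                      # the template's "Type" column
    owner: str = ""
    ticket: str = ""
    due: str = ""
    labels: list[str] = field(default_factory=list)


def lint_action_items(items: list[ActionItem]) -> list[str]:
    """Return a list of human-readable problems; empty means the items pass."""
    problems = []
    for item in items:
        for name in ("owner", "ticket", "due"):
            if not getattr(item, name):
                problems.append(f"{item.title}: missing {name}")
        if item.kind not in ALLOWED_KINDS:
            problems.append(f"{item.title}: unknown type {item.kind!r}")
        if not any(INCIDENT_LABEL.match(label) for label in item.labels):
            problems.append(f"{item.title}: missing sevN_YYYYMMDD label")
    if items and all(item.kind == "mitigate" for item in items):
        problems.append("every item is 'mitigate': nothing prevents the next occurrence")
    return problems


items = [
    ActionItem("Add alert on 5xx rate", "mitigate", "@y", "SEV2-1235",
               "2026-06-19", ["sev2", "sev2_20260605"]),
    ActionItem("Update runbook", "mitigate", owner="@z"),
]
for problem in lint_action_items(items):
    print(problem)
```

The second item fails on ticket, due date, and incident label, and the set as a whole fails the prevention check. Running something like this in CI or as a pre-meeting script is one way to keep a polished document from shipping with wishes instead of tasks.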

For the upstream decisions — what severity to assign and how fast to respond before you ever reach the write-up — see our 15-minute triage playbook and the severity vs priority framework.

Worked example: Google's Shakespeare outage

Google's published example, the Shakespeare outage, is fully quantified: a 66-minute incident with an estimated 1.21 billion queries lost and no revenue impact. Its action items are explicitly tagged by type — mitigate, prevent, process, other — each with an owner and a bug-tracker ID. It is the reference for what 'good' looks like when filling in the template above.

Mapping Google SRE's published Shakespeare example onto the template sections makes the abstract sections concrete. Note the quantified impact and the typed action items:

Template section	From Google's example postmortem
Summary	Shakespeare Search was down for 66 minutes during a surge of interest that followed the discovery of a new sonnet.
Impact (quantified)	66 minutes; about 1.21 billion queries lost; no revenue impact recorded.
Trigger	A sudden increase in traffic triggered a latent bug.
Detection	An automated alert fired (not manual), so there is no monitoring-failure action item.
Contributing factors	Multiple: exceptionally high load, a latent resource leak hit when searches failed on a term missing from the corpus, and a team out of practice at handling cascading failure.
Action items	Each tagged mitigate, prevent, process, or other, each with an owner and a bug ID.
Sections present	Summary, Impact, Root Causes, Trigger, Resolution, Detection, Action Items, Lessons Learned, Timeline.

The lesson is not the sonnet. It is that a credible postmortem puts a number on impact (66 minutes, 1.21B queries), names multiple contributing factors, and ties every action item to an owner and a ticket. If your draft has none of those three, it is a story, not a postmortem.

Timing: SLA the write-up

Write the postmortem while memory and logs are fresh. PagerDuty schedules the review meeting within 3 calendar days for a SEV-1 and within 5 business days for a SEV-2. The Incident Commander names a postmortem owner at or just after the incident call; that owner reconstructs the timeline, gathers evidence, and drafts the document before the meeting so the group reviews a complete draft.

The owner-driven drafting workflow matters more than the exact SLA numbers. A review meeting that starts from a blank page degenerates into a group writing exercise — slow, unfocused, and prone to the loudest voice. A review meeting that starts from a complete draft is a quality check: correct the timeline, challenge the contributing factors, sharpen the action items, done. PagerDuty's step-by-step guidance puts timeline reconstruction and analysis before the meeting for exactly this reason.
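
If you want the SLA in tooling rather than in people's heads, the deadline arithmetic is short. The sketch below assumes the clock starts the day after the incident and that business days are Monday to Friday with no holiday calendar; both are simplifying assumptions, and the function name is hypothetical.

```python
from datetime import date, timedelta


def review_deadline(incident_day: date, severity: int) -> date:
    """Latest day for the review meeting: 3 calendar days (SEV-1), 5 business days (SEV-2)."""
    if severity == 1:
        return incident_day + timedelta(days=3)
    if severity == 2:
        day, remaining = incident_day, 5
        while remaining:
            day += timedelta(days=1)
            if day.weekday() < 5:  # Monday is 0, Friday is 4; weekends do not count
                remaining -= 1
        return day
    raise ValueError("no review SLA defined for this severity")


friday = date(2026, 6, 5)
print(review_deadline(friday, 1))  # 2026-06-08
print(review_deadline(friday, 2))  # 2026-06-12
```

Note that the SEV-1 window runs straight through the weekend, so an incident on a Friday puts the review on Monday at the latest. That is a good reason to name the owner on the incident call rather than the next working day.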

Where AI agents fit: draft from evidence, human owns the judgment

AI coding agents can write the first draft. With BugMojo's MCP server, an agent like Claude Code or Cursor reads the incident's captured rrweb session replay, console errors, and network requests, reconstructs the timeline, and proposes contributing factors and candidate action items in your editor. A human still owns the blameless write-up and the judgment calls; the agent removes the blank-page cost.

The expensive, error-prone part of the owner's job is reconstruction: stitching a timeline from logs, recalling what was on screen, finding the failed network call. That is precisely the part grounded in artifacts an agent can read. BugMojo's browser extension captures the rrweb DOM replay, console logs, and network requests for an error at the moment it happens; its MCP server exposes those to an agent. So instead of reconstructing the timeline from memory, the agent reconstructs it from the actual recording and pre-fills the template's Timeline and Contributing Factors sections.

The honest boundary, which the generic 'AI writes your postmortem' pages skip: the agent drafts, the human owns. Deciding which contributing factor matters most, keeping the language blameless, and committing the team to prevention work are judgment calls. The agent's value is removing the blank page and grounding the draft in first-hand evidence — not making the calls.
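
To make "reconstruct the timeline from evidence" concrete, here is a deliberately generic sketch of the mechanical step: merging timestamped events from different sources into rows for the template's Timeline table. It does not use BugMojo's MCP server or its data format. The event dictionaries, their `ts`, `source`, and `message` keys, and the sample values are all hypothetical.

```python
from datetime import datetime


def timeline_rows(events: list[dict]) -> str:
    """Sort events by timestamp and render them as a Markdown Time/Event table."""
    # Each event carries an ISO 8601 timestamp with a UTC offset (an assumption).
    parsed = [(datetime.fromisoformat(e["ts"]), e) for e in events]
    parsed.sort(key=lambda pair: pair[0])
    lines = ["| Time  | Event |", "|-------|-------|"]
    for when, e in parsed:
        lines.append(f"| {when:%H:%M} | {e['source']}: {e['message']} |")
    return "\n".join(lines)


events = [
    {"ts": "2026-06-05T14:02:10+00:00", "source": "network",
     "message": "POST /checkout returned 500"},
    {"ts": "2026-06-05T14:01:55+00:00", "source": "console",
     "message": "TypeError in cart.js"},
]
print(timeline_rows(events))
```

Sorting and formatting is the part a machine does well. Deciding which of those rows is the trigger and which is only a symptom remains the owner's call.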

Feature	Manual (memory-based)	BugMojo + MCP agent	Error monitor (e.g. Sentry)
Copy-paste template	You supply it	✓	Partial
Timeline reconstructed from captured replay	—	✓	Breadcrumbs only
Console + network attached to the incident	—	✓	✓
AI agent drafts contributing factors from evidence (MCP)	—	✓	—
Mature production error-rate monitoring at scale	—	Not yet	✓
Deep alerting / on-call paging	—	—	Via integrations
Human owns the blameless final write-up	✓	✓	✓

Two-sided: the MCP agent-drafting row is the BugMojo wedge no competitor has — but a dedicated error monitor still beats BugMojo on mature production error-rate monitoring and deep on-call paging.

Common mistakes

Naming a single root cause. Real outages have several. One cause and one fix usually means you stopped asking too early.
Blame leaking in. 'Why did you deploy on Friday' is a blame question. 'What in the system allowed an unreviewed Friday deploy' is a postmortem question.
Action items with no ticket. Prose-only items rot. One ticket, one owner, one due date, tagged to the incident.
Skipping the postmortem because impact was small. A monitoring miss with tiny impact is still a defect — the missing alert is the whole point.
Drafting in the meeting. The owner drafts first; the group reviews. A blank page in a room of ten is the slowest way to write anything.

Next steps

Paste the template into your incident tool and hardcode the five SRE triggers into your 'does this need a postmortem?' check.
Add the SLAs to your runbook: SEV-1 review in 3 calendar days, SEV-2 in 5 business days, owner named on the incident call.
Capture replay + console + network in production so the timeline reconstructs from evidence, not memory — then let an agent pre-draft via MCP. Read the upstream 15-minute triage playbook for the steps before the write-up.



Sources

Google SRE Book — Postmortem Culture: Learning from Failure (Ch. 15) — Google (accessed 2026-06)
Google SRE Book — Example Postmortem (the Shakespeare outage) — Google (accessed 2026-06)
PagerDuty Incident Response — Postmortem Template — PagerDuty (accessed 2026-06)
PagerDuty Incident Response — Postmortem Process (SEV-1/SEV-2 SLAs) — PagerDuty (accessed 2026-06)
PagerDuty Postmortem Docs — The Blameless Postmortem — PagerDuty (accessed 2026-06)
ITIC 2024 Hourly Cost of Downtime Report — Information Technology Intelligence Consulting (ITIC) (2024)
