Blameless Incident Postmortem: Template + Worked Example
A copy-paste blameless postmortem template, a worked example, and the difference between a single root cause and contributing factors — grounded in Google SRE and PagerDuty practice.

A postmortem is the cheapest reliability investment you will ever make, and the easiest to do badly. Done well, it converts one painful hour into a durable set of fixes the whole org can see. Done badly, it becomes a blame ritual that teaches engineers to hide problems — which guarantees the next outage is worse.
This is the operational version: a copy-paste template, a fully worked example with real numbers, and a clear line between a single root cause and the contributing factors that actually explain most failures. It is built on two of the most-cited public sources on the topic, Google's SRE book and PagerDuty's incident-response docs, and it ends with how an AI agent can draft the first version from captured evidence.
What is a blameless postmortem?
A blameless postmortem is a written record of an incident that documents impact, timeline, contributing factors, and follow-up actions without blaming any person. It assumes everyone acted reasonably on the information they had, and asks what systemic conditions allowed the failure — not who made the mistake. It is a learning artifact, not a disciplinary one.
PagerDuty's blameless guidance is blunt about why this matters. It explicitly rejects the 'old view' that people's mistakes cause failure, and reframes human error as a symptom of a systemic problem. The practical move is in the grammar: steer the write-up toward 'what' and 'how' questions and away from blame-attributing 'why did you' questions. PagerDuty cites Etsy as a blameless pioneer, and the logic is self-interested rather than soft — an engineer who fears punishment will give you a vague, defensive account, and you cannot fix a system you cannot see honestly.
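One rough way to apply the 'what' and 'how' rule is a draft linter that flags blame-attributing phrasing before review. The sketch below is illustrative only: the pattern list and the `flag_blame_language` name are assumptions, not part of PagerDuty's guidance, and a regex cannot judge tone, so treat any hit as a prompt to rephrase rather than a verdict.

```python
import re

# Hypothetical starter patterns; extend with phrasing your own team slips into.
BLAME_PATTERNS = [
    r"\bwhy did (you|he|she|they)\b",
    r"\bwhy didn't (you|he|she|they)\b",
]


def flag_blame_language(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs that match a blame pattern."""
    flagged = []
    for number, line in enumerate(text.splitlines(), start=1):
        if any(re.search(p, line, flags=re.IGNORECASE) for p in BLAME_PATTERNS):
            flagged.append((number, line.strip()))
    return flagged
```

Run it over the draft text and rewrite each flagged line as a question about the system, for example what allowed the change to ship unreviewed.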
The economic case: why fast, repeatable postmortems pay
Downtime is expensive enough to justify a repeatable process. ITIC's 2024 survey of more than 1,000 firms found a single hour of downtime now exceeds $300,000 for over 90% of mid-size and large enterprises, and 41% of enterprises put the hourly cost between $1 million and $5 million or more. A postmortem that prevents one recurrence pays for many.
The number is not an argument for ceremony. It is an argument for a process so light that doing one is never the bottleneck. If a postmortem takes a week of meetings, teams skip them under pressure — exactly when the lessons are most valuable. The template below is meant to be filled in by one owner in an afternoon, then reviewed by the group, not authored by committee from a blank page.
When to write one: the five SRE triggers
Google SRE lists five common triggers that should automatically require a postmortem: user-visible downtime or degradation beyond a threshold, data loss of any kind, on-call engineer intervention such as a rollback or traffic reroute, resolution time above a threshold, and a monitoring failure that implies a problem was found manually. Any stakeholder can also request one.
Codifying triggers removes the worst conversation in incident response: arguing, after the fact, about whether an event 'deserves' a postmortem. If it tripped a trigger, it gets one. PagerDuty layers a severity rule on top — postmortems are mandatory for every SEV-1 and SEV-2, no exceptions — and the Incident Commander names a single postmortem owner so accountability for the document is never diffuse.
- User-visible downtime or degradation past your agreed bar (error rate, latency, availability).
- Data loss of any kind. No threshold — any is enough.
- On-call intervention: a human had to roll back, reroute traffic, or fail over.
- Resolution time over a threshold — even a contained incident that took too long to resolve.
- Monitoring failure: you found out from a customer or by chance, not from an alert.
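The triggers above can be codified so the decision is mechanical. This is a minimal sketch: the `Incident` fields and both threshold values are hypothetical placeholders, since Google SRE leaves the actual bars to each team.

```python
from dataclasses import dataclass

# Hypothetical thresholds: agree on real values with your team.
MAX_ERROR_RATE = 0.01          # user-visible degradation bar (1% of requests)
MAX_RESOLUTION_MINUTES = 60    # resolution-time bar


@dataclass
class Incident:
    peak_error_rate: float       # fraction of failed user requests, 0.0-1.0
    data_lost: bool
    oncall_intervened: bool      # rollback, reroute, or failover
    resolution_minutes: int
    detected_by_alert: bool      # False means found manually or by a customer
    stakeholder_requested: bool = False


def postmortem_triggers(incident: Incident) -> list[str]:
    """Return the name of every trigger the incident tripped."""
    checks = {
        "user-visible degradation": incident.peak_error_rate > MAX_ERROR_RATE,
        "data loss": incident.data_lost,
        "on-call intervention": incident.oncall_intervened,
        "resolution time": incident.resolution_minutes > MAX_RESOLUTION_MINUTES,
        "monitoring failure": not incident.detected_by_alert,
        "stakeholder request": incident.stakeholder_requested,
    }
    return [name for name, tripped in checks.items() if tripped]


def needs_postmortem(incident: Incident) -> bool:
    """Any single tripped trigger is enough."""
    return bool(postmortem_triggers(incident))
```

Returning the list of tripped triggers, not just a boolean, lets the postmortem's Trigger and Detection sections start from the reason the document exists.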
Root cause vs contributing factors
A single root cause implies one fault you can fix to stop recurrence. Most real outages have several interacting causes: a config change, a missing alert, an ambiguous runbook, time pressure. Listing only one misses the changes that would help most. Contributing factors capture the full set across technical, procedural, and cultural dimensions — which is why PagerDuty's template uses that heading deliberately.
PagerDuty's incident-response template does not have a 'root cause' box. It has Contributing Factors, plural, on purpose. The Five Whys is still a useful prompt — as long as it branches into multiple causes rather than collapsing into a single tidy chain. A good test: if your postmortem names exactly one cause and one fix, you probably stopped asking too early.
Sort contributing factors into three buckets so the action items distribute across the system, not just the code:
- Technical — the bug, the config, the missing index, the unhandled error path.
- Procedural — the runbook gap, the missing alert, the deploy that skipped staging, the unclear ownership.
- Cultural / organizational — the time pressure, the alert fatigue, the tribal knowledge that wasn't written down.
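The three-bucket sort can also be checked mechanically before review. The sketch below assumes a hypothetical representation of factors as `(bucket, description)` pairs; the warnings are prompts for the owner, not rules from either source.

```python
from collections import Counter

BUCKETS = ("technical", "procedural", "cultural")


def review_factors(factors: list[tuple[str, str]]) -> list[str]:
    """Take (bucket, description) pairs and return warnings for the owner."""
    warnings = []
    unknown = sorted({bucket for bucket, _ in factors if bucket not in BUCKETS})
    if unknown:
        warnings.append(f"unknown bucket(s): {', '.join(unknown)}")
    if len(factors) < 2:
        warnings.append("fewer than two factors listed: you may have stopped asking too early")
    counts = Counter(bucket for bucket, _ in factors)
    for bucket in BUCKETS:
        if counts[bucket] == 0:
            warnings.append(f"no {bucket} factor listed")
    return warnings
```

An empty bucket is not automatically wrong, but it is worth a deliberate second look before the draft goes to review.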
PagerDuty, The Blameless Postmortem: Reframe human error as a symptom of a systemic problem, and ask 'what' and 'how' instead of 'why did you'.
The copy-paste template
A complete template covers summary, impact, timeline, trigger, detection, resolution, contributing factors (plural), a what-went-well retrospective, and tracked action items with owners. PagerDuty adds explicit Responders and pre-drafted messaging sections. Every action item maps to a ticket so prevention work stays visible and does not quietly disappear once the incident closes.
This template merges the section lists from Google SRE's example postmortem and PagerDuty's template. Paste it into your incident tool, fill it in, and keep the headings even when a section is short: a deliberately empty 'Where we got lucky' is itself a signal.
# Postmortem: [short incident name]
**Status:** Draft | In review | Final
**Severity:** SEV-1 | SEV-2 | SEV-3
**Date:** [incident date] **Authors / owner:** [postmortem owner]
**Reviewers:** [names] **Incident Commander:** [name]
## Summary
[2-3 sentences: what broke, who it affected, how long. Readable by someone
outside the team.]
## Impact
- Customer impact: [who, how many, what they couldn't do]
- Business impact: [revenue / SLA / data — quantify if you can]
- Duration: [start -> mitigated -> fully resolved]
## Trigger
[The single event that kicked it off — a deploy, a traffic spike, a config
change. Distinct from the contributing factors.]
## Detection
[How did we find out? Alert / customer / manual? If manual, that's a
monitoring-failure action item.]
## Resolution
[What actually mitigated it, in order. Note the on-call interventions —
rollback, reroute, failover.]
## Contributing Factors
(Plural on purpose — not a single root cause.)
- Technical: [bug / config / missing guardrail]
- Procedural: [runbook gap / missing alert / skipped staging]
- Cultural: [time pressure / alert fatigue / tribal knowledge]
## Timeline (all times <TZ>)
| Time | Event |
|-------|---------------------------------------------------|
| 00:00 | [trigger] |
| 00:0X | [detection] |
| 00:XX | [mitigation] |
| 00:YY | [full resolution] |
## How did we do? (Retrospective)
- What went well:
- What went wrong:
- Where we got lucky:
## Action Items
| # | Action item | Type | Owner | Ticket | Due |
|---|------------------------|----------|-------|-----------|------|
| 1 | [fix] | prevent | @x | SEV2-1234 | [d] |
| 2 | [add alert] | mitigate | @y | SEV2-1235 | [d] |
| 3 | [update runbook] | process | @z | SEV2-1236 | [d] |
(Type = mitigate | prevent | process | other — Google SRE's tags.)
## Messaging
- Internal:
- External / customer-facing:

Two details in that template do real work. First, PagerDuty requires every action item to be filed as a ticket tagged with both sevN and a dated sevN_YYYYMMDD label, so follow-ups are trackable and a stale incident's open items surface in a single query. Second, Google's example tags each action item by type (mitigate, prevent, process, or other), which makes it obvious at a glance whether you actually invested in prevention or only patched the symptom.
Action items that don't rot
The single most common postmortem failure mode is that the document is excellent and nothing changes. Action items are where that is won or lost. Three rules:
- One ticket per item, with an owner and a due date. An action item that lives only in the postmortem prose is a wish, not a task.
- Tag by type (mitigate / prevent / process / other). If every item is 'mitigate', you have not prevented the next occurrence.
- Tag with the incident (sevN_YYYYMMDD) so you can audit completion weeks later and so the prevention backlog is visible to leadership.
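The three rules above are easy to enforce with a small check before the postmortem is marked final. This is a sketch under stated assumptions: the `ActionItem` shape and function names are hypothetical, and filing the labels in your tracker is left to whatever tool you use.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

ACTION_TYPES = {"mitigate", "prevent", "process", "other"}


def incident_labels(severity: int, incident_date: date) -> list[str]:
    """Build the sevN and sevN_YYYYMMDD labels for an incident's tickets."""
    return [f"sev{severity}", f"sev{severity}_{incident_date:%Y%m%d}"]


@dataclass
class ActionItem:
    title: str
    type: str
    owner: Optional[str] = None
    ticket: Optional[str] = None
    due: Optional[date] = None


def lint_action_items(items: list[ActionItem]) -> list[str]:
    """Return one message per rule violation; an empty list means clean."""
    problems = []
    for item in items:
        if item.type not in ACTION_TYPES:
            problems.append(f"{item.title}: unknown type {item.type!r}")
        for field in ("owner", "ticket", "due"):
            if getattr(item, field) is None:
                problems.append(f"{item.title}: missing {field}")
    if items and all(item.type == "mitigate" for item in items):
        problems.append("every item is 'mitigate': nothing prevents recurrence")
    return problems
```

Applying both labels to every ticket means one tracker query per incident shows which follow-ups are still open.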
For the upstream decisions — what severity to assign and how fast to respond before you ever reach the write-up — see our 15-minute triage playbook and the severity vs priority framework.
Worked example: Google's Shakespeare outage
Google's published example, the Shakespeare outage, is fully quantified: a 66-minute incident with an estimated 1.21 billion queries lost and no revenue impact. Its action items are explicitly tagged by type — mitigate, prevent, process, other — each with an owner and a bug-tracker ID. It is the reference for what 'good' looks like when filling in the template above.
Mapping Google SRE's published Shakespeare example onto the template makes the abstract sections concrete. Note the quantified impact and the typed action items:
| Template section | From Google's example postmortem |
|---|---|
| Summary | A new Shakespeare sonnet triggered a cascading query overload; service was down for a class of users. |
| Impact (quantified) | 66 minutes; about 1.21 billion queries lost; no revenue impact recorded. |
| Trigger | A spike in queries for a newly discovered sonnet exceeded provisioned capacity. |
| Detection | An automated alert fired (not manual), so there is no monitoring-failure action item. |
| Contributing factors | Multiple: capacity headroom, a latent bug surfaced under load, and a slow human escalation path. |
| Action items | Each tagged mitigate, prevent, process, or other, each with an owner and a bug ID. |
| Sections present | Summary, Impact, Root Causes, Trigger, Resolution, Detection, Action Items, Lessons Learned, Timeline. |
The lesson is not the sonnet. It is that a credible postmortem puts a number on impact (66 minutes, 1.21B queries), names multiple contributing factors, and ties every action item to an owner and a ticket. If your draft has none of those three, it is a story, not a postmortem.
Timing: SLA the write-up
Write the postmortem while memory and logs are fresh. PagerDuty schedules the review meeting within 3 calendar days for a SEV-1 and within 5 business days for a SEV-2. The Incident Commander names a postmortem owner at or just after the incident call; that owner reconstructs the timeline, gathers evidence, and drafts the document before the meeting so the group reviews a complete draft.
The owner-driven drafting workflow matters more than the exact SLA numbers. A review meeting that starts from a blank page degenerates into a group writing exercise — slow, unfocused, and prone to the loudest voice. A review meeting that starts from a complete draft is a quality check: correct the timeline, challenge the contributing factors, sharpen the action items, done. PagerDuty's step-by-step guidance puts timeline reconstruction and analysis before the meeting for exactly this reason.
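The review windows quoted above reduce to simple date arithmetic. The sketch below assumes the clock starts on the incident date and that business days are Monday to Friday with no holiday calendar; both are assumptions to adjust, not details confirmed by PagerDuty's docs.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Step forward `days` weekdays (Mon-Fri); public holidays are ignored."""
    current = start
    while days > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            days -= 1
    return current


def review_deadline(severity: int, incident_date: date) -> date:
    """Latest date for the review meeting under the SLAs described above."""
    if severity == 1:
        return incident_date + timedelta(days=3)      # 3 calendar days
    if severity == 2:
        return add_business_days(incident_date, 5)    # 5 business days
    raise ValueError("no review window defined for this severity")
```

For an incident on a Friday, the SEV-1 window lands on the following Monday, which is why calendar days and business days are handled separately.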
Where AI agents fit: draft from evidence, human owns the judgment
AI coding agents can write the first draft. With BugMojo's MCP server, an agent like Claude Code or Cursor reads the incident's captured rrweb session replay, console errors, and network requests, reconstructs the timeline, and proposes contributing factors and candidate action items in your editor. A human still owns the blameless write-up and the judgment calls; the agent removes the blank-page cost.
The expensive, error-prone part of the owner's job is reconstruction: stitching a timeline from logs, recalling what was on screen, finding the failed network call. That is precisely the part grounded in artifacts an agent can read. BugMojo's browser extension captures the rrweb DOM replay, console logs, and network requests for an error at the moment it happens; its MCP server exposes those to an agent. So instead of reconstructing the timeline from memory, the agent reconstructs it from the actual recording and pre-fills the template's Timeline and Contributing Factors sections.
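To make the reconstruction step concrete, here is a minimal sketch of turning captured events into the template's Timeline table. The event shape (`ts`, `source`, `message`) is a hypothetical stand-in chosen for illustration; it is not BugMojo's schema or MCP interface, and a real agent would read whatever fields the capture tool actually exposes.

```python
from datetime import datetime, timezone


def render_timeline(events: list[dict]) -> str:
    """Render captured events as a markdown Timeline table (times in UTC).

    Each event is a dict with hypothetical keys: 'ts' (epoch seconds),
    'source' (for example 'console' or 'network') and 'message'.
    """
    rows = ["| Time | Event |", "|-------|-------|"]
    for event in sorted(events, key=lambda e: e["ts"]):
        stamp = datetime.fromtimestamp(event["ts"], tz=timezone.utc).strftime("%H:%M:%S")
        # Escape pipes so a message cannot break the table layout.
        message = event["message"].replace("|", "\\|")
        rows.append(f"| {stamp} | [{event['source']}] {message} |")
    return "\n".join(rows)
```

Sorting by timestamp before rendering matters because console and network events usually arrive as separate streams; the owner still reviews the merged table for gaps and wrong ordering.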
The honest boundary, which the generic 'AI writes your postmortem' pages skip: the agent drafts, the human owns. Deciding which contributing factor matters most, keeping the language blameless, and committing the team to prevention work are judgment calls. The agent's value is removing the blank page and grounding the draft in first-hand evidence — not making the calls.
| Feature | Manual (memory-based) | BugMojo + MCP agent | Error monitor (e.g. Sentry) |
|---|---|---|---|
| Copy-paste template | You supply it | ✓ | Partial |
| Timeline reconstructed from captured replay | — | ✓ | Breadcrumbs only |
| Console + network attached to the incident | — | ✓ | ✓ |
| AI agent drafts contributing factors from evidence (MCP) | — | ✓ | — |
| Mature production error-rate monitoring at scale | — | Not yet | ✓ |
| Deep alerting / on-call paging | — | — | Via integrations |
| Human owns the blameless final write-up | ✓ | ✓ | ✓ |
Common mistakes
- Naming a single root cause. Real outages have several. One cause and one fix usually means you stopped asking too early.
- Blame leaking in. 'Why did you deploy on Friday' is a blame question. 'What in the system allowed an unreviewed Friday deploy' is a postmortem question.
- Action items with no ticket. Prose-only items rot. One ticket, one owner, one due date, tagged to the incident.
- Skipping the postmortem because impact was small. A monitoring miss with tiny impact is still a defect — the missing alert is the whole point.
- Drafting in the meeting. The owner drafts first; the group reviews. A blank page in a room of ten is the slowest way to write anything.
Next steps
- Paste the template into your incident tool and hardcode the five SRE triggers into your 'does this need a postmortem?' check.
- Add the SLAs to your runbook: SEV-1 review in 3 calendar days, SEV-2 in 5 business days, owner named on the incident call.
- Capture replay + console + network in production so the timeline reconstructs from evidence, not memory — then let an agent pre-draft via MCP. Read the upstream 15-minute triage playbook for the steps before the write-up.
Sources
- Google SRE Book — Postmortem Culture: Learning from Failure (Ch. 15) — Google (accessed 2026-06)
- Google SRE Book — Example Postmortem (the Shakespeare outage) — Google (accessed 2026-06)
- PagerDuty Incident Response — Postmortem Template — PagerDuty (accessed 2026-06)
- PagerDuty Incident Response — Postmortem Process (SEV-1/SEV-2 SLAs) — PagerDuty (accessed 2026-06)
- PagerDuty Postmortem Docs — The Blameless Postmortem — PagerDuty (accessed 2026-06)
- ITIC 2024 Hourly Cost of Downtime Report — Information Technology Intelligence Consulting (ITIC) (2024)

