Most postmortems are written to be filed, not read.
They satisfy a process requirement, get linked in Slack once, and then join a long list of zombie documents in Confluence. Everyone involved knows this. Senior engineers have lived through it enough times to be numb.
The failure is not effort. Teams spend hours reconstructing timelines, debating phrasing, and negotiating tone. The failure is that the document is treated as the learning system, when in reality it is only an artifact of one.
If you want people to read postmortems, and more importantly to act on them, you need to change what you optimize for.
This post is about doing that.
Most postmortem templates are designed to answer the question: "Did we fill in all the sections?"
Very few are designed to answer: "What will someone learn from this three months from now?"
As a result, they become long, defensive, and oddly sterile. They capture everything, but highlight nothing. They explain the incident, but do not make the next one less likely.
People do not skip postmortems because they are lazy. They skip them because the signal-to-noise ratio is low and the cost of reading is high.
If you want postmortems to be read, every section needs to earn its place.
Before you start writing, answer two questions. You only need two.

1. Who will read this?
2. What decision or behavior should change because they read it?

That is it.
Many templates add a third question like "What would I want to see if I were reading this later?" which sounds helpful but usually overlaps with the first two. If you know the audience and the decision, the rest follows.
A postmortem written for oncall engineers next quarter looks very different from one written for leadership assessing systemic risk. Trying to satisfy everyone is how you end up satisfying no one.
If you cannot name the reader and the outcome, do not write the document yet.
If someone reads only one thing, it should be the TLDR.
Not a teaser. Not a vague summary. A real compression of the incident and the learning.
Use this structure. Keep it short.
What happened
One or two sentences describing the failure mode and scope.
Impact
Concrete user impact, duration, and severity. Numbers if you have them.
Key contributing factors
Three to five bullets. These should already hint at patterns, not just incident-specific details.
What changes
The specific actions or decisions that will be different because of this incident.
If you cannot write a TLDR like this, the rest of the postmortem will not save you.
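One way to keep yourself honest about the structure is a mechanical check. This is a hypothetical sketch, not a real tool: the section keys and the 3–5 bullet limit mirror the template above, and the limits are illustrative assumptions.

```python
# Hypothetical sketch: check that a TLDR has the four sections above.
# Section names and limits are assumptions mirroring the template, not a standard.

REQUIRED_SECTIONS = [
    "what_happened",         # one or two sentences: failure mode and scope
    "impact",                # concrete user impact, duration, severity
    "contributing_factors",  # three to five bullets hinting at patterns
    "what_changes",          # specific actions or decisions that change
]

def tldr_problems(tldr: dict) -> list[str]:
    """Return a list of problems; an empty list means the TLDR passes."""
    problems = [f"missing section: {s}" for s in REQUIRED_SECTIONS if s not in tldr]
    factors = tldr.get("contributing_factors", [])
    if isinstance(factors, list) and not (3 <= len(factors) <= 5):
        problems.append("contributing_factors should be 3-5 bullets")
    return problems
```

If `tldr_problems` comes back non-empty, you are not done compressing.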
Narrative is not the enemy. Humans make sense of systems through stories.
But the primary job of a postmortem is not storytelling. It is forensic clarity.
That means precision before prose.
A good timeline answers three questions quickly: what happened, when, and what we knew at the time.
Use a table format with four columns: time, event, what we knew at the time, and the decision made.
This last column matters more than most teams realize. Decisions under uncertainty are where learning lives. Not in hindsight explanations.
If you want narrative, add a short synthesis section after the timeline that explains how these signals interacted. Do not bury the facts inside the story.
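If it helps to make the table concrete, here is a minimal sketch that renders timeline entries into that four-column layout. The column names are an assumption drawn from the text above; adapt them to your own template.

```python
# Sketch: render timeline entries into the four-column table described above.
# Column names are assumptions based on the surrounding text.
COLUMNS = ["Time", "Event", "What we knew", "Decision made"]

def render_timeline(entries: list[dict]) -> str:
    """entries: dicts with keys 'time', 'event', 'knew', 'decision'."""
    header = "| " + " | ".join(COLUMNS) + " |"
    rule = "|" + "|".join("---" for _ in COLUMNS) + "|"
    rows = [
        f"| {e['time']} | {e['event']} | {e['knew']} | {e['decision']} |"
        for e in entries
    ]
    return "\n".join([header, rule, *rows])
```

The point of forcing a "decision made" cell for every row is that it makes gaps visible: if you cannot fill it in, you have found a moment worth discussing.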
Language matters.
Talking about "the real cause" subtly reinforces the idea that incidents have a single underlying failure waiting to be uncovered. In complex systems, that mental model breaks down quickly.
What you want instead is a map of contributing factors.
Drop the RCA acronym entirely if you can. It carries too much historical baggage.
Ask: "What conditions made this incident possible, harder to detect, or harder to recover from?"
Then group the answers by those three conditions: factors that made the incident possible, factors that made it harder to detect, and factors that made it harder to recover from.
This framing makes it easier to see patterns across incidents later. It also reduces the gravitational pull toward blaming individuals.
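A small sketch of what that grouping looks like as data. The category names come straight from the question above; the function is hypothetical, not part of any tool.

```python
# Sketch: group contributing factors by the three conditions named above.
# Category keys come from the question in the text; the code is illustrative.
from collections import defaultdict

CATEGORIES = ("made_possible", "harder_to_detect", "harder_to_recover")

def group_factors(factors: list[tuple[str, str]]) -> dict[str, list[str]]:
    """factors: (category, description) pairs; unknown categories raise."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for category, description in factors:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        grouped[category].append(description)
    return dict(grouped)
```

Keeping categories as a fixed vocabulary is what makes cross-incident pattern spotting possible later: two incidents with the same "harder_to_detect" factor is a signal a single narrative would hide.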
5 Whys is not the problem. Stopping early is.
The moment you land on "an engineer made a mistake," you are not done. You have reached a symptom.
Bad stopping point:
"The outage happened because an engineer misconfigured the cache."
Dig again:
Why was a risky change possible during peak traffic?
Why was there no guardrail or staging signal that would catch it?
Why did the system rely on tribal knowledge for a high-impact operation?
Now you are learning about system design, not individual performance.
Structure your 5 Whys with categories rather than a single chain. The three questions above suggest a natural grouping: process, guardrails and tooling, and knowledge.
You do not need to go five levels every time. You do need to go far enough that the answer no longer names a person.
Action items are not proof of learning. Follow-through is.
Fewer action items is usually better. Each one should be specific, owned by a named person, and tracked to completion.
If an action item does not change a future decision or constraint, question why it exists.
And if you are not going to track it, be honest.
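One lightweight way to make tracking honest is to model action items with an owner and a due date, and flag the stale ones instead of letting them quietly evaporate. A minimal sketch, with hypothetical field names:

```python
# Sketch: track action items with an owner and due date; flag stale ones.
# Field names are hypothetical, not from any particular tracker.
from dataclasses import dataclass
from datetime import date

@dataclass
class ActionItem:
    description: str
    owner: str   # a named person, not a team alias
    due: date
    done: bool = False

    def is_stale(self, today: date) -> bool:
        """Overdue and not done: surface it rather than let it evaporate."""
        return not self.done and today > self.due

def stale_items(items: list[ActionItem], today: date) -> list[ActionItem]:
    return [item for item in items if item.is_stale(today)]
```

Reviewing the output of something like `stale_items` in a recurring meeting is the follow-up loop; without that loop, the data structure changes nothing.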
Which leads to the hardest advice in this post.
A postmortem with no follow-up loop is theater.
If action items are not reviewed later, if patterns are not revisited, if ownership quietly evaporates, the document becomes worse than useless. It creates the illusion of learning.
In that case, skip the postmortem and invest the time elsewhere.
The goal is not documentation. The goal is reduced recurrence.
Humans are good at judgment, synthesis, and sense-making. They are bad at exhaustive reconstruction and long-term memory.
Where possible, let systems handle the rest: reconstructing timelines, clustering contributing factors across incidents, and tracking whether actions actually reduce recurrence.
This is the gap most teams feel but cannot quite name. The learning dies not because people stop caring, but because memory decays and context disappears.
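To make "exhaustive reconstruction" concrete: pulling a raw timeline out of an incident channel is exactly the kind of tedious work a script does better than a human. A sketch, assuming a made-up `[HH:MM] name: text` chat-export format:

```python
# Sketch: reconstruct a raw timeline from a chat export.
# The "[HH:MM] name: text" log format is a made-up example.
import re

LINE = re.compile(r"^\[(\d{2}:\d{2})\]\s+(\w+):\s+(.*)$")

def extract_timeline(chat_log: str) -> list[dict]:
    """Pull (time, author, message) entries out of a '[HH:MM] name: text' log."""
    entries = []
    for line in chat_log.splitlines():
        m = LINE.match(line.strip())
        if m:
            entries.append({"time": m.group(1), "author": m.group(2), "message": m.group(3)})
    return entries
```

The human job is then curation and synthesis, which is where judgment actually adds value.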
COEhub treats postmortems as inputs to a learning system: it automatically extracts timelines, clusters contributing factors across incidents, and tracks whether actions actually reduce recurrence, instead of letting each document die in isolation.
That is the difference.
Before publishing a postmortem, check this list:

- Can you name the reader and the decision it should change?
- Does the TLDR stand on its own?
- Do the contributing factors describe conditions rather than individuals?
- Does every action item have an owner and a date?
- Will someone actually review the action items later?
If the answer to most of these is no, the document will not be read. And even if it is, it will not matter.
The question is not whether your team writes postmortems.
It is whether your system remembers.