The Myth of Root Cause

Why complex systems do not fail the way our postmortems assume they do

The question arrives early in every postmortem: what was the root cause?

It is usually asked with good intent. Closure. Accountability. A desire to move forward.

It is also the moment where learning most often stops.

Not because engineers lack rigor, or because teams are careless, but because the question itself assumes a model of failure that does not match how modern systems actually break. In complex software systems, incidents rarely have a single root cause. They emerge from the interaction of multiple technical, organizational, and human factors that only become visible in hindsight.

This is well established in safety science. And yet, in practice, organizations still ask for a single cause, still reward tidy explanations, and still structure postmortems to converge on one answer.

That tension is the real problem. And it is why the same classes of incidents keep returning even after the "root cause" was supposedly fixed.

Why root cause thinking persists

Before arguing against root causes, it is worth understanding why they are so durable inside otherwise sophisticated organizations.

Root cause analysis persists not because it works well, but because it satisfies powerful structural forces.

Cognitive closure

Humans are uncomfortable with open-ended explanations. A single cause provides psychological relief. It allows inquiry to stop. Multiple interacting factors feel unsatisfying, even when they are more accurate.

Organizational pressure to declare victory

Incidents are costly, politically and emotionally. Leadership wants to know when the problem is "done." A root cause creates a clear stopping point. A set of contributing factors implies ongoing work, tradeoffs, and uncertainty.

Legal and regulatory incentives

Many compliance and audit frameworks still expect a primary cause. Organizations internalize this expectation and shape their learning processes to produce one, even when reality resists simplification.

Cultural preference for tidy stories

Postmortems are not just learning tools. They are narratives that travel upward and outward. "A bad deploy caused the outage" fits in a slide. "Six latent conditions aligned over months" does not.

None of this means teams are naive. It means they are operating inside systems that reward certainty over accuracy.

How complex systems actually fail

In complex sociotechnical systems, failures are not caused. They are constructed.

By "constructed," we mean this in a very practical sense: normal decisions, reasonable local optimizations, and adaptive workarounds gradually shape a system into a failure-prone state long before any single triggering event occurs.

This framing comes directly from decades of work in resilience engineering and systems safety by researchers like Richard Cook, Sidney Dekker, and Nancy Leveson.

The uncomfortable implication is that nothing needs to go obviously wrong for a system to fail.

Instead, you see patterns like safeguards that erode quietly, workarounds that harden into standard practice, alerts that everyone has learned to ignore, and margins that shrink without anyone deciding to shrink them.

When an incident finally surfaces, the last visible action is treated as the cause. A deploy. A configuration change. A human decision.

That action matters. But treating it as the explanation mistakes the trigger for the terrain.

A concrete example: GitLab's 2017 database incident

GitLab's 2017 outage, involving the accidental deletion of production data, is a useful illustration.

A root cause framing says: an engineer, intending to clear a directory on a replication secondary, deleted the data directory on the production primary instead.

That statement is accurate. It is also nearly useless.

A contributing-factors analysis tells a different story:

  1. Replication to the secondary had broken earlier that day under heavy load, and the engineer was working late into the night trying to restore it
  2. The primary and secondary database hosts had nearly identical names, making it easy to run a destructive command on the wrong machine
  3. Several backup mechanisms had been failing silently for some time, including database dumps that produced nothing because of a version mismatch between tooling and servers
  4. The notifications that should have flagged those backup failures were being rejected and never reached the team
  5. The restore procedure had never been exercised end to end, so recovery had to be improvised from a snapshot taken hours earlier

None of these factors alone caused the outage. Removing any one of them might have prevented it. Together, they created a system primed for failure.

The learning value lies in the pattern, not the mistake.

Root cause framing erases that pattern.

When proximate causes still matter

Rejecting root cause thinking does not mean pretending proximate causes are irrelevant.

Sometimes you do need to know what broke.

If a binary crashes, a certificate expires, or a feature flag flips incorrectly, identifying and fixing that issue is operationally necessary. Teams must restore service, reduce blast radius, and stabilize the system.

The mistake is treating the proximate cause as the explanation.

Proximate causes are useful for recovery.
They are insufficient for learning.

When organizations stop at the fix, they stabilize the surface while leaving the deeper system unchanged. The incident does not repeat in the same form. It returns in a new one.

What a contributing-factors lens changes

Shifting from root causes to contributing factors changes both the questions teams ask and the answers they accept.

Instead of asking what caused the incident, teams ask what conditions made it possible.

They examine:

  1. What pressures and tradeoffs shaped the decisions made long before the incident
  2. Why the triggering action seemed reasonable to the people involved at the time
  3. Which safeguards had quietly degraded, and when that degradation became normal
  4. How information flowed, or failed to flow, as the incident unfolded
  5. Which of these conditions have shown up in earlier incidents

These questions are harder. They resist clean endings. They often implicate structure rather than individuals.

They also produce learning that accumulates across incidents instead of resetting every time.

Where manual analysis breaks down

In practice, serious incidents generate overwhelming amounts of data.

Chat transcripts. Alert storms. Metrics across dozens of services. Partial timelines reconstructed after the fact.

Surfacing five to ten meaningful contributing factors across that material is cognitively demanding. Doing it consistently across incidents is harder still.

This is not a lack-of-discipline problem. It is a scale problem.

Humans are good at reasoning. They are not good at exhaustively scanning history, correlating weak signals, or maintaining institutional memory over long time horizons.

When organizations rely entirely on manual postmortems, two things happen:

  1. Analysis narrows to what is easiest to see
  2. Recurring factors quietly repeat without being recognized as patterns

This is how teams end up with thoughtful postmortems and stagnant learning curves.

From analysis to institutional memory

If incidents are teachers, learning requires memory.

Not just documents stored somewhere, but knowledge that can be retrieved, compared, and reinforced over time.

A contributing-factors approach pays off only when teams can:

  1. Retrieve past analyses without archaeology
  2. Compare factors across incidents rather than treating each incident as unique
  3. Notice when the same conditions keep reappearing

This is where tooling can support, rather than replace, human judgment. Not by automating conclusions, but by lowering the cost of doing the right analysis repeatedly.
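
As one minimal sketch of what that support can look like: suppose each postmortem records its contributing factors as short tags in a small JSON file. The postmortems/ directory, the file format, and the tag names below are illustrative assumptions, not an established convention, but a few dozen lines of Python are enough to surface the factors that keep reappearing.

```python
#!/usr/bin/env python3
"""Count how often contributing-factor tags recur across postmortems.

Assumes a hypothetical layout where each postmortem is a JSON file like:

    postmortems/2024-03-payment-outage.json
    {"id": "2024-03-payment-outage",
     "factors": ["silent-backup-failure", "alert-fatigue", "stale-runbook"]}
"""

import json
from collections import Counter
from pathlib import Path

# Hypothetical location of the team's postmortem records.
POSTMORTEM_DIR = Path("postmortems")


def load_factor_counts(directory: Path) -> tuple[Counter, int]:
    """Return (factor -> number of incidents it appeared in, total incidents)."""
    counts: Counter = Counter()
    incidents = 0
    for path in directory.glob("*.json"):
        record = json.loads(path.read_text())
        incidents += 1
        # Use a set so a factor listed twice in one incident counts once.
        counts.update(set(record.get("factors", [])))
    return counts, incidents


def main() -> None:
    counts, incidents = load_factor_counts(POSTMORTEM_DIR)
    print(f"{incidents} incidents analysed")
    print("Contributing factors seen in more than one incident:")
    for factor, count in counts.most_common():
        if count > 1:
            print(f"  {count:3d}x  {factor}")


if __name__ == "__main__":
    main()
```

Nothing in this sketch draws conclusions. It only makes recurrence visible, which is the part humans struggle to sustain, and it shifts the closing question of a postmortem from "what was the cause?" to "have we seen these conditions before?"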

A concrete next step

If you want one thing to try in your next postmortem, start here:

Before agreeing on a "root cause," require the group to name at least five contributing factors, and ask which of them have appeared before.

If that feels uncomfortable, or slow, or messy, that is the signal you are finally learning from the incident rather than closing it.

Root cause thinking asks for an answer.

Contributing-factors thinking builds memory.

And resilient systems are built by organizations that remember.