Most engineering organizations already "do RCA." They write postmortems. They assign action items. They run blameless reviews. They even invest heavily in observability and incident tooling.
And yet, the same organizations still experience recurring incidents, repeat classes of failure, and "surprise" outages that feel predictable in retrospect.
If you are a VP of Engineering, the hard question is not whether your teams know how to write postmortems.
It is this:
Why do organizations keep failing at incident learning despite decades of advice and a mountain of documents?
The answer is uncomfortable but straightforward:
We keep treating a single-incident document as the learning system. A document is evidence. It is not memory. It is not pattern recognition. It is not reinforcement.
In complex socio-technical systems, failure is rarely a clean chain. It is an accumulation of conditions, trade-offs, local optimizations, latent coupling, and "normal" adaptations that become unsafe only in combination. Richard Cook's classic framing makes this point sharply: complex systems operate in degraded modes and contain changing mixtures of latent failure, so "root cause" closure is often misleading.
This is not academic nitpicking. It explains why incident writeups are so often satisfying and so often ineffective.
A single incident can be "explained" in a way that is locally coherent, but the real learning lives in the intersections: across incidents, across teams, and across time.
Jens Rasmussen's work on risk in dynamic systems describes how pressure, efficiency incentives, and local adaptation tend to push systems toward boundaries over time. Drift is normal. It is not a moral failure. It is a property of operating competitive systems.
Sidney Dekker's "drift into failure" line of thinking extends this: organizations do not suddenly become unsafe. They become unsafe gradually, through reasonable adaptations, until the environment exposes the coupling.
A single postmortem, written after the fact, is simply not shaped to detect drift.
Even when teams act in good faith, post-incident narratives are vulnerable to predictable bias.
This is why "write better postmortems" hits a ceiling. You are fighting human cognition and organizational incentives with a template.
Google's own SRE guidance calls out a failure mode many leaders recognize: without formal tracking, postmortem action items get forgotten, and outages recur.
But here is the deeper point:
Action items are not learning unless they accumulate into shared capability.
Many organizations can get better at tracking tasks. Far fewer can answer questions like: Have we seen this class of failure before? What did we learn the last time, and did it hold? Which systems share the same conditions today?
Those are memory questions. Not documentation questions.
The cost and frequency of modern outages make "RCA theater" unaffordable.
New Relic's 2025 Observability Forecast reports high-impact outages with a median cost of $2M per hour, about $33,333 per minute, and a median annual cost of $76M for surveyed businesses.
When downtime is that expensive, the goal of RCA cannot be "a completed document." The goal must be systematic reduction in recurrence and time-to-recovery via retained learning.
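To put those medians in perspective, here is a back-of-the-envelope calculation. It assumes, purely for illustration, that the per-hour and annual figures describe the same organization; in the survey they are separate medians.

```python
# Illustrative downtime economics using the survey medians quoted above.
# These are survey medians, not figures for any single organization.
COST_PER_HOUR = 2_000_000        # median cost of a high-impact outage, USD per hour
ANNUAL_OUTAGE_COST = 76_000_000  # median annual outage cost, USD

cost_per_minute = COST_PER_HOUR / 60                       # ~= $33,333
implied_outage_hours = ANNUAL_OUTAGE_COST / COST_PER_HOUR  # ~= 38 hours per year

print(f"Cost per minute of high-impact outage: ${cost_per_minute:,.0f}")
print(f"Implied hours of high-impact outage per year: {implied_outage_hours:.0f}")
```

Under that simplifying assumption, the median respondent absorbs roughly a work week of high-impact downtime per year. Reducing recurrence is what moves that number, not polishing documents.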
The future of RCA is not a more elaborate report. It is a change in what we think RCA is for.
The unit of learning is the pattern across incidents.
One incident might tell you "Redis timed out." Three incidents across six months might tell you that the timeouts cluster around the same shared dependency, the same deploy window, or the same capacity assumption that quietly stopped being true.
A single postmortem will not reliably surface that. A learning system will.
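Here is a minimal sketch of what "surfacing that" means in practice. The incident IDs, headlines, and condition tags are hypothetical; the point is that the family only becomes visible when incidents are compared on their contributing conditions rather than their headlines.

```python
from collections import defaultdict

# Hypothetical incidents: each headline looks different, but the recorded
# contributing conditions overlap. The tags are illustrative, not a taxonomy.
incidents = [
    {"id": "INC-101", "headline": "Redis timed out",
     "conditions": {"shared-connection-pool", "retry-amplification", "deploy-window"}},
    {"id": "INC-142", "headline": "Checkout latency spike",
     "conditions": {"shared-connection-pool", "cache-cold-start"}},
    {"id": "INC-188", "headline": "Elevated 5xx after release",
     "conditions": {"deploy-window", "retry-amplification", "shared-connection-pool"}},
]

# Group incidents by each contributing condition they recorded.
by_condition = defaultdict(list)
for incident in incidents:
    for condition in incident["conditions"]:
        by_condition[condition].append(incident["id"])

# Conditions seen in more than one incident are candidate failure families.
families = {cond: ids for cond, ids in by_condition.items() if len(ids) > 1}
for cond, ids in sorted(families.items()):
    print(f"{cond}: {', '.join(ids)}")
# "shared-connection-pool" appears in all three: a pattern no single
# postmortem was obliged to notice.
```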
To make incident learning real, you need a system that does four jobs consistently:
1. Capture (high-fidelity, low-friction)
Not just a summary. The actual signals and decisions: timelines, chat, calls, deploys, alerts, "what we believed at the time," not just what was true after the fact.
2. Selection (what is worth remembering)
Humans cannot reread an archive. A learning system must decide what becomes durable knowledge.
3. Synthesis (cross-incident pattern recognition)
Cluster related incidents. Surface recurring conditions. Track drift. Build "families" of failures.
4. Reinforcement (make learning reappear at the point of need)
If knowledge does not get retrieved during design, on-call, and change review, it decays. Reinforcement means the system pushes relevant prior learning into the workflows where decisions recur.
This is where most programs fail. They stop at capture.
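To make the gap concrete, here is a minimal sketch of the four jobs as a pipeline, in hypothetical Python types. The names, fields, and heuristics are illustrative, not a reference design. A postmortem archive implements roughly the first stage; the value accumulates in the later ones.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentRecord:
    """Job 1, capture: what responders saw and believed, not just a summary."""
    incident_id: str
    timeline: list[str]             # alerts, deploys, chat excerpts, key decisions
    beliefs_at_the_time: list[str]  # hypotheses held during response, right or wrong
    services: set[str] = field(default_factory=set)
    conditions: set[str] = field(default_factory=set)  # contributing conditions noted in review


@dataclass
class Learning:
    """Jobs 2 and 3, selection and synthesis: durable, cross-incident knowledge."""
    summary: str
    source_incidents: list[str]
    applies_to_services: set[str]


def select(record: IncidentRecord) -> list[Learning]:
    """Job 2: decide what from this incident is worth remembering at all.
    Placeholder heuristic: keep each recorded contributing condition."""
    return [
        Learning(summary=condition,
                 source_incidents=[record.incident_id],
                 applies_to_services=set(record.services))
        for condition in record.conditions
    ]


def synthesize(learnings: list[Learning]) -> list[Learning]:
    """Job 3: merge learnings that recur across incidents into families."""
    merged: dict[str, Learning] = {}
    for learning in learnings:
        if learning.summary in merged:
            merged[learning.summary].source_incidents += learning.source_incidents
            merged[learning.summary].applies_to_services |= learning.applies_to_services
        else:
            merged[learning.summary] = learning
    return list(merged.values())


def reinforce(memory: list[Learning], services_touched: set[str]) -> list[Learning]:
    """Job 4: surface relevant prior learning at the point of decision,
    e.g. when a change review touches one of the affected services."""
    return [learning for learning in memory
            if learning.applies_to_services & services_touched]
```

The stage that matters most is the last one: reinforcement is retrieval keyed to the decision at hand (for example, the services a change touches), not another document to go read.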
"AI for RCA" is often marketed as automated root-cause guessing. That is the wrong target.
In complex systems, the value of AI is not replacing human judgment. It is improving the four jobs above: assembling high-fidelity timelines from chat, calls, deploys, and alerts; helping decide what is worth keeping; clustering related incidents into families; and surfacing relevant prior learning during design, on-call, and change review.
Done well, AI reduces the cost of high-fidelity capture and increases the odds of reuse.
Done poorly, it produces confident stories that amplify hindsight bias.
A good system should treat AI outputs as prompts for investigation, retrieval aids, and synthesis suggestions, not as "the answer."
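One way that stance can show up in tooling, sketched with hypothetical names: AI output is stored as a reviewable suggestion with its cited evidence, and nothing becomes durable memory until a human has looked at it.

```python
from dataclasses import dataclass


@dataclass
class AISuggestion:
    """An AI-generated observation about an incident: a prompt for investigation,
    not a conclusion."""
    text: str                  # e.g. "resembles INC-101: retry amplification"
    kind: str                  # "related-incident", "candidate-condition", "summary"
    cited_evidence: list[str]  # links or quotes the model pointed to, for human checking
    reviewed: bool = False     # a responder has examined the evidence
    accepted: bool = False     # the responder judged it accurate and useful


def promote_to_memory(suggestion: AISuggestion) -> AISuggestion:
    """Gate: only reviewed and accepted suggestions are written into durable memory."""
    if not (suggestion.reviewed and suggestion.accepted):
        raise ValueError("AI output is a hypothesis; review it before it becomes memory")
    return suggestion
```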
If you want to evaluate whether your incident program is a learning system or just documentation, ask these questions:
Level 1: Do we write postmortems for significant incidents? (Useful, but common.)
Level 2: Do action items get tracked to completion? Google warns that without formal tracking, action items are forgotten. (Better, still not learning.)
Level 3: Can we see patterns and recurring conditions across incidents? (This is where recurrence drops.)
Level 4: Does prior learning reappear during design, on-call, and change review? (This is where resilience compounds.)
Most organizations stall at Level 2 because Levels 3 and 4 require infrastructure, not just discipline.
If you are buying or building for the future of RCA, the requirements are not "a nicer postmortem editor."
You want capture that preserves what responders saw and believed at the time, selection that turns a growing archive into durable knowledge, synthesis that connects incidents into families and tracks drift, and reinforcement that puts prior learning in front of design, on-call, and change review.
That is a learning system.
COEhub is valuable if, and only if, it operationalizes the structural shift above.
The clean mapping is to the four jobs above: capture, selection, synthesis, and reinforcement, operated as a system rather than produced as a per-incident document.
If your organization is already strong at writing postmortems, COEhub is not competing with that. It is competing with the thing your postmortems cannot do on their own:
build organizational memory that compounds.
Downtime economics are rising, systems are becoming more coupled, and human attention is a hard constraint.
So the future of root cause analysis is not better writing.
It is a single strategic question for engineering leadership:
Do we have a system that turns incident experience into reusable memory, at scale, across time?
If the answer is no, recurring incidents are not a surprise. They are the expected outcome.