Reflections on incidents, shame, organizational memory, and why resilience is a system property, not a report.
Most incident write-ups focus on what failed.
Nick Rockwell's interview focuses on something harder to talk about: why problems remain unresolved long after we understand them.
The gap between knowing and fixing is where resilience actually lives or dies.
After reading the transcript, what stood out was not a novel framework or a new reliability pattern. It was a repeated theme across deeply human and deeply technical stories: the distance between understanding a problem and actually resolving it.
This post is a reflection on that gap, and what it means for principal engineers who operate inside it every day.
Nick frames resilience as an operating system, not a document. Incidents are inputs. Learning is the output. The system in between determines whether the same failure returns six months later under a different name.
Many teams produce excellent postmortems and still repeat incidents. That is not a failure of analysis. It is a failure of follow-through, memory, and ownership over time.
This is what Nick repeatedly refers to as the long loop problem.
A short loop is straightforward: something breaks, someone responds, the fix ships, and the loop closes within hours or days.
A long loop spans weeks or months: an incident exposes a systemic risk, a postmortem names it, and closing it requires sustained follow-through long after the pager goes quiet.
Most engineering organizations are optimized for short loops. Resilience lives in the long ones.
One of the most striking moments in the interview is Nick describing a six to eight month delay at Sonic.net where a known problem went unaddressed. Not because it was technically difficult, but because he felt emotionally unable to surface it.
This is not incompetence. This is shame.
For senior engineers, shame often emerges from responsibility rather than failure. You know the system. You feel accountable. The longer the issue sits unresolved, the harder it becomes to admit it exists.
This creates a vicious cycle: the longer the issue sits, the more shameful it feels to raise, and the more shameful it feels, the longer it sits.
From the outside, this looks like negligence. From the inside, it feels like paralysis.
This is why purely procedural fixes fail. You cannot checklist your way out of emotional blockage. Leaders who want resilience must actively design systems that make it safer to surface long-standing risks than to silently carry them.
That means rewarding the surfacing of old risks instead of punishing it, separating the age of a problem from the judgment of the person who reports it, and treating disclosure as a contribution rather than a confession.
Resilience begins where shame loses its power.
The interview goes deep on an organizational design question many companies get wrong by treating it as ideological.
Should you have a centralized reliability team, or distribute incident ownership across product teams?
Nick's answer is pragmatic and stage-dependent.
Early in a company's life, distributed ownership often works better. Context is local. Systems are smaller. The cost of coordination is higher than the cost of duplication.
As systems scale, the failure modes change: incidents start repeating across team boundaries, response quality becomes uneven, and lessons learned in one corner of the organization never reach the others.
At that point, some degree of centralization becomes necessary. Not to take ownership away, but to provide continuity, pattern recognition, and institutional memory.
The failure mode to avoid is absolutism. Purely centralized teams lose context. Fully distributed models lose coherence.
The right answer evolves with the company. Principal engineers should treat this as a design decision that needs revisiting, not a one-time org chart choice.
The podcast explicitly addresses the scenario every senior engineer recognizes: one person who knows the whole system.
Nick's insight is uncomfortable because it is obvious.
The solutions are well known: documentation, pairing, rotation, deliberate cross-training.
What blocks them is scarcity.
When a system is under constant pressure, investing in redundancy feels like a luxury. The same person becomes the bottleneck because they are the fastest path through the crisis.
Over time, this creates a self-reinforcing trap: the expert handles every crisis, so no one else builds the context to handle the next one, and the organization grows more dependent on the person it can least afford to lose.
Resilience requires treating knowledge concentration as technical debt. Not metaphorically, but operationally. If you do not allocate time to pay it down, it will compound until an incident forces the issue.
Nick describes a clear progression that most companies go through, whether intentionally or not.
In the first stage, individuals save the day. Systems are opaque. Success is personal.
In the second, patterns start to repeat. Coordination costs rise. Blame quietly reappears.
In the third, incidents are inputs to a system. Learning is institutionalized. Memory outlives individuals.
Your advice as a principal engineer should change depending on where the organization sits.
A Series B company needs help moving beyond heroics without killing momentum. A public company needs help preventing process from becoming performative.
Resilience failures often happen when leaders apply advice from the wrong stage.
Nick references a global outage that caused real emotional pain across the company. What matters is not the specific trigger, but the aftermath.
The technical cause was understood quickly. The emotional residue lasted far longer.
This is where many postmortems quietly fail. They document the fix but ignore the organizational impact. Shame does not disappear when the service recovers. It lingers in future decision-making, risk tolerance, and silence.
If resilience is about learning, then emotional aftermath is part of the system. Ignoring it is equivalent to ignoring a degraded dependency.
Blamelessness does not mean avoiding uncomfortable facts. It means anchoring discussions in evidence rather than intent.
A sharply factual review reconstructs what happened using logs, traces, alerts, and timelines. It distinguishes between what the system did and what people believed the system would do. Motivation is discussed only in the context of decision-making constraints, not moral judgment.
The hard cases are when the facts implicate someone's judgment.
For example: an engineer silenced a noisy alert that later mattered, or shipped a risky change under rollout pressure without waiting for review.
A truth-less culture pretends this was "just a system failure." A blame-full culture turns it into a character indictment.
A blameless but truthful review asks different questions: What did the engineer know at the time? What pressures and incentives shaped the decision? What would have had to be true for the safer choice to be the obvious one?
The goal is not to absolve or condemn. It is to understand why a reasonable engineer, in that context, made a decision that later proved harmful. Often the answer implicates incentives, alert quality, rollout pressure, or missing safety rails rather than individual recklessness.
When teams get this right, accountability increases rather than decreases. Engineers become more willing to surface near-misses and judgment calls because they trust the system will examine causes, not assign shame.
A useful rule of thumb for principal engineers is this: every incident has at least three causes worth naming, a proximate cause, a root cause, and a structural cause, and an investigation is not done until all three are on the table.
For example: a bad config push takes the service down (proximate), because the deploy pipeline never validated configs (root), because no team owns the pipeline and nothing forces its gaps into an architecture review (structural).
Fixing the proximate cause restores service. Fixing the root cause reduces recurrence. Addressing the structural cause requires changing ownership, incentives, or architecture review norms.
This is where many investigations quietly stop. Not because the answers are unknown, but because the next step is uncomfortable, slow, or politically expensive.
Psychological safety matters here because deep investigation often surfaces systemic issues that no single team "owns." Without safety, teams optimize for closure speed rather than learning depth, and the long loop remains open.
Nick describes a simple but powerful practice: a recurring review that surfaces every open thread, regardless of age.
Not a status meeting. A memory check.
What matters is not the cadence but the invariants: nothing ages out silently, every open thread has a named owner, and "still unresolved" is said out loud rather than assumed away.
This is how long loops close.
A real memory system needs more than action items.
At minimum, it must track the originating incident, the current owner, the risk's status (open, mitigated, deferred, or accepted), when it was last reviewed, and which systems it touches.
Pattern detection emerges from metadata, not narrative prose. Principal engineers should think of this as an internal reliability database, not a wiki.
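To make that concrete, here is a minimal sketch of what one record in such a database might look like. The field names and status values are illustrative assumptions, not a prescribed schema; the point is that the record carries structured metadata a review can query, rather than narrative prose.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum


class RiskStatus(Enum):
    """Illustrative states; 'partially closed' is first-class, not a footnote."""
    OPEN = "open"
    MITIGATED = "mitigated"      # partially addressed, residual risk remains
    DEFERRED = "deferred"        # consciously postponed, with a stated reason
    ACCEPTED = "accepted"        # risk acknowledged and owned, not fixed
    CLOSED = "closed"


@dataclass
class RiskThread:
    """One long-lived risk thread, queryable across months and reorganizations."""
    title: str
    originating_incident: str            # e.g. a postmortem or incident identifier
    owner: str                           # a named person or team, never blank
    status: RiskStatus
    opened_on: date
    last_reviewed_on: date
    affected_systems: list[str] = field(default_factory=list)
    deferral_reason: str | None = None   # expected when status is DEFERRED

    def is_stale(self, today: date, max_age_days: int = 14) -> bool:
        """True if the thread has not been looked at within the review cadence."""
        return (today - self.last_reviewed_on).days > max_age_days
```

Because owner, status, and review dates are structured fields rather than paragraphs, questions like "which systems keep reappearing" or "which threads keep getting deferred" become queries instead of archaeology.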
That gap between what a memory system needs and what existing tools provide is not accidental. Most engineering tooling optimizes for resolution speed, not memory. Issue trackers close tickets. Postmortems get filed. Wikis grow quietly stale. None of them are designed to keep long-lived risk threads visible, revisitable, and accountable across months or team boundaries.
This is the problem that led us to build COEhub. Not as another incident reporting tool, but as a system for tracking unresolved learning over time. A place where risk threads survive reorganizations, ownership changes, and attention shifts, and where "partially closed" is a first-class state rather than a polite fiction.
If the long loop is where resilience actually fails, then it deserves infrastructure of its own.
If there is one diagnostic question to take away, it is this:
When a risk survives multiple incidents, do you know why?
If the answer is unclear, the loop is still open.
Resilience is not whether the system failed.
It is whether the system remembers, acts, and changes before it fails again.
That is a design problem.
And it is one we can choose to solve.
If you recognize your organization in this post, you do not need a full transformation to begin closing long loops. Three concrete starting points are often enough.
First, pick a cadence. Biweekly works well. Review every open risk thread regardless of age, and treat staleness as a signal, not a failure; a sketch of what that review might surface follows the three points.
Second, stop treating "action item created" as equivalent to "risk addressed." Track what is mitigated, what is deferred, and why.
Third, explicitly reward engineers who bring forward long-standing issues, especially ones they feel personally responsible for. The goal is to break the shame spiral before it hardens.
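As an illustration of the first starting point, here is a sketch of what the recurring review could surface, building on the RiskThread record sketched earlier. The cadence value and the notion of "staleness" are assumptions to adapt, not a standard; the only invariant is that every open thread appears, regardless of age.

```python
from datetime import date


def review_agenda(threads: list[RiskThread], today: date, cadence_days: int = 14):
    """Every thread that is not fully closed, oldest first, with staleness flagged."""
    open_threads = [t for t in threads if t.status is not RiskStatus.CLOSED]
    open_threads.sort(key=lambda t: t.opened_on)   # the oldest risks surface first
    return [(t, t.is_stale(today, cadence_days)) for t in open_threads]
```

Staleness here is a prompt for discussion, not a metric for punishment; the review exists to keep the thread visible, not to shame its owner.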
Resilience does not come from reacting faster to the next incident. It comes from refusing to forget the last one.
That is a system design choice.