Reflections on incidents, shame, organizational memory, and why resilience is a system property, not a report.
Most incident write-ups focus on what failed.
Nick Rockwell's interview focuses on something harder to talk about: why problems remain unresolved long after we understand them.
The gap between knowing and fixing is where resilience actually lives or dies.
After reading the transcript, what stood out was not a novel framework or a new reliability pattern. It was a repeated theme across deeply human and deeply technical stories: the distance between understanding a problem and actually resolving it.
This post is a reflection on that gap, and what it means for principal engineers who operate inside it every day.
Nick frames resilience as an operating system, not a document. Incidents are inputs. Learning is the output. The system in between determines whether the same failure returns six months later under a different name.
Many teams produce excellent postmortems and still repeat incidents. That is not a failure of analysis. It is a failure of follow-through, memory, and ownership over time.
This is what Nick repeatedly refers to as the long loop problem.
A short loop is straightforward: something breaks, someone responds, the fix ships, and the loop closes within hours or days.
A long loop spans weeks or months: an incident exposes a systemic risk, a postmortem names it, and closing it requires sustained follow-through long after the pager goes quiet.
Most engineering organizations are optimized for short loops. Resilience lives in the long ones.
One of the most striking moments in the interview is Nick describing a six to eight month delay at Sonic.net where a known problem went unaddressed. Not because it was technically difficult, but because he felt emotionally unable to surface it.
This is not incompetence. This is shame.
For senior engineers, shame often emerges from responsibility rather than failure. You know the system. You feel accountable. The longer the issue sits unresolved, the harder it becomes to admit it exists.
This creates a vicious cycle: the longer the issue sits, the more shameful it feels to raise, and the more shameful it feels, the longer it sits.
From the outside, this looks like negligence. From the inside, it feels like paralysis.
This is why purely procedural fixes fail. You cannot checklist your way out of emotional blockage. Leaders who want resilience must actively design systems that make it safer to surface long-standing risks than to silently carry them.
That means rewarding the surfacing of old risks instead of punishing it, separating the age of a problem from the judgment of the person who reports it, and treating disclosure as a contribution rather than a confession.
Resilience begins where shame loses its power.
The interview goes deep on an organizational design question many companies get wrong by treating it as ideological.
Should you have a centralized reliability team, or distribute incident ownership across product teams?
Nick's answer is pragmatic and stage-dependent.
Early in a company's life, distributed ownership often works better. Context is local. Systems are smaller. The cost of coordination is higher than the cost of duplication.
As systems scale, the failure modes change: incidents start repeating across team boundaries, response quality becomes uneven, and lessons learned in one corner of the organization never reach the others.
At that point, some degree of centralization becomes necessary. Not to take ownership away, but to provide continuity, pattern recognition, and institutional memory.
The failure mode to avoid is absolutism. Purely centralized teams lose context. Fully distributed models lose coherence.
The right answer evolves with the company. Principal engineers should treat this as a design decision that needs revisiting, not a one-time org chart choice.
The podcast explicitly addresses the scenario every senior engineer recognizes: one person who knows the whole system.
Nick's insight is uncomfortable because it is obvious.
The solutions are well known: documentation, pairing, rotation, deliberate cross-training.
What blocks them is scarcity.
When a system is under constant pressure, investing in redundancy feels like a luxury. The same person becomes the bottleneck because they are the fastest path through the crisis.
Over time, this creates a self-reinforcing trap: the expert handles every crisis, so no one else builds the context to handle the next one, and the organization grows more dependent on the person it can least afford to lose.
Resilience requires treating knowledge concentration as technical debt. Not metaphorically, but operationally. If you do not allocate time to pay it down, it will compound until an incident forces the issue.
Nick describes a clear progression that most companies go through, whether intentionally or not.
In the first stage, individuals save the day. Systems are opaque. Success is personal.
In the second, patterns start to repeat. Coordination costs rise. Blame quietly reappears.
In the third, incidents are inputs to a system. Learning is institutionalized. Memory outlives individuals.
Your advice as a principal engineer should change depending on where the organization sits.
A Series B company needs help moving beyond heroics without killing momentum. A public company needs help preventing process from becoming performative.
Resilience failures often happen when leaders apply advice from the wrong stage.
Nick references a global outage that caused real emotional pain across the company. What matters is not the specific trigger, but the aftermath.
The technical cause was understood quickly. The emotional residue lasted far longer.
This is where many postmortems quietly fail. They document the fix but ignore the organizational impact. Shame does not disappear when the service recovers. It lingers in future decision-making, risk tolerance, and silence.
If resilience is about learning, then emotional aftermath is part of the system. Ignoring it is equivalent to ignoring a degraded dependency.
Blamelessness does not mean avoiding uncomfortable facts. It means anchoring discussions in evidence rather than intent.
A sharply factual review reconstructs what happened using logs, traces, alerts, and timelines. It distinguishes between what the system did and what people believed the system would do. Motivation is discussed only in the context of decision-making constraints, not moral judgment.
The hard cases are when the facts implicate someone's judgment.
For example: an engineer silenced a noisy alert that later mattered, or shipped a risky change under rollout pressure without waiting for review.
A truth-less culture pretends this was "just a system failure." A blame-full culture turns it into a character indictment.
A blameless but truthful review asks different questions: What did the engineer know at the time? What pressures and incentives shaped the decision? What would have had to be true for the safer choice to be the obvious one?
The goal is not to absolve or condemn. It is to understand why a reasonable engineer, in that context, made a decision that later proved harmful. Often the answer implicates incentives, alert quality, rollout pressure, or missing safety rails rather than individual recklessness.
When teams get this right, accountability increases rather than decreases. Engineers become more willing to surface near-misses and judgment calls because they trust the system will examine causes, not assign shame.
A useful rule of thumb for principal engineers is this: every incident has at least three causes worth naming, a proximate cause, a root cause, and a structural cause, and an investigation is not done until all three are on the table.
For example: a bad config push takes the service down (proximate), because the deploy pipeline never validated configs (root), because no team owns the pipeline and nothing forces its gaps into an architecture review (structural).
Fixing the proximate cause restores service. Fixing the root cause reduces recurrence. Addressing the structural cause requires changing ownership, incentives, or architecture review norms.
This is where many investigations quietly stop. Not because the answers are unknown, but because the next step is uncomfortable, slow, or politically expensive.
Psychological safety matters here because deep investigation often surfaces systemic issues that no single team "owns." Without safety, teams optimize for closure speed rather than learning depth, and the long loop remains open.
Nick describes a simple but powerful practice: a recurring review that surfaces every open thread, regardless of age.
Not a status meeting. A memory check.
What matters is not the cadence but the invariants: nothing ages out silently, every open thread has a named owner, and "still unresolved" is said out loud rather than assumed away.
This is how long loops close.
A real memory system needs more than action items.
At minimum, it must track the originating incident, the current owner, the risk's status (open, mitigated, deferred, or accepted), when it was last reviewed, and which systems it touches.
Pattern detection emerges from metadata, not narrative prose. Principal engineers should think of this as an internal reliability database, not a wiki.
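To make that concrete, here is a minimal sketch of what one record in such a database might look like. The field names and status values are illustrative assumptions, not a prescribed schema; the point is that the record carries structured metadata a review can query, rather than narrative prose.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum


class RiskStatus(Enum):
    """Illustrative states; 'partially closed' is first-class, not a footnote."""
    OPEN = "open"
    MITIGATED = "mitigated"      # partially addressed, residual risk remains
    DEFERRED = "deferred"        # consciously postponed, with a stated reason
    ACCEPTED = "accepted"        # risk acknowledged and owned, not fixed
    CLOSED = "closed"


@dataclass
class RiskThread:
    """One long-lived risk thread, queryable across months and reorganizations."""
    title: str
    originating_incident: str            # e.g. a postmortem or incident identifier
    owner: str                           # a named person or team, never blank
    status: RiskStatus
    opened_on: date
    last_reviewed_on: date
    affected_systems: list[str] = field(default_factory=list)
    deferral_reason: str | None = None   # expected when status is DEFERRED

    def is_stale(self, today: date, max_age_days: int = 14) -> bool:
        """True if the thread has not been looked at within the review cadence."""
        return (today - self.last_reviewed_on).days > max_age_days
```

Because owner, status, and review dates are structured fields rather than paragraphs, questions like "which systems keep reappearing" or "which threads keep getting deferred" become queries instead of archaeology.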
That gap between what a memory system needs and what existing tools provide is not accidental. Most engineering tooling optimizes for resolution speed, not memory. Issue trackers close tickets. Postmortems get filed. Wikis grow quietly stale. None of them are designed to keep long-lived risk threads visible, revisitable, and accountable across months or team boundaries.
This is the problem that led us to build COEhub. Not as another incident reporting tool, but as a system for tracking unresolved learning over time. A place where risk threads survive reorganizations, ownership changes, and attention shifts, and where "partially closed" is a first-class state rather than a polite fiction.
If the long loop is where resilience actually fails, then it deserves infrastructure of its own.
If there is one diagnostic question to take away, it is this:
When a risk survives multiple incidents, do you know why?
If the answer is unclear, the loop is still open.
Resilience is not whether the system failed.
It is whether the system remembers, acts, and changes before it fails again.
That is a design problem.
And it is one we can choose to solve.
If you recognize your organization in this post, you do not need a full transformation to begin closing long loops. Three concrete starting points are often enough.
First, pick a cadence. Biweekly works well. Review every open risk thread regardless of age, and treat staleness as a signal, not a failure; a sketch of what that review might surface follows the three points.
Second, stop treating "action item created" as equivalent to "risk addressed." Track what is mitigated, what is deferred, and why.
Third, explicitly reward engineers who bring forward long-standing issues, especially ones they feel personally responsible for. The goal is to break the shame spiral before it hardens.
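As an illustration of the first starting point, here is a sketch of what the recurring review could surface, building on the RiskThread record sketched earlier. The cadence value and the notion of "staleness" are assumptions to adapt, not a standard; the only invariant is that every open thread appears, regardless of age.

```python
from datetime import date


def review_agenda(threads: list[RiskThread], today: date, cadence_days: int = 14):
    """Every thread that is not fully closed, oldest first, with staleness flagged."""
    open_threads = [t for t in threads if t.status is not RiskStatus.CLOSED]
    open_threads.sort(key=lambda t: t.opened_on)   # the oldest risks surface first
    return [(t, t.is_stale(today, cadence_days)) for t in open_threads]
```

Staleness here is a prompt for discussion, not a metric for punishment; the review exists to keep the thread visible, not to shame its owner.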
Resilience does not come from reacting faster to the next incident. It comes from refusing to forget the last one.
That is a system design choice.