I recently listened to the podcast "Reflections on Incidents & Resilience" with Nick Rockwell. Many ideas resonated, especially the need for long-term memory to close the long loop on structural issues. Below are my observations, framed less as a recap and more as a point of view on what leaders can do next. I will reference the conversation, but the takeaways here are my own.
Early on, most companies rely on a heroic model. A few people carry the system in their heads, rush into every fire, and save the day. Nick acknowledges this is unavoidable at the start. My observation is that heroics create an emotional rhythm that favors the short loop. You fix today's outage, dopamine hits, everyone exhales, and then roadmap work swallows the calendar. The unfinished root causes fade out of view.
Leadership implication: stop celebrating only mitigation speed. Start measuring completion of structural work that prevents the same class of incident next quarter.
Nick pushes on a subtle point. Psychological safety is vital, but it cannot obscure the search for truth. I agree. If your post-incident conversations protect feelings at the expense of accuracy, you will preserve the conditions for the next failure. The goal is not blame; the goal is facts that help you change system design, operational affordances, and team habits.
Leadership implication: set ground rules that make space for emotions, then hold a separate, sharply factual review where evidence, timelines, and design constraints are examined with rigor.
Nick's strongest idea is the distinction between the short loop and the long loop. The short loop restores service. The long loop changes architecture, interfaces, runbooks, and team topology. That work often spans quarters. My experience matches his worry: long-loop items die from neglect unless you install a memory system and a governance muscle that drags them back into the light week after week.
Leadership implication: maintain a single, durable ledger of open risk threads. Review it on a cadence that does not care about release pressure. Tie threads to evidence and decisions, not vague action items.
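To make the ledger idea concrete, here is a minimal sketch in Python of what one entry and a weekly review query could look like. Everything in it is my own illustration, not something from the podcast or from any particular tool: the `RiskThread` fields, the `due_for_review` helper, and the seven-day default cadence are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class RiskThread:
    """One open structural risk (hypothetical schema, for illustration only)."""
    title: str
    opened: date
    # Incident IDs, timelines, or links that justify the thread.
    evidence: List[str] = field(default_factory=list)
    # Dated decisions taken on the thread, rather than vague action items.
    decisions: List[str] = field(default_factory=list)
    last_reviewed: Optional[date] = None
    closed: Optional[date] = None


def due_for_review(ledger: List[RiskThread], today: date,
                   cadence_days: int = 7) -> List[RiskThread]:
    """Return open threads not reviewed within the cadence, oldest first."""
    cutoff = today - timedelta(days=cadence_days)
    stale = [
        t for t in ledger
        if t.closed is None
        and (t.last_reviewed is None or t.last_reviewed <= cutoff)
    ]
    return sorted(stale, key=lambda t: t.opened)
```

The query deliberately ignores release dates and sprint boundaries. A thread resurfaces because time has passed since anyone looked at it, which is the property that keeps long-loop work from fading out of view.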
Nick describes a deliberate tilt toward more process so the crank keeps turning even when motivation dips. I agree with the intent and share his concern. Overfit bureaucracy can choke change. The trick is to create adversarial alignment. Make teams interlock through explicit contracts, SLOs, and change gates, then pair that with periodic refactoring windows and a safety valve for fast experimentation.
Leadership implication: codify change control and SLOs, but budget change velocity on purpose. Reliability that never evolves will fail a different way.
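One common way to budget change velocity on purpose is an error budget, a practice borrowed from site reliability engineering rather than something discussed in the conversation. The sketch below assumes an availability SLO over a 30-day window and a hypothetical policy that pauses risky changes once less than a quarter of the budget remains. The function names and the 25% threshold are illustrative assumptions, not a recommendation.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Downtime allowed by an availability SLO over the window, in minutes."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - slo_target) * window_days * 24 * 60


def change_gate(slo_target: float, downtime_minutes: float,
                window_days: int = 30, freeze_below: float = 0.25) -> str:
    """Return "normal" or "freeze" based on the fraction of budget left.

    freeze_below is a hypothetical policy knob: the remaining-budget
    fraction under which risky changes are paused.
    """
    budget = error_budget_minutes(slo_target, window_days)
    remaining = max(0.0, 1.0 - downtime_minutes / budget)
    return "freeze" if remaining < freeze_below else "normal"
```

A 99.9% target over 30 days allows roughly 43 minutes of downtime. Under these assumptions, a team that has used 10 of those minutes keeps shipping, while a team that has used 40 stops and spends the time on long-loop work instead.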
Two tracks are necessary:
Leadership implication: do not let the postmortem be the end. Make it the start of a traceable thread that survives reorgs and roadmap churn.
If the long loop is going to survive, your organization needs more than docs and good intentions. It needs a memory that is:
This is the reason we built COEhub. It acts as the memory system that Nick's model requires.
The outcome is simple. Your team stops relearning the same lessons. Your long loop stays alive until the real work is done.
If this resonates with your reality, COEhub was built to make the cadence stick. It keeps memory, highlights patterns, and nudges the long loop forward until the threads are, in Nick's words, not open anymore.
Your org gets smarter, and your systems do too.