Reflections & Learnings from Recent Cloudflare Incidents

Introduction

In late 2025, a sequence of outages at Cloudflare disrupted large portions of the Internet. High-visibility platforms such as LinkedIn, Canva, Shopify, Spotify, ChatGPT, and Zerodha were affected. Even Downdetector was impaired, a reminder of how tightly coupled modern infrastructure has become.

Cloudflare published detailed postmortems and followed them with a company-wide initiative called Code Orange: Fail Small. These documents are notable not because the failures were novel, but because they expose a class of problems that many large systems now share. They reveal how speed, configuration, control planes, and organizational memory interact in ways that are no longer safely manageable through human intuition alone.

This essay treats the September, November, and December incidents as a single narrative. The goal is not to critique Cloudflare, but to use their unusually transparent analysis to explore what reliable systems require at global scale, and why memory must now be treated as a foundational engineering construct rather than an afterthought.

Two Failure Classes That Deserve Separation

One of the most important clarifications when reading these incidents together is that they fall into two distinct but related categories.

The September outage was primarily a control plane availability failure. A bug in the dashboard React code generated excessive API calls that overwhelmed the Tenant Service. Customers could not authenticate or manage their configurations.

The November and December outages were data plane correctness failures. Configuration changes corrupted runtime behavior, causing traffic drops and application outages.

These two planes serve different purposes and deserve different safety properties. Control planes should favor strong isolation, rate limiting, and explicit privilege escalation paths, even at the cost of latency. Data planes should favor graceful degradation, safe defaults, and availability over correctness.

Cloudflare's incidents show what happens when these concerns blur. Control planes inherited fragility from shared services. Data planes inherited brittleness from configuration that was treated as inherently safe.

September 2025: When the Control Plane Becomes the Load

The September incident began in what many organizations would still classify as a low risk surface. A frontend change in the Cloudflare dashboard caused a surge of API requests. Those requests overwhelmed the Tenant Service, a core authorization component.

The deeper issue here is not React or API efficiency. It is that the dashboard was implicitly trusted as a well-behaved client. Internal tools often bypass the defensive assumptions applied to external consumers.

At Cloudflare's scale, internal traffic is often more dangerous than external traffic. It is authenticated, privileged, and capable of saturating shared control plane components quickly.

The lesson is structural. Control plane services must be designed to withstand abuse from first-party components. They must enforce budgets, isolation, and backpressure even when the caller is the organization itself.

November and December 2025: Configuration as Executable Behavior

The November 18 and December 5 incidents shared a common mechanism. A configuration change was propagated globally within seconds and triggered widespread failure.

In November, an automatic update to the Bot Management classifier produced an oversized configuration file. Rust-based core systems crashed while parsing it. In December, a security-driven configuration update related to a React vulnerability caused a similar cascade, affecting Workers scripts and APIs.

What matters is not the specific bug. What matters is that configuration acted as executable behavior with no staging, no gating, and no isolation.

Historically, configuration systems were built for speed. They exist to enable rapid reaction to threats and operational changes. At Cloudflare's scale, that speed eliminated the opportunity for observation and containment.

Failures propagated faster than humans could detect them. At that point, incident response is no longer a control loop. It is an archaeological exercise.

Health Mediated Deployments as a Control System

Cloudflare's decision to extend Health Mediated Deployments to configuration is the most significant technical response in Code Orange, but it deserves unpacking.

Health Mediated Deployment is not simply a canary. It is a feedback-driven control system layered on top of deployment.

At its core, HMD requires every service owner to define what health means for their system. This typically includes error rates, latency percentiles, saturation signals, and in some cases semantic correctness metrics derived from synthetic probes. For traffic handling systems, this might include request success ratios, tail latency, and fallback activation rates.

The decision function is usually threshold-based, but in mature systems it often includes trend analysis and guard bands. A deployment is allowed to proceed only if health metrics remain within defined envelopes over a sustained period. This is critical for catching delayed failure modes that do not manifest immediately.

Delayed failures are the hardest class. Memory leaks, state corruption, cache poisoning, and slow feedback loops often appear minutes or hours after rollout. A health mediated system must explicitly budget time for observation before progressing.
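
As a sketch of that control loop, consider a hypothetical gate that lets a rollout progress only after metrics have stayed inside their envelope for a full soak window. The metric names, thresholds, and window length below are illustrative assumptions, not Cloudflare's actual HMD implementation:

```typescript
// Hypothetical health gate: metric names, thresholds, and the soak window
// are illustrative, not Cloudflare's actual HMD implementation.
interface HealthSnapshot {
  errorRate: number;      // fraction of failed requests
  p99LatencyMs: number;   // tail latency
  fallbackRate: number;   // how often safe defaults were activated
}

interface HealthEnvelope {
  maxErrorRate: number;
  maxP99LatencyMs: number;
  maxFallbackRate: number;
  soakSeconds: number;    // time metrics must stay healthy before progressing
}

function withinEnvelope(s: HealthSnapshot, e: HealthEnvelope): boolean {
  return (
    s.errorRate <= e.maxErrorRate &&
    s.p99LatencyMs <= e.maxP99LatencyMs &&
    s.fallbackRate <= e.maxFallbackRate
  );
}

// Gate decision: progress only if every snapshot in the soak window is healthy,
// and only after enough time has passed to catch delayed failure modes.
function mayProgress(
  window: HealthSnapshot[],
  envelope: HealthEnvelope,
  intervalSeconds: number,
): boolean {
  const observedSeconds = window.length * intervalSeconds;
  if (observedSeconds < envelope.soakSeconds) return false; // budget observation time
  return window.every((s) => withinEnvelope(s, envelope));
}
```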

By applying this framework to configuration changes, Cloudflare is effectively stating that configuration is subject to the same uncertainty and risk as code. That is a necessary correction.

What Fail Small Actually Means Architecturally

Fail small is a compelling phrase, but it only becomes meaningful when translated into concrete architectural patterns.

At a minimum, fail small implies cell-based isolation. Whether Cloudflare already had a cellular architecture is less important than whether configuration changes respected those cell boundaries. In these incidents, a single configuration change correlated failure across every cell simultaneously.

Shuffle sharding is one well-known technique for preventing correlated failure across customers. By ensuring that different customers depend on different subsets of infrastructure, a single bad configuration cannot affect everyone at once. Applying this to configuration propagation requires deliberate design.
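
A minimal sketch of the assignment step, with hypothetical customer and cell identifiers; configuration rollout would then be gated along the same shard boundaries rather than pushed everywhere at once:

```typescript
import { createHash } from "node:crypto";

// Hypothetical shuffle-shard assignment: each customer is mapped to a small,
// deterministic, pseudo-random subset of cells, so a bad configuration scoped
// to one shard cannot touch every customer at once.
function shardFor(customerId: string, cells: string[], shardSize: number): string[] {
  const scored = cells.map((cell) => ({
    cell,
    score: createHash("sha256").update(`${customerId}:${cell}`).digest("hex"),
  }));
  // Deterministic per-customer ordering of cells; take the first N as the shard.
  scored.sort((a, b) => a.score.localeCompare(b.score));
  return scored.slice(0, shardSize).map((s) => s.cell);
}

// Two customers are fully correlated only if their shards overlap completely,
// which becomes increasingly unlikely as the number of cells grows.
const cells = ["cell-a", "cell-b", "cell-c", "cell-d", "cell-e", "cell-f"];
console.log(shardFor("customer-1", cells, 2));
console.log(shardFor("customer-2", cells, 2));
```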

Geographic staging is another missing layer. A regional canary would not have eliminated these bugs, but it could have limited blast radius and provided earlier signals. Global instantaneous propagation should be an explicit override, not the default.

Fail small also requires isolation between services: a failure in Bot Management should not propagate into unrelated products such as dashboards or Workers. This requires strict interface contracts and defensive defaults.

In the November incident, a corrupted configuration file should have resulted in degraded bot detection quality, not traffic loss. In the December incident, a failed security signature update should have allowed traffic through with reduced protection.

Fail small ultimately means designing defaults that preserve availability when correctness cannot be guaranteed.

The Testing Gap and Why It Persists

Cloudflare correctly notes that testing environments rarely reflect production complexity. This is true, but it is also incomplete.

The industry has developed techniques specifically designed to close this gap. They are underutilized, but they exist.

Configuration Fuzzing

Configuration fuzzing systematically generates malformed, edge-case, and adversarial configurations to find crash conditions before production does.

A configuration fuzzer for bot management rules might generate inputs like:
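
```typescript
// Illustrative fuzz inputs only; the configuration shape is hypothetical,
// not Cloudflare's actual Bot Management format.
const fuzzCases: string[] = [
  JSON.stringify({ rules: [] }),                                        // empty but valid
  JSON.stringify({ rules: Array(200_000).fill({ id: 1, score: 0.5 }) }), // far beyond expected size
  JSON.stringify({ rules: [{ id: 1, score: "NaN" }] }),                 // wrong type
  JSON.stringify({ rules: [{ id: 1 }] }),                               // missing field
  '{"rules": [{"id": 1, "score": 0.5}',                                 // truncated JSON
  JSON.stringify({ rules: [{ id: 1, score: 0.5, extra: { deep: { nesting: true } } }] }), // unknown fields
  "\uFEFF" + JSON.stringify({ rules: [] }),                             // byte-order-mark prefix
];

// Each case is fed to the real parser. Anything other than a clean
// accept-or-reject result (a crash, a hang, unbounded memory) is a finding.
```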

The November incident involved an oversized configuration file that crashed Rust parsers. A fuzzer generating configurations at the boundary of size limits would have found this. The question is whether such fuzzing was part of the release process for Bot Management updates.

Shadow Validation

Shadow validation applies new configurations against production traffic without enforcing behavior. The system computes what would happen, compares it to what did happen, and alerts on divergence.

Architecturally, this requires dual-path execution:
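
```typescript
// Sketch of dual-path evaluation; the types, names, and thresholds are
// hypothetical, not a real Cloudflare interface.
type Verdict = "allow" | "challenge" | "block";
type Classifier = (requestFeatures: Record<string, unknown>) => Verdict;

let divergent = 0;
let total = 0;

function handleRequest(
  features: Record<string, unknown>,
  active: Classifier,     // configuration currently enforced
  candidate: Classifier,  // proposed configuration, evaluated but never enforced
): Verdict {
  const enforced = active(features);
  try {
    const shadow = candidate(features);
    total += 1;
    if (shadow !== enforced) divergent += 1; // record the divergence, do not act on it
  } catch {
    total += 1;
    divergent += 1;                          // a crash in the shadow path is also divergence
  }
  return enforced;                           // only the active path affects traffic
}

// Promotion gate: block rollout if the candidate diverges too often.
function candidateLooksSafe(maxDivergence = 0.001): boolean {
  return total > 0 && divergent / total <= maxDivergence;
}
```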

For the December incident, shadow validation would have revealed that the proposed security configuration was blocking traffic that the current configuration allowed. The divergence rate would have spiked before any customer was affected.

Schema Evolution Testing

When configuration schemas change, older consumers may still be running. Schema evolution testing verifies that new configurations degrade gracefully on systems that do not yet understand them.

This is particularly important for globally distributed systems where configuration propagates faster than code updates:
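
```typescript
// Hypothetical compatibility check: oldParse stands in for the parser that is
// already deployed, newConfigs for configurations produced under the new schema.
type OldConfig = { rules: { id: number; score: number }[] };

const SAFE_DEFAULT: OldConfig = { rules: [] };  // degrade, never exit

function oldParse(raw: string): OldConfig {
  try {
    const value = JSON.parse(raw);
    const rules = Array.isArray(value?.rules) ? value.rules : [];
    return {
      // Keep only entries the old schema understands; silently drop the rest.
      rules: rules.filter(
        (r: any) => typeof r?.id === "number" && typeof r?.score === "number",
      ),
    };
  } catch {
    return SAFE_DEFAULT;                         // unknown format, fall back
  }
}

// The evolution test: every config emitted under the new schema must produce
// *some* OldConfig from the old parser, never an exception or a process exit.
export function assertBackwardCompatible(newConfigs: string[]): void {
  for (const raw of newConfigs) {
    const parsed = oldParse(raw);                // must not throw
    if (!Array.isArray(parsed.rules)) {
      throw new Error("old parser produced an invalid fallback");
    }
  }
}
```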

The rule is simple: a new configuration must never crash an old parser. It can be ignored, partially applied, or trigger a safe default. But it cannot cause a process to exit.

Defensive Parsing in Practice

The Rust crashes described in November strongly suggest insufficient defensive parsing. In a system where configuration is executable behavior, parsing failures should never crash core processes.

Here is what unsafe Rust parsing often looks like:
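
```rust
// A sketch of fragile parsing. BotConfig and parse_rule are illustrative,
// not Cloudflare's actual types. Every unwrap() converts bad input into a panic.
use serde_json::Value;

struct Rule {
    id: u64,
    score: f64,
}

struct BotConfig {
    threshold: f64,
    rules: Vec<Rule>,
}

fn parse_rule(value: &Value) -> Rule {
    Rule {
        id: value["id"].as_u64().unwrap(),       // panics on wrong type
        score: value["score"].as_f64().unwrap(), // panics on missing field
    }
}

fn load_bot_config(raw: &str) -> BotConfig {
    let value: Value = serde_json::from_str(raw).unwrap();   // panics on malformed JSON
    let rules = value["rules"].as_array().unwrap();          // panics if "rules" is absent
    BotConfig {
        threshold: value["threshold"].as_f64().unwrap(),     // panics on wrong type
        rules: rules.iter().map(parse_rule).collect(),
    }
}
```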

Every unwrap() is a potential crash. In a traffic-serving system, this code turns malformed configuration into an outage.

Defensive parsing never panics. It returns errors or falls back to known-good state:
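
```rust
// A defensive counterpart, still a sketch: the size limit and type names are
// illustrative assumptions, not Cloudflare's actual code.
use serde::Deserialize;

const MAX_CONFIG_BYTES: usize = 1_048_576; // reject anything larger before parsing

#[derive(Clone, Deserialize)]
struct Rule {
    id: u64,
    score: f64,
}

#[derive(Clone)]
struct BotConfig {
    threshold: f64,
    rules: Vec<Rule>,
}

/// Returns the new configuration if it is usable, otherwise keeps the current one.
fn apply_config(raw: &str, current: &BotConfig) -> BotConfig {
    if raw.len() > MAX_CONFIG_BYTES {
        eprintln!("config rejected: {} bytes exceeds limit", raw.len());
        return current.clone(); // keep serving with known-good state
    }
    match serde_json::from_str::<serde_json::Value>(raw) {
        Ok(value) => {
            let threshold = value["threshold"].as_f64().unwrap_or(current.threshold);
            let rules = value["rules"]
                .as_array()
                .map(|items| {
                    items
                        .iter()
                        // Skip malformed entries instead of failing the whole update.
                        .filter_map(|item| serde_json::from_value::<Rule>(item.clone()).ok())
                        .collect()
                })
                .unwrap_or_else(|| current.rules.clone());
            BotConfig { threshold, rules }
        }
        Err(err) => {
            eprintln!("config rejected: parse error: {err}");
            current.clone() // fall back, never panic
        }
    }
}
```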

The key principles: bound input size before parsing, return the current configuration on any error, skip malformed entries rather than failing entirely, and log everything for debugging.

Chaos Engineering for Configuration Systems

The techniques above test components in isolation. Chaos engineering tests how the system behaves when things go wrong at runtime. For configuration-driven systems, this means deliberately injecting the failure modes that cause outages and verifying the system degrades gracefully.

Each of Cloudflare's 2025 incidents maps to a chaos experiment that could have discovered the failure proactively.

Experiment 1: Configuration Size Explosion
The November incident occurred when a Bot Management configuration grew beyond expected bounds. A chaos experiment would deliberately inject oversized configurations and verify the system continues serving traffic:
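
```typescript
// Hypothetical experiment definition; field names and metrics are illustrative.
const oversizedConfigExperiment = {
  hypothesis:
    "An oversized Bot Management configuration is rejected at the edge and " +
    "traffic keeps flowing with the previous configuration.",
  steadyState: {
    request_success_rate: ">= 99.9%",
    process_restarts: "== 0",
  },
  injection: {
    action: "publish_config",
    target: "bot-management",
    payload: "valid configuration inflated to 2x the documented size limit",
    scope: "one test cell only",              // fail small even while testing failure
  },
  abortWhen: [
    "process_restarts > 0",                   // the parser crashed: stop immediately
    "request_success_rate < 99.5%",
  ],
  rollback: "republish last known-good configuration",
};
```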

If this experiment had run before November, the Rust parser crash would have been discovered. The abort condition would trigger on process_restarts > 0, and the team would know that oversized configurations cause crashes rather than graceful rejection.

Experiment 2: Dependency Unavailability
The November incident also revealed a circular dependency: Turnstile protected the dashboard, but Turnstile itself was affected by the outage. Customers could not log in to fix the problem. A chaos experiment would deliberately disable authentication dependencies:
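
```typescript
// Hypothetical experiment definition; service names are illustrative.
const authDependencyExperiment = {
  hypothesis:
    "Operators can still reach the dashboard and roll back configuration " +
    "when Turnstile (or any other authentication dependency) is unavailable.",
  steadyState: {
    operator_login_success: "== 100% via break-glass path",
  },
  injection: {
    action: "block_dependency",
    target: "turnstile",                      // challenge service protecting login
    scope: "staging control plane",
  },
  verification: [
    "break-glass login succeeds without the blocked dependency",
    "rollback of a test configuration completes end to end",
  ],
  abortWhen: ["production login paths show any impact"],
  rollback: "restore dependency routing",
};
```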

This experiment directly tests the break-glass procedures. If operators cannot authenticate when Turnstile is down, the experiment fails and the circular dependency is exposed before an actual outage.

Experiment 3: Control Plane Saturation
The September incident showed that internal tools can overwhelm control plane services. A chaos experiment would simulate excessive internal load:
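
```typescript
// Hypothetical experiment definition; service and metric names are illustrative.
const controlPlaneSaturationExperiment = {
  hypothesis:
    "Bulk internal traffic cannot starve critical Tenant Service operations; " +
    "authentication and configuration reads stay within their latency budget.",
  steadyState: {
    tenant_service_auth_p99_ms: "<= 250",
    tenant_service_error_rate: "<= 0.1%",
  },
  injection: {
    action: "replay_internal_traffic",
    source: "dashboard API call patterns, amplified 10x",
    target: "tenant-service",
    scope: "staging first, then one production cell behind rate limits",
  },
  abortWhen: [
    "tenant_service_auth_p99_ms > 1000",
    "any customer-facing authentication failures",
  ],
  rollback: "stop replay traffic",
};
```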

This experiment would reveal whether the Tenant Service has proper isolation between critical operations and bulk internal traffic. If the hypothesis fails, the team knows that rate limiting or priority queuing is insufficient.

Experiment 4: Configuration Propagation Race
Both November and December incidents involved configuration propagating globally before problems were detected. A chaos experiment would test whether rollback can outrace propagation:
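
```typescript
// Hypothetical experiment definition; timings and metric names are illustrative.
const propagationRaceExperiment = {
  hypothesis:
    "A benign marker configuration can be rolled back everywhere faster than " +
    "a bad configuration could finish propagating.",
  steadyState: {
    marker_config_present: "on 0% of machines before the experiment",
  },
  injection: {
    action: "publish_then_rollback",
    payload: "harmless marker configuration",
    rollbackAfterSeconds: 30,
  },
  measurements: [
    "time for the marker to reach 50%, 95%, 100% of machines",
    "time for the rollback to reach the same milestones",
  ],
  failWhen: ["rollback_p100_seconds > propagation_p100_seconds"],
  rollback: "remove marker configuration (already part of the experiment)",
};
```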

If this experiment fails, it reveals that configuration propagates faster than it can be rolled back. This is precisely the condition that made the November and December incidents so severe. The fix might be staged rollouts, regional gating, or faster rollback mechanisms.

Incident Memory as Experiment Design

The most effective chaos experiments are informed by past failures. Every postmortem should generate at least one chaos experiment that would have detected the failure proactively.

This is where incident memory becomes operational. A system like COEhub can surface historical failures when engineers are designing experiments:
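
```typescript
// Illustrative only: this is a hypothetical client and record shape, not
// COEhub's actual API. The point is the flow, not the field names.
interface IncidentRecord {
  id: string;
  trigger: string;        // e.g. "oversized configuration file"
  blastRadius: string;    // e.g. "global, all cells"
  failedInvariant: string;
  mitigation: string;
}

async function suggestExperiments(changeDescription: string): Promise<string[]> {
  // Hypothetical endpoint that returns past incidents similar to a proposed change.
  const response = await fetch("https://coehub.example/api/incidents/similar", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ query: changeDescription }),
  });
  const incidents: IncidentRecord[] = await response.json();

  // Every matching incident becomes a candidate chaos experiment.
  return incidents.map(
    (i) => `Verify invariant "${i.failedInvariant}" still holds when "${i.trigger}" recurs`,
  );
}

// Usage: suggestExperiments("raise Bot Management rule count limit")
```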

This closes the loop between incident response and proactive testing. Past failures become the specification for future resilience verification.

Testing cannot eliminate all risk, but these techniques systematically reduce the unknown unknowns that cause catastrophic failure. The gap between testing and production is real, but it is not insurmountable.

Control Planes Must Survive the Outage They Are Meant to Fix

One of the most instructive aspects of the November incident was the failure of Turnstile, which blocked access to the Cloudflare dashboard. Customers could not log in because the system protecting the login flow depended on the same failing infrastructure.

This is a textbook example of circular dependency. It is also extremely common.

Control planes must be more resilient than the systems they control. They must be reachable during failure. Security mechanisms must support explicit emergency bypass for authorized operators.

Break glass procedures are not operational hacks. They are core reliability features.

Organizational Memory as an Active System

The most interesting theme in these incidents is not technical. It is temporal.

November and December were not identical failures, but they were similar enough that the pattern should feel familiar. Configuration propagated globally. Safety mechanisms were bypassed. Blast radius was total.

This suggests that while individual incidents were understood, the pattern was not yet encoded into the system itself.

Traditional postmortems are static artifacts. They explain the past, but they do not automatically shape future behavior.

For memory to become a first class reliability primitive, it must be active.

This is where COEhub fits, not as documentation, but as infrastructure.

In a concrete sense, this means capturing incidents as structured data. Failure modes, triggers, blast radius, mitigations, and invariants become machine readable. That data feeds an MCP server that AI coding, testing, deployment, and debugging tools can query in real time.

A proposed configuration change can be compared against historical incident signatures. A pre-commit hook can surface related failures. A deployment pipeline can require explicit acknowledgment of known risks. Design review checklists can be generated dynamically from prior postmortems.
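
As a sketch of the pre-commit idea, assuming a hypothetical COEhub endpoint, payload shape, and repository layout:

```typescript
#!/usr/bin/env node
// Hypothetical pre-commit hook; the endpoint, payload, and config path are
// illustrative, not a real COEhub API.
import { execSync } from "node:child_process";

async function main(): Promise<void> {
  const diff = execSync("git diff --cached -- config/", { encoding: "utf8" });
  if (!diff.trim()) return;                       // no configuration changes staged

  const response = await fetch("https://coehub.example/api/incidents/match", {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify({ diff }),
  });
  const matches: { id: string; summary: string }[] = await response.json();

  if (matches.length > 0) {
    console.warn("This change resembles past incidents:");
    for (const m of matches) console.warn(`  ${m.id}: ${m.summary}`);
    console.warn("Acknowledge with COMMIT_ACK_INCIDENTS=1 to proceed.");
    if (process.env.COMMIT_ACK_INCIDENTS !== "1") process.exit(1);
  }
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```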

This is how organizational learning moves from narrative to enforcement.

Beyond Testing: Toward Stronger Guarantees

If configuration truly defines behavior, then testing alone is insufficient.

The core problem is that testing explores a sample of possible states. Configuration, especially at Cloudflare's scale, can produce combinatorial state spaces that no test suite will ever cover. When a single malformed configuration can crash a global fleet, we need guarantees that hold for all inputs, not just the ones we thought to test.

Typed Configuration Schemas

The most accessible starting point is schema validation with rich types. Rather than treating configuration as arbitrary JSON, define schemas that make invalid states unrepresentable.

Consider a simplified bot management configuration. A naive approach allows any structure:
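
```typescript
// Naive: anything that parses as JSON is accepted as configuration.
type BotConfig = any;

function loadConfig(raw: string): BotConfig {
  return JSON.parse(raw);   // no size bound, no shape check, no field validation
}
```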

This invites runtime failures. A typed schema in TypeScript with Zod enforces constraints at parse time:
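
```typescript
import { z } from "zod";

// Illustrative schema; field names and limits are assumptions, not
// Cloudflare's actual Bot Management format.
const Rule = z
  .object({
    id: z.number().int().positive(),
    expression: z.string().max(4096),
    score: z.number().min(0).max(1),
    action: z.enum(["allow", "challenge", "block"]),
  })
  .strict();                               // unknown fields are an error, not a surprise

const BotConfig = z
  .object({
    version: z.number().int(),
    rules: z.array(Rule).max(10_000),      // hard ceiling on configuration size
  })
  .strict();

const MAX_CONFIG_BYTES = 1_048_576;

export function parseConfig(raw: string) {
  if (new TextEncoder().encode(raw).length > MAX_CONFIG_BYTES) {
    return { ok: false as const, error: "configuration exceeds size bound" };
  }
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return { ok: false as const, error: "malformed JSON" };
  }
  const result = BotConfig.safeParse(value);
  return result.success
    ? { ok: true as const, config: result.data }
    : { ok: false as const, error: result.error.message };
}
```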

The November incident involved an oversized configuration file. A schema like this would reject configurations exceeding size bounds before they ever reach the parser that crashed.

In Rust, the same discipline applies:
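
```rust
// Illustrative types; field names and limits are assumptions, not Cloudflare's schema.
use serde::Deserialize;
use std::convert::TryFrom;

const MAX_RULES: usize = 10_000;

#[derive(Debug, Deserialize)]
#[serde(deny_unknown_fields)]
struct Rule {
    id: u64,
    expression: String,
    score: f64,
}

// Deserialize into a raw form first, then validate size during conversion,
// so an invalid configuration never becomes a BotConfig value.
#[derive(Debug, Deserialize)]
#[serde(deny_unknown_fields)]
struct RawBotConfig {
    version: u32,
    rules: Vec<Rule>,
}

#[derive(Debug, Deserialize)]
#[serde(try_from = "RawBotConfig")]
struct BotConfig {
    version: u32,
    rules: Vec<Rule>,
}

impl TryFrom<RawBotConfig> for BotConfig {
    type Error = String;

    fn try_from(raw: RawBotConfig) -> Result<Self, Self::Error> {
        if raw.rules.len() > MAX_RULES {
            return Err(format!("{} rules exceeds limit of {}", raw.rules.len(), MAX_RULES));
        }
        if raw.rules.iter().any(|r| !(0.0..=1.0).contains(&r.score)) {
            return Err("rule score outside [0, 1]".to_string());
        }
        Ok(BotConfig { version: raw.version, rules: raw.rules })
    }
}

fn parse_config(raw: &str) -> Result<BotConfig, String> {
    serde_json::from_str(raw).map_err(|e| e.to_string())
}
```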

The key insight is that deny_unknown_fields and explicit size validation happen during deserialization. A malformed configuration never becomes a valid in-memory object.

Formal Specification of Configuration Invariants

Schemas catch structural errors. Invariants catch semantic errors. Some properties must always hold true regardless of what configuration is applied.

For a system like Cloudflare's, invariants might include: a configuration update must never reduce total serving capacity below a threshold, bot rules must not create contradictory actions for the same traffic pattern, and security configurations must not remove protections without explicit override flags.

These can be expressed in specification languages. TLA+ is one option for modeling configuration state machines:
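
```tla
---------------------------- MODULE ConfigRollout ----------------------------
\* A deliberately small sketch: the lifecycle stages and size bound are illustrative.
EXTENDS Naturals

CONSTANT MaxSize            \* maximum allowed configuration size

VARIABLES stage, size       \* stage: lifecycle state, size: configuration size

Init == /\ stage = "proposed"
        /\ size \in 0..(2 * MaxSize)     \* a proposal may be any size

Validate == /\ stage = "proposed"
            /\ IF size <= MaxSize
               THEN stage' = "staged"
               ELSE stage' = "rejected"
            /\ UNCHANGED size

Apply == /\ stage = "staged"
         /\ stage' = "applied"
         /\ UNCHANGED size

Next == Validate \/ Apply

Spec == Init /\ [][Next]_<<stage, size>>

\* Invariant checked by TLC: nothing oversized is ever applied.
SizeInvariant == (stage = "applied") => (size <= MaxSize)
===============================================================================
```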

For teams not ready for full formal methods, property-based testing provides a middle ground:
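
```typescript
import fc from "fast-check";
import { parseConfig } from "./config"; // the Zod-based parser sketched above (hypothetical module path)

// Hypothetical invariant: for any generated configuration, parsing must return
// a result (never throw), and anything it accepts must respect the limits.
const arbitraryRule = fc.record({
  id: fc.integer({ min: 1, max: 1_000_000 }),
  expression: fc.string({ maxLength: 64 }),
  score: fc.double({ min: 0, max: 2, noNaN: true }),    // deliberately exceeds the valid range
  action: fc.constantFrom("allow", "challenge", "block", "mystery"),
});

const arbitraryConfig = fc.record({
  version: fc.integer({ min: 0, max: 100 }),
  rules: fc.array(arbitraryRule, { maxLength: 20_000 }), // deliberately exceeds the limit
});

fc.assert(
  fc.property(arbitraryConfig, (candidate) => {
    const result = parseConfig(JSON.stringify(candidate)); // must not throw
    if (result.ok) {
      return (
        result.config.rules.length <= 10_000 &&
        result.config.rules.every((r) => r.score >= 0 && r.score <= 1)
      );
    }
    return true;                                           // rejecting is always acceptable
  }),
  { numRuns: 5000 },
);
```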

Property-based testing generates thousands of random configurations and verifies that invariants hold across all of them. It is not a proof, but it systematically explores edge cases that human-written tests miss.

Model Checking for Configuration State Machines

Configuration does not exist in isolation. It transitions through states: proposed, validated, staged, canary, rolled out, rolled back. Each transition has preconditions and postconditions.

A model checker can verify that no sequence of valid operations leads to a bad state:
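
```alloy
-- A small sketch: the stages and size bound (in abstract units) are illustrative.
abstract sig Stage {}
one sig Proposed, Staged, Applied extends Stage {}

sig Config {
  stage: one Stage,
  size: one Int        -- abstract size units, kept small for Alloy's default Int scope
}

-- Safety: anything past validation stays below the size bound.
pred safe[c: Config] {
  c.stage in Staged + Applied implies c.size < 5
}

-- Valid transitions: staging requires validation, applying requires staged.
pred transition[pre, post: Config] {
  post.size = pre.size
  (pre.stage = Proposed and pre.size < 5 and post.stage = Staged) or
  (pre.stage = Staged and post.stage = Applied)
}

-- If we start with a safe configuration and take any valid transition,
-- the result must also be safe.
assert SafePreserved {
  all pre, post: Config | safe[pre] and transition[pre, post] implies safe[post]
}

check SafePreserved for 6
```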

This Alloy specification says: if we start with a safe configuration and apply any valid transition, the result must also be safe. The model checker exhaustively verifies this for all possible configurations up to the specified bound.

Why This Matters Now

These techniques are not theoretical luxuries. They are becoming practical necessities.

The November crash happened because a Rust parser encountered a configuration larger than expected. Typed schemas would have rejected it at the boundary. Formal invariants would have flagged it as violating size constraints. Model checking would have verified that no valid update path could produce such a file.

As AI systems increasingly propose configuration changes, generate deployment scripts, and suggest optimizations, the need for machine-checkable guarantees becomes acute. An AI assistant operating without formal constraints is optimizing in a space where silent failures are possible.

Formal methods and type systems do not eliminate the need for testing. They change what testing is for. Tests verify behavior. Types and specifications verify structure and invariants. Together, they reduce the space of possible failures to a manageable size.

The direction is clear. Configuration that can crash your fleet deserves the same rigor as the code that runs on it.

Conclusion

Cloudflare's late 2025 incidents are not a story of incompetence. They are a case study in what happens when infrastructure grows beyond the limits of human memory.

The postmortems are strong. Code Orange is the right response. Treating configuration like code, enforcing health mediated deployment, designing for failure isolation between services, and hardening control plane access are all necessary steps.

The next step is to ensure these lessons persist.

As systems become more automated and AI driven, resilience will depend less on individual vigilance and more on whether the system itself remembers how it has failed before.

Long term operational memory, when made active, becomes a reliability primitive. Not a document. Not a process. A system property.

That is the deeper lesson embedded in Cloudflare's experience, and it is one the rest of the industry would be wise to internalize before learning it the same way.