Data Centers
2017

AWS S3 Outage

A mistyped maintenance command removed far more capacity than intended, exposing hidden initialization-time dependencies.

Initiating Event

On February 28, 2017, an AWS engineer was troubleshooting an issue with the S3 billing system in the US-EAST-1 region. Following an established playbook, the plan was to remove a small number of servers from one of the S3 subsystems used by the billing process in order to isolate the problem.

The Maintenance Command

The engineer ran a command intended to remove a small subset of capacity. One of the command's input parameters was entered incorrectly, however, and the command matched a far larger set of servers than intended.
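AWS's post-event summary described adding exactly this kind of safeguard afterward: tooling that removes capacity more slowly and refuses to take any subsystem below its minimum required level. Below is a minimal sketch of the idea; `remove_capacity`, the thresholds, and the `fleet` structure are hypothetical illustrations, not AWS tooling.

```python
class CapacityRemovalError(Exception):
    pass

def remove_capacity(fleet, matched_ids, max_fraction=0.05, min_remaining=100):
    """Remove instances, but fail loudly if the match set looks wrong."""
    fraction = len(matched_ids) / len(fleet)
    if fraction > max_fraction:
        raise CapacityRemovalError(
            f"Command matched {len(matched_ids)} of {len(fleet)} instances "
            f"({fraction:.0%}); refusing to remove more than {max_fraction:.0%} at once."
        )
    if len(fleet) - len(matched_ids) < min_remaining:
        raise CapacityRemovalError(
            f"Removal would leave {len(fleet) - len(matched_ids)} instances, "
            f"below the {min_remaining} required to keep serving traffic."
        )
    for instance_id in matched_ids:
        fleet.remove(instance_id)
```

With a guard like this, a typo that matches two thousand servers instead of twenty fails fast instead of silently draining the fleet.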

Cascading Impact

The command removed servers supporting two other critical S3 subsystems: the index subsystem, which manages metadata and location information for every object in the region, and the placement subsystem, which allocates storage for new objects. Losing a significant portion of their capacity forced a full restart of both. Under normal conditions, the loss of a few servers would have been handled gracefully by failover.

The Hidden Dependency

The restart exposed a hidden ordering dependency: the placement subsystem depends on the index subsystem and could not recover until the index had fully restarted. Neither subsystem's initialization had been designed, or recently exercised, to handle an absence of this scale.
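One way to keep such orderings from staying hidden is to declare them explicitly and derive the start sequence from the declaration. The sketch below uses Python's standard-library `graphlib` for illustration; the subsystem names echo the post-mortem, but the supervisor pattern is an assumption, not a description of AWS internals.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Each subsystem declares what must be healthy before it can start.
DEPENDENCIES = {
    "index": set(),                              # object metadata and location
    "placement": {"index"},                      # storage allocation; needs the index
    "request_frontend": {"index", "placement"},
}

def startup_order(dependencies):
    """Derive a start order that respects every declared dependency."""
    return list(TopologicalSorter(dependencies).static_order())

print(startup_order(DEPENDENCIES))
# ['index', 'placement', 'request_frontend']
```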

Restoration Delays

When the removed servers came back online, they had to reinitialize, run safety checks on their metadata, and rejoin the cluster. Neither subsystem had been fully restarted in years, and S3 in the region had grown enormously in the meantime, so the process took far longer than expected.
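A back-of-the-envelope model makes the trap visible: if startup checks scale with stored metadata, a restart rehearsed years ago says little about one today. The linear model and the numbers below are purely illustrative assumptions.

```python
def estimated_restart_seconds(object_count, seconds_per_million=0.5):
    """Naive model: startup safety checks scale with metadata volume."""
    return object_count / 1_000_000 * seconds_per_million

# When the restart procedure was last exercised (hypothetical scale):
print(estimated_restart_seconds(1_000_000_000) / 60)      # ~8 minutes
# After years of growth, the same procedure:
print(estimated_restart_seconds(100_000_000_000) / 3600)  # ~14 hours
```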

The Failure

S3 in US-EAST-1 was unable to serve requests for roughly four hours. Not because hardware had failed, but because the initialization sequence could not complete fast enough to restore service.

Why It Happened

The system was designed assuming single instances or small groups would be removed. No one had explicitly analyzed what would happen if a large fraction failed simultaneously.
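The arithmetic of correlated failure shows why that assumption matters. With a few independently placed replicas per object, losing any one or two servers is survivable, but removing a large fraction of the fleet at once is a different regime entirely (illustrative numbers below):

```python
replicas_per_object = 3      # illustrative replication factor
fleet_loss = 0.40            # fraction of servers removed at once

# With independent random placement, the chance that ALL of an object's
# replicas sat on removed servers:
p_unavailable = fleet_loss ** replicas_per_object
print(f"{p_unavailable:.1%}")   # 6.4% of objects left with no live replica
```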

Testing Gaps

Testing had never simulated large simultaneous failures. Load testing had never included the recovery initialization times. Failure modes were assumed but never verified.
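A test for this gap can be stated in a few lines: remove a large fraction of a simulated cluster and require that recovery completes within the service-level objective. In the sketch below, `Cluster`, `terminate`, and `healthy` are stand-ins for whatever staging or simulation harness you have, not a real API.

```python
import random
import time

def test_mass_failure_recovery(cluster, kill_fraction=0.4, recovery_slo_s=600):
    """Terminate a large fraction of instances, then require timely recovery."""
    victims = random.sample(cluster.instances,
                            int(len(cluster.instances) * kill_fraction))
    for instance in victims:
        cluster.terminate(instance)

    deadline = time.monotonic() + recovery_slo_s
    while not cluster.healthy():
        if time.monotonic() > deadline:
            raise AssertionError(
                f"Cluster did not recover from losing {kill_fraction:.0%} "
                f"of its instances within {recovery_slo_s}s"
            )
        time.sleep(5)
```

Running a drill like this periodically also keeps the measured recovery time current as the system grows.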

Complexity Hidden

The system looked redundant on paper. Multiple data centers, multiple copies of data, automatic failover. But the initialization complexity was hidden and untested at scale.

Applying This

For your systems: Do you test recovery scenarios at realistic scale? Do you understand the time dependencies in your startup sequences? Have you simulated what happens when many components fail simultaneously?

How well do you understand what happens during recovery—not just when individual components fail, but when many do at once?

Events like this are rarely unique. Similar patterns appear across many industries and asset types.
