Data Centers
2017

AWS S3 Outage

Human error in maintenance command removed more capacity than intended, revealing initialization time dependencies.

Resources

Initiating Event

On February 28, 2017, an AWS engineer was troubleshooting an issue with the billing system in the US-EAST-1 region. The troubleshooting involved removing instances to isolate the problem.

The Maintenance Command

The engineer ran a command intended to remove a small subset of capacity for testing. However, due to a typo in the command parameters, it matched a much larger set of instances than intended.

Cascading Impact

The command removed significantly more capacity than expected. This triggered a restart sequence for the affected systems. Under normal conditions, this would be handled gracefully by failover systems.

The Hidden Dependency

The billing system had a specific initialization sequence that required reading state from the very systems that had just been removed. The initialization was not designed to handle their absence.

Restoration Delays

When the removed instances came back online, they needed to reinitialize and rejoin the cluster. This process had time dependencies that were longer than expected during recovery.

The Failure

S3 became unavailable for several hours. Not because the hardware failed, but because the initialization sequence could not complete fast enough to serve requests.

Why It Happened

The system was designed assuming single instances or small groups would be removed. No one had explicitly analyzed what would happen if a large fraction failed simultaneously.

Testing Gaps

Testing had never simulated large simultaneous failures. Load testing had never included the recovery initialization times. Failure modes were assumed but never verified.

Complexity Hidden

The system looked redundant on paper. Multiple data centers, multiple copies of data, automatic failover. But the initialization complexity was hidden and untested at scale.

Have you tested your system's initialization sequence under scenarios where large portions of your infrastructure restart simultaneously?

Events like this are rarely unique. Similar failure mode patterns appear across many industries and asset types — often invisible until the operating context changes.

Analysis by Reliability Management Ltd — specialist RCM trainers and facilitators with 30+ years of industrial reliability engineering experience across oil & gas, power, and process sectors.