Data Centers & Critical Infrastructure
Critical infrastructure systems are designed with extensive redundancy to ensure continuous availability. Failures often reveal hidden dependencies, common-cause vulnerabilities, or recovery time assumptions that only become apparent when redundancy is actually tested.
Common Failure Themes
- Single points of failure hidden within architectures assumed to be fully redundant
- Common-cause vulnerabilities where supposedly independent systems share dependencies on infrastructure, configuration, or timing
- Recovery time assumptions that do not account for initialization sequences or data consistency requirements
- Maintenance or change procedures that temporarily remove redundancy without adequate risk assessment
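The effect of a shared dependency on a redundant pair can be illustrated with a simplified beta-factor style approximation. This is a sketch under stated assumptions, not a reliability assessment: the function name `pair_failure_probability`, the per-unit failure probability `p`, and the common-cause fraction `beta` are all illustrative, and the numbers are made up for the example.

```python
def pair_failure_probability(p: float, beta: float) -> float:
    """Approximate probability that BOTH units of a redundant pair fail.

    p    : failure probability of a single unit (illustrative value)
    beta : assumed fraction of unit failures that stem from a shared cause
           (power feed, configuration push, clock/date logic, ...)
    """
    if not (0.0 <= p <= 1.0 and 0.0 <= beta <= 1.0):
        raise ValueError("p and beta must lie in [0, 1]")
    # Independent failures must hit both units at once.
    independent = ((1.0 - beta) * p) ** 2
    # A common-cause failure takes out both units together.
    common = beta * p
    return independent + common


fully_independent = pair_failure_probability(0.01, 0.0)   # about 0.0001
shared_dependency = pair_failure_probability(0.01, 0.1)   # about 0.00108
```

Under these assumed values, attributing only 10% of failures to a shared cause makes the pair roughly ten times more likely to fail together than the fully independent calculation suggests, which is why component-level redundancy figures can overstate system availability.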
Case Analyses
2017: AWS S3 Outage
An incorrectly entered input to a maintenance command removed more capacity than intended, and the affected subsystems took far longer to restart than expected, revealing initialization time dependencies.
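One way to limit the blast radius of this kind of operator error is to have the tooling validate a removal request before acting on it. The following is a minimal sketch of such a guard, not the tooling used in the incident: `plan_removal`, `CapacityGuardError`, `min_required`, and `max_fraction` are hypothetical names and thresholds chosen for illustration.

```python
class CapacityGuardError(Exception):
    """Raised when a removal request would violate a safety limit."""


def plan_removal(active: int, requested: int, min_required: int,
                 max_fraction: float = 0.05) -> int:
    """Return the capacity left after removal, or raise if the request is unsafe.

    active       : servers currently in service
    requested    : servers the operator asked to remove
    min_required : assumed floor below which the subsystem cannot serve load
    max_fraction : assumed cap on how much may be removed in a single step
    """
    if requested <= 0:
        raise ValueError("requested must be a positive number of servers")
    remaining = active - requested
    if remaining < min_required:
        raise CapacityGuardError(
            f"removal leaves {remaining} servers, below minimum {min_required}")
    if requested > active * max_fraction:
        raise CapacityGuardError(
            f"removal of {requested} exceeds {max_fraction:.0%} of active capacity")
    return remaining


remaining_after_small_step = plan_removal(active=1000, requested=20, min_required=900)
```

A mistyped request such as `requested=500` against the same fleet would be rejected before any server is touched, turning a silent loss of redundancy into an explicit error the operator has to resolve.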
2017: British Airways IT Outage
The sequence in which power was restored damaged multiple systems, extending recovery well beyond the initial event.
2012: Azure Leap Year Outage
A leap-day date logic error in certificate handling caused simultaneous failures across distributed systems.
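A common form of this class of bug is computing an expiry date by incrementing the year while leaving the month and day unchanged. The sketch below illustrates that pattern in general terms; it is not the code involved in the outage, and the function names `naive_one_year_later` and `safe_one_year_later` are hypothetical.

```python
from datetime import date


def naive_one_year_later(d: date) -> date:
    # Keeps month and day as-is. For 29 February this targets a date that
    # does not exist in the following year, so date.replace raises ValueError.
    return d.replace(year=d.year + 1)


def safe_one_year_later(d: date) -> date:
    # Same calculation, with an explicit fallback for leap day.
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        # 29 February has no counterpart next year; use 28 February instead.
        return d.replace(year=d.year + 1, day=28)


leap_day = date(2012, 2, 29)
expiry = safe_one_year_later(leap_day)  # date(2013, 2, 28)
```

Because every node runs the same date logic against the same calendar, a defect like this triggers everywhere on the same day, which makes it a common-cause failure that redundancy alone cannot absorb.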
Has your redundant architecture been stress-tested under realistic failure scenarios, or is redundancy only verified at the component level?