Why Complex Systems Fail — Practical RCM & Reliability Engineering

Real-world failure modes and reliability analyses examined at system level — aligned with SAE JA1011/1012 RCM methodology and focused on function, operating context, and failure consequence.

Built on 30+ years of RCM workshop facilitation and reliability analysis practice by Reliability Management Ltd.

Based on publicly available investigation reports and system-level reliability analysis

Real Failure Cases

Power & Energy
2025
Iberian Peninsula Blackout
Cascading voltage instability when conventional power plants failed to maintain proper voltage control during high renewable penetration.
Power & Energy
2021
Texas Winter Storm Grid Collapse
Cascading failure when generation capacity assumptions did not account for simultaneous cold weather impact across multiple fuel types.
Power & Energy
2011
Fukushima Nuclear Disaster
Backup power systems were not designed for the flood levels produced by the very event that would require their use.
Process & Chemical
1988
Piper Alpha Platform Disaster
167 killed when a permit-to-work failure allowed a condensate pump to be restarted with its pressure safety valve removed.
Process & Chemical
2005
Buncefield Oil Storage Explosion
Multiple supposedly independent high-level protection devices failed at the same time during tank filling operations.
Data Centers
2017
AWS S3 Outage
Human error in a maintenance command removed more capacity than intended, revealing dependencies on subsystem initialization time.
Process & Chemical
2010
Tesoro Anacortes Refinery Disaster
Catastrophic heat exchanger rupture caused by High Temperature Hydrogen Attack (HTHA), a silent degradation mechanism invisible to standard inspection.
Process & Chemical
2010
Deepwater Horizon Blowout
Test results indicating well control problems were interpreted as equipment anomalies rather than well integrity signals.
Data Centers
2017
British Airways IT Outage
Power supply restoration sequence caused damage to multiple systems, extending recovery well beyond the initial event.

Most organizations investigate failures after they occur. Fewer examine whether their systems are designed to prevent them in the first place.

Reliability-Centered Maintenance (RCM) is the structured method for answering that question — identifying failure modes, their consequences, and the tasks that address them. Practical RCM templates, facilitation guides, and training are available at Reliability Management Ltd.

Analysis by Reliability Management Ltd — specialist RCM trainers and facilitators with 30+ years of industrial reliability engineering experience across oil & gas, power, and process sectors.