Data Centers
2017

British Airways IT Outage

The power restoration sequence damaged multiple systems, extending recovery well beyond the initial event.


BA IT Outage: Systems View

Initiating Event

The primary power distribution system at British Airways' data center in London failed, cutting power to critical IT systems.

Initial Response

Backup systems worked as designed. Battery backup and diesel generators kicked in, maintaining system operations during the initial failure. For a short time, everything worked correctly.

The Restoration Sequence

Once primary power returned, the recovery sequence began and power was restored to systems gradually, in a fixed order. That order, however, was not coordinated with the recovery needs of all dependent systems.

Unplanned Cascade

When certain systems powered back on, they attempted to access systems that were still offline. This created cascading failures as systems tried to boot or recover in an undefined state.
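
One common guard against this kind of cascade is a readiness gate: a system is started only after everything it depends on reports ready, and is held otherwise. The sketch below is illustrative only and does not describe British Airways' tooling. The function names, the `is_ready` probe, and the `start` callback are all hypothetical.

```python
from typing import Callable, Iterable, List


def wait_for_dependencies(
    dependencies: Iterable[str],
    is_ready: Callable[[str], bool],
    max_attempts: int = 5,
    pause: Callable[[int], None] = lambda attempt: None,
) -> List[str]:
    """Poll the dependencies; return those still not ready after max_attempts."""
    pending = list(dependencies)
    for attempt in range(1, max_attempts + 1):
        # Keep only the dependencies that still fail the readiness probe.
        pending = [name for name in pending if not is_ready(name)]
        if not pending:
            return []
        # Hypothetical hook: a real script might sleep with backoff here.
        pause(attempt)
    return pending


def start_system(
    name: str,
    dependencies: Iterable[str],
    is_ready: Callable[[str], bool],
    start: Callable[[str], None],
    **wait_options,
) -> bool:
    """Start `name` only if every dependency is ready; otherwise hold it."""
    missing = wait_for_dependencies(dependencies, is_ready, **wait_options)
    if missing:
        # Holding is safer than booting against systems that are still offline.
        return False
    start(name)
    return True


# Usage with made-up system names: "message_bus" is still offline,
# so "booking" is held rather than started in an undefined state.
online = {"database"}
started: List[str] = []
held = not start_system(
    "booking", ["database", "message_bus"], online.__contains__, started.append
)
```

The design choice here is to fail closed: a system that cannot confirm its dependencies stays down and is reported, rather than booting and discovering the problem part-way through recovery.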

Corruption and Locks

Some systems were left in corrupted states by the restoration sequence. Others were left with file locks or resource conflicts. These faults persisted even after power was fully restored.

Extended Recovery

What should have been a few minutes of recovery took many hours. The restoration sequence itself had created new failure modes that required manual intervention to resolve.

Why It Happened

The backup systems had been tested and worked well. The restoration sequence had not been tested as thoroughly under real failure conditions, and the dependencies between systems during recovery had not been fully mapped.
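
Once recovery dependencies are mapped, a safe power-on order can be derived from the map rather than assumed. The following is a minimal sketch using Python's standard `graphlib` module (Python 3.9+). The system names and the `DEPENDS_ON` map are invented for illustration and are not taken from the incident.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical dependency map: each system lists what must be up before it.
DEPENDS_ON = {
    "storage": [],
    "network": [],
    "database": ["storage", "network"],
    "message_bus": ["network"],
    "booking": ["database", "message_bus"],
    "check_in": ["booking"],
}


def restoration_order(depends_on):
    """Return one valid power-on order in which dependencies come first."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as exc:
        # exc.args[1] holds the nodes that form the detected cycle.
        raise ValueError(
            f"circular dependency, no safe order exists: {exc.args[1]}"
        ) from exc


order = restoration_order(DEPENDS_ON)
print(order)
```

A cycle in the map is worth surfacing as an error rather than working around: it means no power-on order satisfies every dependency, and the recovery plan needs a manual step or a design change at that point.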

Assumption About Restoration

The assumption was that if power is restored and systems boot, they will return to normal. In reality, restoration is not the mirror image of failure. The sequence matters. The timing matters.

Testing Gaps

Disaster recovery testing had focused on failover, not on the restoration sequence itself. That sequence had never been run end-to-end under realistic conditions.
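
A full end-to-end drill is the real test, but a documented restoration sequence can at least be checked against the dependency map beforehand. The sketch below replays a runbook order and flags any system that would power on before something it depends on. The runbook, the system names, and the `sequence_violations` helper are hypothetical.

```python
def sequence_violations(sequence, depends_on):
    """Return (system, missing_dependency) pairs for a proposed power-on order."""
    online = set()
    violations = []
    for system in sequence:
        for dependency in depends_on.get(system, []):
            if dependency not in online:
                # This system would boot while a dependency is still offline.
                violations.append((system, dependency))
        online.add(system)
    return violations


# Hypothetical dependency map and runbook order, for illustration only.
depends_on = {
    "database": ["storage", "network"],
    "booking": ["database"],
}
runbook = ["network", "booking", "storage", "database"]

print(sequence_violations(runbook, depends_on))
# prints [('booking', 'database')]
```

A check like this only covers ordering. It says nothing about timing, partial readiness, or state left behind by the outage, which is why it complements a realistic drill rather than replacing one.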

Have you tested your disaster recovery restoration sequence as thoroughly as you've tested your failover mechanisms?

Events like this are rarely unique. Similar failure mode patterns appear across many industries and asset types, often invisible until the operating context changes.

Analysis by Reliability Management Ltd — specialist RCM trainers and facilitators with 30+ years of industrial reliability engineering experience across oil & gas, power, and process sectors.