Data Centers
2017

British Airways IT Outage

The power restoration sequence damaged multiple systems, extending recovery well beyond the initial outage.

Initiating Event

British Airways' data center in London experienced a power failure. The primary power distribution system failed, causing loss of power to critical IT systems.

Initial Response

Backup systems worked as designed. Battery backup and diesel generators took over and kept systems running through the initial failure. For a short time, everything worked correctly.

The Restoration Sequence

Once primary power was restored, the recovery sequence began. Power was brought back to systems gradually, in a specific order. However, that order did not account for what each dependent system needed to have online before it could recover.
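One way to make a restoration order respect dependencies is to derive it from an explicit dependency map rather than maintain it by hand. The sketch below is illustrative only: the system names and the map are hypothetical and do not describe British Airways' actual architecture. It uses Python's standard graphlib module (Python 3.9 or later) to produce an order in which every system comes after the systems it depends on.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical dependency map (an assumption for illustration):
# each key lists the systems that must be online before it starts.
DEPENDENCIES = {
    "network": set(),
    "storage": {"network"},
    "database": {"storage", "network"},
    "messaging": {"network"},
    "booking_app": {"database", "messaging"},
}


def restoration_order(dependencies):
    """Return a power-on order in which each system follows its dependencies."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of nodes that form the cycle.
        raise ValueError(
            f"circular dependency, no safe order exists: {exc.args[1]}"
        ) from exc


order = restoration_order(DEPENDENCIES)
print(" -> ".join(order))
```

A circular dependency raises an error instead of producing a sequence, which surfaces a mapping problem before a real recovery rather than during one.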

Unplanned Cascade

When certain systems powered back on, they attempted to access systems that were still offline. This created cascading failures as systems tried to boot or recover in an undefined state.
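A common guard against this kind of cascade is to have each system wait until its dependencies report ready, and to refuse to start if they do not become ready in time. The following is a minimal sketch under stated assumptions: the function names, the readiness callbacks, and the timeout values are all hypothetical, and a real readiness check would probe an actual service rather than call a local function.

```python
import time


def wait_until_ready(is_ready, timeout_s=30.0, interval_s=0.5,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll is_ready() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def start_when_dependencies_ready(name, dependency_checks, start, **wait_kwargs):
    """Start a system only after every dependency reports ready.

    dependency_checks maps a dependency name to a zero-argument callable
    that returns True once that dependency is usable (hypothetical interface).
    """
    for dep_name, check in dependency_checks.items():
        if not wait_until_ready(check, **wait_kwargs):
            # Fail loudly instead of booting into an undefined state.
            raise TimeoutError(
                f"{name}: dependency {dep_name!r} not ready; refusing to start"
            )
    start()
```

Failing with an explicit timeout leaves the system cleanly stopped, which is usually easier to recover from than a partial boot against missing dependencies.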

Corruption and Locks

Some systems were left in corrupted states during the restoration sequence. Others ended up with file locks or resource conflicts. These corrupted states persisted even after power was fully restored.

Extended Recovery

What should have been a few minutes of recovery took many hours. The restoration sequence itself had created new failure modes that required manual intervention to resolve.

Why It Happened

The backup systems were tested and worked well. The restoration sequence was not tested as thoroughly under real failure conditions. The dependencies between systems during recovery were not fully mapped.

Assumption About Restoration

The assumption was that once power was restored and systems booted, they would return to normal. In reality, restoration is not the mirror image of failure. The sequence matters. The timing matters.

Testing Gaps

Disaster recovery testing had focused on failover. Testing had not focused on the restoration sequence itself. This sequence had never been run end-to-end under realistic conditions.
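One inexpensive check that a restoration drill could include is validating the planned power-on sequence against the known dependency map before anything is switched on. The sketch below assumes a hypothetical map and sequence; it is not a substitute for running the sequence end-to-end, only a way to catch ordering mistakes on paper first.

```python
def ordering_violations(sequence, dependencies):
    """Return (system, missing_dependency) pairs for a planned power-on sequence.

    A pair is reported whenever a system is scheduled to start before one
    of the systems it depends on.
    """
    online = set()
    violations = []
    for system in sequence:
        for dep in sorted(dependencies.get(system, ())):
            if dep not in online:
                violations.append((system, dep))
        online.add(system)
    return violations


# Hypothetical example: the application is scheduled before its database.
deps = {"network": set(), "database": {"network"}, "booking_app": {"database"}}
plan = ["network", "booking_app", "database"]
problems = ordering_violations(plan, deps)
print(problems)
```

An empty result means the plan is consistent with the map; it says nothing about dependencies that were never mapped, which is the gap only a realistic end-to-end drill can expose.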

Applying This

Do you test recovery sequences as thoroughly as you test failure? Do you understand the timing and ordering dependencies when systems come back online? Have you tested whether your backup and restoration systems introduce new failure modes?

What happens during recovery is often more complex and fragile than what happens during failure. Do you know what that means for your systems?

Events like this are rarely unique. Similar patterns appear across many industries and asset types.
