Tonight I’ve been reading about the FAA computer glitches that have plagued air travel. Reading this reveals that the FAA has raised the cascading failure to a new level.
Cascading failures have had a lot of play in data center and IT flops lately. The mighty powerhouse of Google had a major outage of its Gmail service because maintenance gone wrong reduced router capacity below the amount needed for service, overloading the remaining routers in a chain reaction. A number of major data center outages have been the result of cascading power failures.
In the FAA incident, a software misconfiguration during maintenance led to the failure of a primary router, and the backup router failed to take the load. If you look at the previous outages, this is a common thread: failure of both primary and backup, failure of a high-reliability SONET ring, and so on. As a former SONET engineer, I can say that nothing short of a major widespread disaster should disrupt a properly designed ring.
One of the quotes at the end of the article calls it out clearly. One part misbehaves and it disrupts everything else. It’s not the disasters like fire or earthquake destroying the facilities that should worry us the most, it’s the uncontrolled cascading failure.
So, how do we prevent the cascading failure? First, be prepared to recognize aberrant results of maintenance procedures and have a way to back out of changes before they have major impact.
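The "verify, then back out" idea can be sketched in a few lines. This is a minimal illustration, not any particular FAA or vendor procedure; the function names and the toy config state are hypothetical.

```python
# Sketch of maintenance-with-rollback: apply a change, check for
# aberrant results, and automatically restore the old state if the
# check fails. All names here are illustrative assumptions.

def apply_with_rollback(apply_change, health_check, rollback):
    """Apply a maintenance change, verify it, and back out on failure."""
    apply_change()
    if health_check():
        return True          # change verified healthy; keep it
    rollback()               # aberrant result: restore the old state
    return False

# Hypothetical usage: a router config change whose health check fails.
state = {"config": "old"}

applied = apply_with_rollback(
    apply_change=lambda: state.update(config="new"),
    health_check=lambda: False,   # pretend the new config misbehaves
    rollback=lambda: state.update(config="old"),
)
print(applied, state["config"])   # False old
```

The point is that the rollback path exists *before* the change is made, so backing out is a routine step rather than an improvised scramble while the failure spreads.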
Second, avoid risky procedures of little benefit. That means no sending people with meters into live power panels to manually measure voltage just to see if connections are good. Many of these failures have human error as the trigger.
Finally, make sure that all facilities have cutouts to prevent the spread of failures. You can't always prevent mistakes, but you can prevent a mistake from spreading into a catastrophe.
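In software terms, a cutout is what's often called a circuit breaker: after repeated failures from one component, stop sending it traffic so its trouble can't pile up and drag down everything that depends on it. A minimal sketch, with hypothetical names:

```python
# Sketch of a circuit-breaker-style cutout: trip after repeated
# failures and fail fast instead of letting the failure cascade.
# Class and threshold are illustrative assumptions, not a real API.

class Cutout:
    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    @property
    def open(self):
        # An "open" cutout means the component is isolated.
        return self.failures >= self.max_failures

    def call(self, operation):
        if self.open:
            raise RuntimeError("cutout open: failing fast, not cascading")
        try:
            result = operation()
            self.failures = 0      # success resets the failure count
            return result
        except Exception:
            self.failures += 1     # count the failure toward the trip point
            raise

cutout = Cutout(max_failures=2)

def flaky():
    raise IOError("downstream component misbehaving")

for _ in range(2):
    try:
        cutout.call(flaky)
    except IOError:
        pass

print(cutout.open)   # True: further calls fail fast instead of piling up
```

A real deployment would also want a timed reset so the cutout can close again once the faulty component recovers, but the isolation step is the part that stops the chain reaction.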
Make sure your data center or network isn't the next one in the news for a preventable cascading failure.
Vern, SwiftWater Telecom