Today I’ve been reading about Google’s GMail outage. This outage was triggered by routine router maintenance that reduced capacity below demand and produced a cascading overload.
This type of self-inflicted catastrophe seems to be becoming far more common. In this post, I’ll be examining some of the failures that lead to this kind of result.
The first failure is the failure to anticipate a cascading failure and create cutouts and other controls to contain it. There’s no shame in a single failure; the shame is in allowing it to spread out of control. There’s no substitute for knowing your facility and having the proper tools to know exactly what’s happening.
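The cutout idea can be sketched in a few lines. This is a minimal, hypothetical load-shedding gate, not any real Google mechanism: when offered load exceeds what a node can serve well, it rejects the excess fast instead of degrading everything and pushing the overload onto neighbors.

```python
# Hypothetical overload cutout: admit requests only up to a safe limit,
# and shed the rest so overload can't cascade. All names and numbers
# here are illustrative assumptions.

class OverloadCutout:
    def __init__(self, safe_capacity):
        self.safe_capacity = safe_capacity  # requests we can serve well
        self.in_flight = 0

    def try_admit(self):
        """Admit a request only while under the safe limit."""
        if self.in_flight >= self.safe_capacity:
            return False  # shed load: fail fast rather than cascade
        self.in_flight += 1
        return True

    def release(self):
        self.in_flight -= 1

cutout = OverloadCutout(safe_capacity=2)
results = [cutout.try_admit() for _ in range(3)]
print(results)  # [True, True, False] -- the third request is shed
```

The point isn’t the particular limit; it’s that the rejection happens deliberately, at a boundary you chose in advance, instead of wherever the system happens to fall over.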
The second failure is the failure to plan the maintenance properly. Of course it’s important to plan the maintenance so that things go right, but it’s even more important to plan for things going wrong. Every step should have a way for the maintainer to know if it had a negative effect, and a way to immediately back it out, returning things to the previous state.
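That per-step check-and-back-out discipline can be sketched as a tiny runbook runner. The step structure, names, and toy router state below are all assumptions for illustration, not a real tool:

```python
# Illustrative runbook runner: every step carries its own health check
# and an immediate rollback, so a change that goes wrong is undone at
# once and earlier steps are backed out in reverse order.

def run_maintenance(steps):
    """Apply steps in order; on a failed check, restore the prior state."""
    done = []
    for step in steps:
        step["apply"]()
        if not step["verify"]():
            # Negative effect detected: undo this step and all earlier ones.
            step["rollback"]()
            for earlier in reversed(done):
                earlier["rollback"]()
            return False
        done.append(step)
    return True

# Toy example: take one of two routers offline only if one is enough.
state = {"routers_online": 2, "routers_needed": 2}

steps = [{
    "apply": lambda: state.update(routers_online=1),
    "verify": lambda: state["routers_online"] >= state["routers_needed"],
    "rollback": lambda: state.update(routers_online=2),
}]

print(run_maintenance(steps))   # False: the capacity check fails
print(state["routers_online"])  # 2: the change was backed out
```

The key property is that the rollback path is written, and exercised, before the maintenance window opens, not improvised at 3 a.m. when the check fails.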
Even more important is a plan of last resort, an “all hands on deck” emergency response to a situation that can’t simply be reversed out.
The basic failures in the Google outage really leave me shaking my head. If you have n+1 redundancy (enough capacity to meet demand, plus one spare unit), reducing that to n is a risky proposition. Reducing to n-1 or lower is a recipe for absolute disaster.
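The arithmetic here is brutally simple, which is what makes the failure so frustrating. A quick sketch, with illustrative numbers only:

```python
# Redundancy headroom: with n units needed to carry the load and n+1
# installed, taking one unit offline for maintenance leaves zero spare;
# taking two leaves you below demand. Numbers are made up for the sketch.

def headroom(units_installed, units_needed, units_out):
    """Spare units remaining once maintenance takes some offline."""
    return units_installed - units_out - units_needed

installed, needed = 4, 3               # n+1 redundancy with n = 3
print(headroom(installed, needed, 1))  # 0  -> running at exactly n, no margin
print(headroom(installed, needed, 2))  # -1 -> n-1: demand exceeds capacity
```

At zero headroom, any single additional failure during the window tips you into overload; at negative headroom, you’re already there before anything else goes wrong.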
So what is the takeaway from these failures? Plan for the possibility of failure and how to mitigate it BEFORE it impacts your customers.
There are plenty of competitors waiting to take your business; don’t help them by shooting yourself in the foot!
Vern, SwiftWater Telecom
Data Center, Telecom, Engineering Services