Today’s post comes from the postmortem of the Feb 24 data center power outage that took down Google’s AppEngine cloud computing service. The issue I see here is not the power outage itself, or the fact that it wiped out such a large chunk of Google’s service (although that is disturbing), but the response to it.
I don’t know why so many of the large data center operators can’t seem to hold their power infrastructure together. With the consolidation of services brought about by cloud computing, a power outage of the same size can now affect far more services and customers than it used to. The flip side of engineering the power system for maximum reliability, and segmenting the power so that individual failures only impact small parts of the service, is having an efficient and sensible response when something unforeseen happens.
The first problem with the response was that Google didn’t have anyone present who understood the system well enough to freestyle its recovery. When it comes right down to it, there’s no substitute for having someone who knows what’s really going on inside of things.
Having ordinary data center staff handle this kind of failure wouldn’t be a disaster, provided that there was correct procedural documentation and the failure actually fit the procedures that were written. The problem with this approach is that things may not obligingly fail in the way that you’re prepared for. Throw in a bunch of official procedures that haven’t been properly updated, and now you have a real recipe for chaos.
The final point to this is judgment. The staff has to recognize that, if they’re not qualified to freestyle a fix to the problem, they’re probably not qualified to debug a written procedure that doesn’t work. There shouldn’t be any debate or question about it. If the failure is impacting customers and the written procedure doesn’t apply or doesn’t work, fail over to backup, immediately.
One of the worst things that you can do is engage in panic mode engineering (especially on something you’re not qualified for) while customers are down and you have an operational backup facility available. The priority is not to debug the original problem; the priority is to get the customers back in service. When the customers are happy again, then you have breathing space to straighten out the original snafu.
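To make the rule concrete, here is a minimal sketch of that decision written as code. It is only an illustration of the argument above, not anything taken from Google's postmortem: the `Action` names and the `next_action` function are hypothetical, and the "escalate" branch for the case where no backup exists is my own assumption.

```python
from enum import Enum


class Action(Enum):
    # Hypothetical action names, chosen for illustration only.
    FOLLOW_PROCEDURE = "follow the written procedure"
    FAIL_OVER = "fail over to backup"
    ESCALATE = "escalate to someone who knows the system"
    INVESTIGATE = "debug the original problem without time pressure"


def next_action(customers_impacted, procedure_applies, procedure_failed, backup_available):
    """Pick the next step during an outage, putting customers first."""
    if not customers_impacted:
        # Customers are fine, so there is breathing space to debug.
        return Action.INVESTIGATE
    if procedure_applies and not procedure_failed:
        # The failure fits the documentation and the procedure is still viable.
        return Action.FOLLOW_PROCEDURE
    if backup_available:
        # No debate: do not debug the procedure while customers are down.
        return Action.FAIL_OVER
    # Assumption: with no backup to fail over to, get an expert involved.
    return Action.ESCALATE


# A procedure that does not fit the failure, with customers down and a backup ready:
print(next_action(True, False, False, True).value)
```

The point of writing it this way is that the failover branch has no "try a bit longer" option in front of it. Once the procedure doesn't apply or doesn't work, the only question left is whether a backup exists.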
Make everything as solid and robust as you can, prepare for the worst just in case, but don’t lose sight of the fact that the real priority is the customer.