Some more details have just come out on the recent IBM data center failure that blew Air New Zealand’s systems out of the water. Believe me, it doesn’t make IBM look any better, but it is instructive.
Apparently IBM was running Air New Zealand’s mainframe from a single backup generator while performing maintenance on the UPS system. A malfunctioning oil pressure sensor caused a generator shutdown, and out went the lights. It’s also worth noting that IBM chose to do this during a high air traffic holiday weekend.
The first thing to wonder is, what in the world are they running for a setup that doesn’t have a maintenance bypass for the UPS? And what the heck are they using for transfer switches? The smart procedure in this kind of case would be either to risk a short run on unprotected mains power via a maintenance bypass around the UPS, or to run from the generator with things set up to transfer to unprotected mains power in case of generator failure. In this case, the only bet was on a single fallible piece of machinery, and a $10 part made it a bad bet.
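The transfer priority described above can be sketched as a simple decision function. This is just an illustrative model (the function and source names are hypothetical), not how any real transfer switch firmware is written, but it makes the single-point-of-failure problem concrete: with no mains fallback configured, one sensor failure drops the whole critical bus.

```python
def select_source(generator_ok: bool, mains_ok: bool) -> str:
    """Pick a feed for the critical bus during UPS maintenance.

    Preference: run from the generator; on generator failure, transfer
    to unprotected mains rather than dropping the load entirely.
    (Hypothetical sketch of the decision logic, not real ATS firmware.)
    """
    if generator_ok:
        return "generator"
    if mains_ok:
        return "mains (unprotected)"
    return "down"  # no source available: the single-point-of-failure case


# Generator healthy: run on generator as planned.
print(select_source(generator_ok=True, mains_ok=True))

# Oil pressure sensor kills the generator: with a mains fallback wired in,
# the bus survives on unprotected power instead of going dark.
print(select_source(generator_ok=False, mains_ok=True))
```

The point of the sketch is the second branch: IBM’s setup effectively had no second branch, so generator failure mapped straight to "down."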
The second thing is that they obviously didn’t clear the work (and the risk) with Air New Zealand. It was completely boneheaded scheduling for that sort of risk, and all parties should have been aware of the potential for service-affecting problems so as to be prepared to deal with them if they cropped up.
So what do we have here? Failure to recognize the critical nature of the system at hand, failure to take all available precautions to protect the system, risking everything on a single point of failure that can be expected to produce these kinds of events, failure to notify the customer of the maintenance, failure to get customer buy-in to the schedule and procedure, and failure to restore the system to operation promptly. I can’t think of anything IBM did right in the whole process.
Handle your critical data center power this way and you too can look like a bozo.
Vern, SwiftWater Telecom