Today I’ve been reading more of the details about yet another recent data center power failure at the Amazon EC2 cloud computing service. There isn’t much they could do about having a car crash into a utility pole, but the rest of it is purely ridiculous.
1. The transfer switch that failed came with a different default configuration than their regular ones. What kind of a bonehead installs a critical piece of gear like a transfer switch (or any other piece of data center infrastructure equipment) without checking the configuration? Operate like this and you deserve what you get.
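One low-tech guard against this kind of mistake is to diff every newly installed unit’s settings against a known-good baseline before it goes live. A minimal sketch of the idea (the setting names and values here are hypothetical, not any real transfer switch’s configuration):

```python
# Sketch: compare a newly installed unit's settings to a golden baseline
# before putting it in service. Field names/values are made up for
# illustration, not a real transfer switch API.

BASELINE = {
    "transfer_delay_s": 0,        # switch to backup immediately
    "retransfer_delay_s": 300,    # wait for utility to stabilize
    "source_priority": "utility",
    "auto_retransfer": True,
}

def config_drift(installed: dict, baseline: dict = BASELINE) -> dict:
    """Return {setting: (installed_value, expected_value)} for every
    setting that doesn't match the baseline."""
    return {k: (installed.get(k), v)
            for k, v in baseline.items() if installed.get(k) != v}

# A unit that shipped with a different factory default gets flagged:
print(config_drift({"transfer_delay_s": 0, "retransfer_delay_s": 300,
                    "source_priority": "generator",
                    "auto_retransfer": True}))
# → {'source_priority': ('generator', 'utility')}
```

The point isn’t the code, it’s the discipline: no piece of power gear goes into the critical path until its configuration matches the documented standard.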
2. This is a perfect example of why trying to save money with the absolute minimum of backup run time is a dumb idea. Sure, if everything goes perfectly, 15 seconds of battery to get a generator started is fine. Gambling that you won’t need enough run time to ride through a problem is a bad risk.
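A back-of-the-envelope way to size battery run time is to assume the generator does *not* start on the first try and budget for retries plus a margin. The numbers below are hypothetical, just to show the arithmetic:

```python
# Sketch: size UPS battery run time against generator start failures.
# All figures are hypothetical examples, not anyone's actual design.

def required_runtime_s(start_time_s: float, retries: int,
                       retry_delay_s: float, margin: float = 2.0) -> float:
    """Battery run time needed to cover the nominal generator start
    plus several failed start attempts, with a safety margin."""
    worst_case = start_time_s + retries * (retry_delay_s + start_time_s)
    return worst_case * margin

# Nominal 15 s start, but budget for two failed attempts 10 s apart:
print(required_runtime_s(15, retries=2, retry_delay_s=10))  # → 130.0
```

Even with generous assumptions, the answer comes out at minutes of battery, not seconds, which is exactly why shaving run time to the bone is a gamble.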
3. Excusing this by saying only some customers were affected does not inspire confidence. Just how many more of these misconfigured transfer switches are out there waiting to screw up?
This is the first time I’ve ever had anyone double up on data center bozo of the week. I’ll take suggestions from the floor on what would be an appropriate award for this dubious honor.
Email or call me or visit the SwiftWater Telecom web site for green and reliable data center DC power, bozo free!
Today we got the story about Amazon’s outage of part of its EC2 cloud. In this post, I’ll examine what happened and what you can do to avoid the same fate.
Data center power distribution is even more critical today because of the spread of cloud computing. With the cloud concentrating many customers’ workloads onto shared infrastructure, the failure of a poorly designed power system can wipe out far more customer services than ever before.
The failure of the EC2 section was caused by the failure of one PDU, followed by the failure of the backup while the first one was being repaired. Two almost simultaneous failures of PDUs makes me wonder if Amazon is buying them from a shady guy in a back alley (hey mister, ya vant to buy a PDU?).
The problem here was that the reliability of the PDUs wasn’t properly characterized. To choose a redundancy level of N+x (N+1, N+2, and so on), you have to determine the chances that multiple failures will leave you without enough capacity to operate. You need enough redundancy to be confident you can repair one failure before a second one occurs. The sad part is that this is just another PR black eye for cloud reliability (especially when you add in 5 hours to restore customer workloads).
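The N+x decision above can be sketched with a simple binomial model: given a per-unit unavailability, what is the chance that more units are down at once than your spares can cover? This assumes independent failures (real analysis must also consider common-mode failures, which an independence model hides), and the unavailability figure here is a made-up example:

```python
from math import comb

# Sketch: P(more than x_spare PDUs down at once out of n_needed + x_spare),
# assuming independent unit failures. Hypothetical numbers only.

def p_overload(n_needed: int, x_spare: int, p_unit_down: float) -> float:
    """Probability that more than x_spare units are simultaneously down,
    i.e. not enough capacity remains to carry the load."""
    total = n_needed + x_spare
    return sum(comb(total, k) * p_unit_down**k * (1 - p_unit_down)**(total - k)
               for k in range(x_spare + 1, total + 1))

# Suppose each PDU is unavailable 0.1% of the time (repair window included):
for x in (1, 2):
    print(f"N+{x}:", p_overload(n_needed=4, x_spare=x, p_unit_down=0.001))
```

The per-unit unavailability figure is exactly the number that has to be properly characterized: get it wrong (or ignore common-mode failures, as appears to have happened here), and the N+1 you paid for doesn’t deliver the protection you thought it did.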
Amazon swings, and they whiff.
Vern, SwiftWater Telecom
I see that parts of Amazon’s EC2 data center cloud went down with power failures yesterday. I’ve written quite a bit on how not to let power problems wipe out your data center or cloud (apparently nobody is listening), but 5 hours to restore customer virtual servers? Yeek!
The word for the day is “greenwashing”. Does your green data center walk the walk?
Vern, SwiftWater Telecom