Today we got the story about Amazon’s outage of part of its EC2 cloud. In this post, I’ll examine what happened and what you can do to avoid the same fate.
Data center power distribution is even more critical today because of the spread of the data center cloud. Because the cloud concentrates so many customers onto shared infrastructure, the failure of a poorly designed power system can wipe out far more customer service than ever before.
The EC2 outage was caused by the failure of one PDU, followed by the failure of its backup while the first one was being repaired. Two almost simultaneous PDU failures make me wonder if Amazon is buying them from a shady guy in a back alley (hey mister, ya vant to buy a PDU?).
The problem here was that the reliability of the PDUs wasn’t properly characterized. To choose a redundancy level of n+x (n+1, n+2), you have to determine the chances that multiple failures will leave you without enough capacity to operate. You need enough redundancy to be confident that you can repair one failure before a second one hits. The sad part is that this is just another PR black eye for cloud reliability (especially when you add in 5 hours to restore customer workload).
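To make the n+x decision a little more concrete, here is a rough sketch of the arithmetic in Python. It is my own illustration, not anything Amazon has published: it assumes PDUs fail independently, that each one can be summed up by a mean time between failures (MTBF) and a mean time to repair (MTTR), and that the function names and the example figures are hypothetical. Real PDUs that share a batch, a feed, or a firmware bug will not fail independently, so treat the result as an optimistic floor.

```python
from math import comb


def unit_unavailability(mtbf_hours, mttr_hours):
    """Long-run fraction of time a single unit is down (assumed steady state)."""
    return mttr_hours / (mtbf_hours + mttr_hours)


def prob_capacity_shortfall(n, x, q):
    """Probability that more than x of the n+x units are down at once.

    n: units needed to carry the load
    x: spare units
    q: per-unit unavailability, assumed identical and independent
    """
    total = n + x
    return sum(
        comb(total, k) * q**k * (1 - q) ** (total - k)
        for k in range(x + 1, total + 1)
    )


def min_spares(n, q, target, max_spares=10):
    """Smallest x whose shortfall probability is at or below target, else None."""
    for x in range(max_spares + 1):
        if prob_capacity_shortfall(n, x, q) <= target:
            return x
    return None


if __name__ == "__main__":
    # Hypothetical numbers: one failure per 990 running hours, 10 hours to repair.
    q = unit_unavailability(990, 10)
    for spares in range(3):
        print(spares, prob_capacity_shortfall(1, spares, q))
    print("spares needed:", min_spares(1, q, 1e-5))
```

The useful part is how fast the picture changes with repair time: the longer a failed PDU sits waiting for repair, the larger q gets, and the more spares you need to hit the same target.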
Amazon swings, and they whiff.
Vern, SwiftWater Telecom