The data center, the cloud, and power failure.

Today we got the story about Amazon’s outage of part of its EC2 cloud. In this post, I’ll examine what happened and what you can do to avoid the same fate.

Data center power distribution is even more critical today because of the spread of the cloud. Because the cloud concentrates many customers' workloads onto shared infrastructure, the failure of a poorly designed power system can wipe out far more customer service than ever before.

The failure of the EC2 section was caused by the failure of one power distribution unit (PDU), then the failure of the backup while the first one was being repaired. Two almost simultaneous PDU failures make me wonder if Amazon is buying them from a shady guy in a back alley (hey mister, ya vant to buy a PDU?).

The problem here was that the reliability of the PDUs wasn’t properly characterized. To choose a level of n+x redundancy (n+1, n+2), you have to determine the chance that multiple failures will leave you without enough capacity to operate. You need enough redundancy to be sure you can repair one failure before a second one occurs. The sad part is that this is just another PR black eye for cloud reliability (especially when you add in the 5 hours it took to restore customer workloads).
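To make the n+x question concrete, here is a rough sketch of the arithmetic in Python. It is not Amazon's method or anyone's official sizing procedure. It assumes every PDU fails independently at a constant rate of 1/MTBF, which is optimistic if the units share a common defect or a common upstream fault. The function names are mine, and any MTBF or repair-time figures you feed it are your own estimates. Given that one unit has already failed, it estimates the chance that enough of the remaining units fail during the repair window to drop you below n working units.

```python
import math


def p_unit_fails(window_hours, mtbf_hours):
    """Chance that one unit fails within the window.

    Assumes a constant failure rate of 1/MTBF (exponential model).
    """
    return 1.0 - math.exp(-window_hours / mtbf_hours)


def p_capacity_lost_during_repair(n, x, repair_hours, mtbf_hours):
    """Chance of dropping below n working units while one unit is in repair.

    n: units needed to carry the load
    x: spare units (the "x" in n+x)
    One unit has already failed, so n + x - 1 remain in service.
    Capacity is lost if at least x of those also fail before the
    repair finishes. Failures are assumed to be independent.
    """
    if n < 1 or x < 0:
        raise ValueError("need n >= 1 and x >= 0")
    remaining = n + x - 1
    p = p_unit_fails(repair_hours, mtbf_hours)
    # Binomial tail: probability that x or more of the remaining units fail.
    return sum(
        math.comb(remaining, k) * p**k * (1.0 - p) ** (remaining - k)
        for k in range(x, remaining + 1)
    )
```

Calling p_capacity_lost_during_repair with the same n, repair time, and MTBF but x=1 and then x=2 shows how much an extra spare buys you. If the n+1 number is not comfortably small for your real repair time, you either need n+2 or a faster repair.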

Amazon swings, and they whiff.

Vern, SwiftWater Telecom

