Power stability, containers, and the data center cloud …


Today’s post comes from reading about the Rackspace cloud power outage, with some more details here. Ready for this to get worse?

First, let’s look at the cause of the failure. This occurred during a test of phase rotation. What is phase rotation? Simply, it’s the sequence of the phases in 3-phase AC power: A->B->C. Phase rotation is only important in two instances. The first is operating 3-phase electric motors (reverse two phases and the motor runs backwards). The second is syncing a generator to the commercial AC power grid. Phase rotation has no impact at all on the operation of distribution transformers or IT equipment in the data center.
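For the curious, determining the rotation is straightforward from sampled waveforms. Here’s a minimal sketch (not from any particular phase monitor — the single-bin DFT approach and the synthetic 60 Hz test signals are just illustrative): phase B lagging A by 120° means ABC sequence, B leading A means ACB.

```python
import numpy as np

def phase_rotation(va, vb, vc, fs, f=60.0):
    """Return 'ABC' or 'ACB' from sampled phase voltages.

    Estimates each phase's angle at the line frequency f with a
    single-bin DFT, then checks whether B lags A (ABC sequence)
    or leads A (ACB sequence).
    """
    n = np.arange(len(va))
    ref = np.exp(-2j * np.pi * f * n / fs)  # line-frequency reference phasor
    ang = [np.angle(np.dot(v, ref)) for v in (va, vb, vc)]
    # angle of B relative to A, wrapped into (-180, 180] degrees
    d = (np.degrees(ang[1] - ang[0]) + 180) % 360 - 180
    return "ABC" if d < 0 else "ACB"

# quick check with synthetic 60 Hz waveforms sampled at 10 kHz
fs, f = 10_000, 60.0
t = np.arange(2000) / fs
a = np.sin(2 * np.pi * f * t)
b = np.sin(2 * np.pi * f * t - 2 * np.pi / 3)  # B lags A by 120 deg
c = np.sin(2 * np.pi * f * t - 4 * np.pi / 3)  # C lags A by 240 deg
print(phase_rotation(a, b, c, fs))  # prints "ABC"
```

Note that this is exactly the kind of measurement that belongs in a permanently installed sensor, not in a technician’s hands inside a live enclosure.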

So first of all, they were testing something that would have had no impact on the equipment it was feeding. Second, they were apparently testing manually in live electrical enclosures. If you’re really concerned about phase rotation, invest in an inexpensive phase monitor and wire it to alarm automatically (phase monitors are commonly used on things like elevators to prevent sudden motor reversals if the power company messes up the phase order). Manual testing on live electrical gear, especially gear without redundancy, is begging for disaster.
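The alarm-automatically idea fits in a few lines. In this sketch, read_sequence is a hypothetical callback standing in for whatever the sensing hardware actually provides, and the one-second poll interval is arbitrary:

```python
import logging
import time

def monitor_phase_sequence(read_sequence, expected="ABC", interval=1.0):
    """Poll the measured phase sequence and alarm on any deviation.

    read_sequence is a hypothetical callback returning the current
    sequence string (e.g. 'ABC' or 'ACB') from the sensing hardware.
    Returns the bad reading so trip/alarm logic can take over.
    """
    logging.basicConfig(level=logging.INFO)
    while True:
        seq = read_sequence()
        if seq != expected:
            logging.error("PHASE ROTATION ALARM: expected %s, got %s",
                          expected, seq)
            return seq  # hand off to the alarm/trip logic
        time.sleep(interval)

# example: a (simulated) sensor that reports a reversed sequence
bad = monitor_phase_sequence(lambda: "ACB")  # logs the alarm, returns "ACB"
```

No torque wrenches in live PDUs required.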

So, now that we know this was a disaster that didn’t have to happen (as are most of the data center power outages I’ve seen recently), how can we expect this to get worse? The increasing use of containerized servers for cloud services concentrates the infrastructure choke points. Where a goof in a traditional data center might take down anything from a couple of servers to a couple of racks, you can now blow 2000+ servers out of the water at once. That’s fine for someone like Google, which plans to have thousands of servers down at any one time without impact. Anyone else who isn’t massively overbuilt is going to have a serious problem.

So what are the takeaways from this? Don’t muck around with live power unless absolutely necessary, permanent sensors are cheaper than catastrophes, act to prevent cascading failures before they happen, and don’t sneeze on containers.

Vern, SwiftWater Telecom

data center facility engineering


2 responses to “Power stability, containers, and the data center cloud …”

  1. You are absolutely correct: “Don’t muck around with live power unless absolutely necessary, permanent sensors are cheaper than catastrophes, act to prevent cascading failures before they happen, and don’t sneeze on containers.”
    Testing a PDU with an Astro-Med, Hioki, or any of the other diagnostic and capture meters on the market is much easier than explaining to your clients why the outage occurred. After witnessing an engineer testing electrical connections inside a live PDU with a torque wrench, we had to reevaluate our procedures. Today our engineers go through rigorous training on the Fluke Ti25 IR cameras.
    We use them to locate potential hot spots on electrical equipment (PDUs included) and then de-energize the equipment if necessary to effect repairs.
