First, let’s look at the cause of the failure. This occurred during a test of phase rotation. What is phase rotation? Simply, it’s the sequence of the phases in three-phase AC power: A->B->C. Phase rotation matters in only two situations. The first is operating three-phase electric motors (swap any two phases and the motor runs backwards). The second is synchronizing a generator to the commercial AC power grid. Phase rotation has no impact at all on the operation of distribution transformers or IT equipment in the data center.
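To make the 120 degree business concrete, here’s a quick Python sketch of a balanced three-phase source (purely illustrative, the function name is mine). With ABC rotation, phase B peaks a third of a cycle after A; swap two phases and C peaks there instead, which is exactly why the motor turns the other way:

```python
import math

def three_phase(t, f=60.0, seq=("A", "B", "C")):
    """Instantaneous per-unit voltages of a balanced three-phase source.
    The second and third phases in `seq` lag the first by 120 and 240 degrees."""
    w = 2 * math.pi * f
    lags = {name: -i * 2 * math.pi / 3 for i, name in enumerate(seq)}
    return {name: math.sin(w * t + lag) for name, lag in lags.items()}

# A peaks at a quarter cycle; look one third of a cycle later.
t_peak_A = 1 / (4 * 60.0)
t_next = t_peak_A + 1 / (3 * 60.0)

abc = three_phase(t_next)                        # B is at its peak here
acb = three_phase(t_next, seq=("A", "C", "B"))   # with B and C swapped, C peaks instead
```

Same three voltages either way; only the order they arrive in changes, and only rotating machinery (or a generator trying to sync) cares about that order.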
So first of all, they were testing something that would have had no impact on the equipment it was serving. Second, they apparently were manually testing in live electrical enclosures. If you’re really concerned about phase rotation, invest in an inexpensive phase monitor and arrange it to alarm automatically (phase monitors are commonly used on things like elevators to prevent sudden motor reversals if the power company messes up the phase order). Manual testing on live electrical, especially live electrical without redundancy, is begging for disaster.
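The logic a phase monitor implements is trivial, which is part of the point. Here’s a minimal sketch of the check (the function name and alarm message are made up for illustration, not any vendor’s firmware): a hardware monitor does this continuously and trips a relay, no one near a live bus required.

```python
def check_phase_order(measured, expected=("A", "B", "C")):
    """Compare a measured phase sequence against the expected one and
    alarm on mismatch. Any cyclic rotation of the expected order
    (ABC, BCA, CAB) is the same rotation, so all three pass."""
    cyclic = {tuple(expected[i:] + expected[:i]) for i in range(3)}
    if tuple(measured) not in cyclic:
        return "ALARM: phase rotation reversed -- lock out motor loads"
    return "OK"

print(check_phase_order(("B", "C", "A")))   # same rotation, OK
print(check_phase_order(("A", "C", "B")))   # two phases swapped, alarm
```

Wire the alarm output into your monitoring system and you get the answer the manual test was after, permanently, for the cost of a cheap relay.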
So, now that we know it was a disaster that didn’t have to be (as are most of the data center power outages I’ve seen recently), how can we expect this to get worse? The increasing use of containerized servers for cloud services concentrates the choke points for infrastructure. Where a goof in a traditional data center might take down anything from a couple of servers to a couple of racks, you can now blow 2000+ servers out of the water at once. That’s fine for someone like Google, which plans for thousands of servers to be down at any one time without impact. Anyone else who isn’t massively overbuilt is going to have a serious problem.
So what are the takeaways from this? Don’t muck around with live power unless absolutely necessary, permanent sensors are cheaper than catastrophes, act to prevent cascading failures before they happen, and don’t sneeze on containers.
Vern, SwiftWater Telecom