Tonight I was reading about speeding up chiller restarts after a power failure. Given the number of recent high-profile data center failures, I thought I’d talk some about disaster recovery from an uncontrolled shutdown.
So, despite the best laid plans, it happened. No fans howling, deathly quiet, and you’re sitting there bathed in the glow of the emergency lights. Your facility is down flat and it wasn’t graceful. Now how do you go about recovering?
The power source for most data centers is a triad: commercial AC, a UPS or DC power plant with battery string, and a generator. As a rule, don’t even try to start restoration without at least two out of three. Flaky AC can cause serious damage to unprotected equipment, generators can dump at the drop of a hat (IBM and Air New Zealand), and batteries have a finite run time.
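The two-out-of-three rule is simple enough to write down as a check. This is only a minimal sketch: the function name and the three boolean inputs are my own illustration, and how you actually determine that a source is healthy is up to your own monitoring.

```python
def safe_to_restore(commercial_ac: bool, battery: bool, generator: bool) -> bool:
    """Return True when at least two of the three power sources are available.

    Each argument is a hypothetical "this source is healthy" flag; the check
    itself is just the two-out-of-three rule of thumb.
    """
    # Booleans count as 0 or 1, so summing them counts the healthy sources.
    return sum([commercial_ac, battery, generator]) >= 2
```

For example, `safe_to_restore(True, False, True)` passes the check, while commercial AC alone does not.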
The first order of business is to take as much load offline as possible. This will reduce the strain on the power system from trying to restart everything at once. First priority for restored AC or generator power should be recharging the UPS or DC power plant battery string. The battery backup will absorb any transients from unstable main AC power.
Second priority would be a restart of the cooling systems. In this case, free air cooling has it all over chillers, since it takes virtually no time to return to operating temperature and the free air cooling will also put much less strain on the power system.
Third is to restore the operations center. This establishes full control over the power and cooling systems and preps for restart of the network and servers.
Next is restoration of the network, working from the edge inward. Restore edge routers and uplinks, then work backwards toward the servers. Restore critical servers such as DNS at this stage as well.
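The order so far can be kept honest with a tiny runbook helper. This is a sketch only: the step names and the `next_step` function are hypothetical labels for the stages described above, not part of any real tool.

```python
# Restoration stages in the order described above (names are illustrative).
RESTORE_ORDER = [
    "shed_load",
    "recharge_batteries",
    "restart_cooling",
    "restore_operations_center",
    "restore_edge_network",
    "restore_critical_servers",
]


def next_step(completed):
    """Return the next stage to run, or None when every stage is done.

    Raises ValueError if the completed stages do not match the start of
    RESTORE_ORDER, i.e. something was skipped or done out of sequence.
    """
    done = list(completed)
    if done != RESTORE_ORDER[:len(done)]:
        raise ValueError("steps completed out of order: %r" % (done,))
    if len(done) == len(RESTORE_ORDER):
        return None
    return RESTORE_ORDER[len(done)]
```

Calling `next_step(["shed_load", "recharge_batteries"])` returns `"restart_cooling"`, and trying to jump straight to cooling raises an error instead of letting you skip the battery recharge.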
I recommend restoring the internal network and servers in blocks. This guards once again against restart overload and makes sure that any problems that happen on restart are small enough to deal with easily.
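One way to plan those blocks is to group loads so that no single block exceeds a power budget. The sketch below assumes you have a rough startup draw in kW for each load and a per-block limit you are comfortable with; the function name, the rack names, and the numbers are all made up for illustration.

```python
def plan_restart_blocks(loads, max_block_kw):
    """Group (name, kw) loads, in the order given, into restart blocks.

    A new block is started whenever adding the next load would push the
    current block's total over max_block_kw.
    """
    blocks = []
    current, current_kw = [], 0.0
    for name, kw in loads:
        if kw > max_block_kw:
            raise ValueError("%s alone exceeds the block limit" % name)
        if current and current_kw + kw > max_block_kw:
            blocks.append(current)
            current, current_kw = [], 0.0
        current.append(name)
        current_kw += kw
    if current:
        blocks.append(current)
    return blocks


# Hypothetical racks and startup draws, restored in two blocks of up to 10 kW.
racks = [("rack-a", 4), ("rack-b", 5), ("rack-c", 3), ("rack-d", 6)]
print(plan_restart_blocks(racks, max_block_kw=10))
# [['rack-a', 'rack-b'], ['rack-c', 'rack-d']]
```

Bring up one block, confirm power and cooling are holding steady, then move on to the next. Keeping the loads in their given order means you can list them by priority first.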
Follow these steps and you should have an operating data center again. The next challenge is to identify why the complete power failure happened. Did the generator run out of fuel? Did a transfer switch fail? Was the power down long enough to run out the batteries?
Hopefully you’ll never need this process for gracefully restarting a data center, but it could keep you from making a bad situation worse!
Vern, SwiftWater Telecom