I was just reading about the top 10 data center operations mistakes and I noticed they completely missed by far the most important one of all.
It’s certainly true that lack of proper procedures, training to properly follow the procedures, and training to make sound decisions when the procedure isn’t up to the situation are all important to smooth data center operations. The military wisdom that no plan of battle ever survives contact with the enemy frequently comes into play. The most beautifully crafted and approved procedures don’t mean a thing when an unanticipated situation comes up and nobody involved can make a smart decision on their own.
The biggest mistake by far, in my opinion, and the one we’ve seen the most examples of in the last several years of data center failures, is failure to analyze risks sensibly. A large percentage of these outages have been the direct result of high-risk, zero-gain actions.
One good example of this is the irresistible urge to mess around inside live electrical equipment. Little to nothing you can do in a data center is more dangerous, both to the health and safety of the person doing the work and to the operation of the data center itself, than working on exposed high voltage power. The result of screwing this up can be spectacularly catastrophic (witness the transformer explosion at The Planet’s data center in 2008 due to a miswired added high voltage circuit).
Given the hazard, you would think that data centers would reserve that kind of work for only the most extreme need. However, many of those failures have come from the most trivial of purposes. One of the bad ones from this year involved a “maintenance” operation to manually check phase rotation in a panel (the correct A-B-C ordering of three-phase power). Since phase rotation is pertinent only to three-phase electric motors, it is a ZERO issue for data center IT equipment and far from justifiable, given the risk.
It comes down to a healthy dose of common sense. If you’re going to take all of your data center generator capacity offline to do something trivial and you only have flywheel UPS systems with 15 seconds of run time, that’s probably a BAD choice. If you know you can restore generator capacity in far less time than the run time of your UPS, that makes a lot more sense.
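To make that comparison concrete, here is a minimal sketch of the sanity check in Python. The function name, the parameters, and the 2x safety factor are all my own assumptions for illustration, not an industry standard; the point is only that the time to get generators back has to fit comfortably inside the UPS run time before the work is worth considering.

```python
def generator_outage_is_sensible(ups_runtime_s, generator_restore_s, safety_factor=2.0):
    """Rough go/no-go check for taking generator capacity offline.

    ups_runtime_s       -- how long the UPS can carry the load, in seconds
    generator_restore_s -- worst-case time to bring generators back, in seconds
    safety_factor       -- hypothetical margin; 2.0 means the UPS must last
                           at least twice as long as the restore takes
    """
    if ups_runtime_s <= 0 or generator_restore_s < 0 or safety_factor < 1:
        raise ValueError("run times must be positive and safety_factor >= 1")
    # Only proceed if the restore time, padded by the margin, fits in the UPS run time.
    return generator_restore_s * safety_factor <= ups_runtime_s


# A 15 second flywheel UPS with a 2 minute generator restore: bad idea.
flywheel_ok = generator_outage_is_sensible(ups_runtime_s=15, generator_restore_s=120)

# A 10 minute battery UPS with the same 2 minute restore: much more reasonable.
battery_ok = generator_outage_is_sensible(ups_runtime_s=600, generator_restore_s=120)

print(flywheel_ok, battery_ok)
```

Passing this check doesn't make the job a good idea by itself. It only tells you the risk is survivable; you still need an actual gain to justify doing the work at all.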
Don’t do risky things for no gain and you’ll avoid the nightmares of the most preventable data center operations mistakes. It’s just that easy.