This weekend’s adventure in the data center came courtesy of an expensive (and now defunct) APC rackmount power strip. What was interesting about this was not the failure of the unit (although I’ve never seen a catastrophic power strip failure in my 20+ years of this) but the cascading effects.
What are cascading effects? Cascading effects are widespread problems that begin from a small starting point and snowball into a much larger problem. In this case, the APC failure caused a fair amount of equipment other than what was directly connected to it to fail (leading to a near meltdown of the BlackBerry).
Cascades can make it very difficult to diagnose the fault. Large amounts of “collateral damage” can lead to information overload (yes, there IS such a thing as too much info!). This can easily lead to things taking far too long to diagnose or an inaccurate result (which can be awful if a tech drives 3 hours with the wrong part).
The first key to preventing the dreaded cascade is segmentation. This provides natural “breaks” to block the spread of a cascade, limiting the size and effect of the failure. Remember, the smaller the scope, the more likely a fast and accurate diagnosis is. It may seem easier and faster to just tie everything into a flat, unified architecture, but that’s just opening the door.
The second key is to program equipment intelligently, if not to stop the cascade, at least not to contribute to it. A great example of how to do this wrong was one of the recent Google outages. As an overloaded router responded slower and slower, it recognized that it wasn’t performing to par and simply dumped its load on the rest of the routers in the pool. This caused more routers to overload, dumping their loads to the ever shrinking pool, until, finally, there weren’t any routers left at all to serve any traffic. Now you have the interesting conundrum of how to restore service, since any router you managed to clear and restore to service would almost immediately overload again!
The final key is not to overload with information about the failure. Identify dependencies and set your monitoring system up accordingly (device A is dependent on device B to work so don’t alarm on A if B is down). Remember, concise and accurate information gets the diagnosis and the fix the fastest!
There is a brief rundown on keeping the evil cascades out of your data center. Now I’m going to take this APC unit out in the parking lot and run over it with a truck.
Vern, SwiftWater Telecom
data center, web hosting, Internet engineering