I’ve just been reading about how the recent Rackspace data center network failure supposedly had limited impact, along with more of the details on the incident. Just another poorly planned, poorly executed, and poorly handled botched maintenance operation.
It seems like at least half of the major data center and cloud outages this year have been the result of errors during maintenance procedures or unnecessarily risky maintenance operations. So, what went wrong, and how do you avoid becoming the next hot Twitter topic (in a bad way)?
1. They loaded an obviously inadequately tested configuration on a critical router. Test, test, and test again! The more critical the system is, the more pain from failure, the more testing needs to be done!
2. They messed with a critical router during peak load hours rather than during a normal late night maintenance window. It’s nice that their facility integration was going to take place during a late night maintenance window, but how is making a functional change to a critical router mere “preparation” that’s suitable for a high load period? If the worst happens, do NOT aggravate your customers by blowing things out of the water during the time they need service the most.
3. The length of the outage implies that they either had no way to know the change had caused a problem in the network, or they had no recovery plan, or they chose to try to reengineer on the fly, or all of the above. Test for proper operation after every step of a maintenance plan, and have an escape route to back out of changes if something unforeseen happens. It’s a bad idea to be messing around trying to figure out what went wrong while you have customers out of service. If it affects customers for more than a few minutes, that’s bad.
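The verify-and-back-out discipline in point 3 can be sketched in a few lines. This is purely illustrative: the function names are hypothetical and a dict stands in for the router, where a real maintenance script would push configuration to actual hardware (and ideally use a vendor rollback facility).

```python
# Hypothetical sketch of a single back-out-capable maintenance step.
# A dict stands in for the device; names here are illustrative only.

def run_step(device, new_config, health_check):
    """Apply a change, verify it, and roll back automatically on failure."""
    backup = dict(device)        # snapshot the pre-change state first
    device.update(new_config)    # apply the change
    if health_check(device):
        return True              # verified good: keep the change
    device.clear()
    device.update(backup)        # escape route: restore the snapshot
    return False                 # signal that the step was backed out

# Usage: a bad config fails the check and is backed out automatically.
router = {"bgp": "up", "acl": "v1"}
ok = run_step(router, {"acl": "v2-untested"},
              health_check=lambda d: d["acl"] == "v1")  # simulated failure
print(ok, router["acl"])
```

The point is structural: the snapshot is taken before the change, the health check runs immediately after it, and the rollback is automatic rather than improvised while customers are down.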
Plan as well as possible, recognize when things go wrong, back out promptly, and schedule to minimize potential customer impact. These are the not-so-secret keys to pain-free maintenance.
2009 seems to be the year for bad examples.
Vern, SwiftWater Telecom