If I may borrow a line from Douglas Adams, “don’t panic” is perfect advice for when things go squirrelly in the data center. In this article, I’m going to talk about what happens when operations or maintenance procedures go wrong and how to handle them with minimum impact on your customers.
If one thing was made clear in 2009, it’s that no matter how big you are, a maintenance operation can blow up on you. Many of this year’s major catastrophes would have been totally avoidable with common sense engineering practices (for example, the DNS debacle in Sweden that resulted from running an untested script). Sometimes, though, no matter how much you test and plan and plan and test, things can go south anyway, and from the direction you least expect. Unfortunately, most of the big newsmaking failures have given us negative examples of how NOT to deal with it.
This week, I did a planned maintenance to replace the back end storage servers on our cloud service with upgraded equipment to allow for lots of future growth. Customers were notified of the time, nature, and impact of the work. I started the maintenance by shutting down all the cloud virtuals, then the cloud hosts, and then the storage servers. Changeout of the storage servers went almost perfectly smoothly: the storage servers came up and were verified, and the cloud hosts were restarted and running. Then the fun started. I couldn’t restart any virtuals because Xen Cloud Platform (XCP) complained about the license being expired.
The first tip is, don’t panic. Having a serious data center service outage can be stressful, what with complaining customers and pressure from the boss. The more stress and pressure, the less effectively you respond to the problem. Pressure does little except distract the mind from the real problem at hand.
The second is to leave yourself enough room to deal with the problem. I anticipated this being only a two-hour maintenance, but I had four more hours available in the maintenance window before significantly impacting customers.
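To make the arithmetic of that buffer concrete, here is a minimal sketch of tracking how much slack is left in a window. The function name and the date are hypothetical, chosen only for illustration; the only figures taken from this maintenance are the two planned hours and the four spare hours.

```python
from datetime import datetime, timedelta

def remaining_slack(window_start, window_length, now):
    """Return the time left before the maintenance window closes.

    A negative result means the window has been overrun.
    """
    return (window_start + window_length) - now

# Hypothetical start time; six-hour window = two planned hours + four spare.
start = datetime(2009, 12, 1, 22, 0)
window = timedelta(hours=6)

# At the planned end of the work, four hours of slack remain.
slack_at_planned_end = remaining_slack(start, window, start + timedelta(hours=2))

# Two hours past the planned end, there are still two hours in hand.
slack_two_hours_late = remaining_slack(start, window, start + timedelta(hours=4))
```

Checking a number like this at each decision point shows whether there is still time to try a fix or whether it is time to fall back.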
Third, assess the true scope of the issue so you don’t waste time on things that aren’t really the problem. A quick check confirmed that there were no other problems in the data center, the new storage servers were working perfectly, and the cloud hosts had booted without any problem, which pretty much nailed it down to a software fault. Also, since the cloud hosts had not been touched except to shut them down and bring them back up, I could eliminate a configuration mistake.
Fourth, verify the source of the problem. This was a bit tricky: since XCP isn’t intended to require a license, the commands relating to licensing aren’t addressed in the manual. A bit of Google searching and poking around the system turned up the proper command, which revealed that XCP did indeed think its license had expired.
Fifth, take action to verify that the problem can be worked around. I took one cloud host out of service and verified that a reinstall of XCP would reset the license for another 30 days (note that installing XCP is quite painless).
Sixth, restore at least a minimal amount of service. Slow operation or running under capacity is still better than having customers out of service entirely. I identified the minimum number of hosts needed to provide customer services, applied the workaround, and put each host back to work as soon as it was ready. The first customer services came back two hours later than the anticipated end of the maintenance, but still well within the window.
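As a rough illustration of sizing that minimum set of hosts, the sketch below estimates a lower bound from memory alone. This is an assumption for the sake of example, not how I sized things that night: the function name and all the figures are hypothetical, and a real estimate would also have to account for CPU, storage I/O, and how the virtuals actually pack onto hosts.

```python
import math

def min_hosts_needed(vm_memory_gb, host_memory_gb):
    """Lower bound on identical hosts needed to hold every VM's memory.

    Ignores packing constraints, so the true number may be higher.
    """
    if host_memory_gb <= 0:
        raise ValueError("host_memory_gb must be positive")
    total = sum(vm_memory_gb)
    return math.ceil(total / host_memory_gb)

# Hypothetical figures: five customer virtuals on 32 GB hosts.
vms = [4, 8, 2, 16, 4]
hosts = min_hosts_needed(vms, 32)
```

Working from a lower bound like this gives a number of hosts to fix first, with the rest brought back as time allows.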
Seventh, communicate with customers. I provided all affected customers with a notice that we’d had an unplanned software failure during the maintenance, that all services were up and running albeit at reduced performance, and that we were addressing the software issue with the appropriate people. If you let customers guess what’s wrong, they’ll always assume the worst and react that way.
The end result was that all customer services were operational at the end of the maintenance window, I knew exactly what and where the problem was, and I knew who the right person was to take care of it. After a brief “HELP!” message detailing the problem and a little bit of sleep, I had a corrected binary (many thanks to Dave Scott of Citrix for his speedy response!). I fixed the worked-around hosts, then fixed the rest of the hosts, with little to no pain to the customer at all.
It’s bad enough to be sandbagged by an unexpected problem in the data center, but how you respond to it determines whether you come out the hero or the goat.
Vern, SwiftWater Telecom