Tag Archives: failure

Friday data center tidbits: Virginia bozos the data center


First up today is the piece about the state of Virginia having a major data center outage. Here’s yet another catastrophe that’s the result of poorly planned and executed maintenance work (technicians were “checking for faulty storage equipment”? Shouldn’t you know whether it’s faulty BEFORE you start screwing around with it?). If there was even a chance that this could happen, it should have been done in a maintenance window well outside of business hours. Then again, it shouldn’t be surprising, considering this is the same bunch that had rampant failures in their network because they didn’t bother to specify any redundancy in the contract.

The head of the Virginia Information Technology Agency is quoted as saying that sometimes failures happen no matter what you do. That’s not the kind of thing you want to hear from your IT people right after they’ve done something glaringly dumb.

The shipment of multicolored wigs and red squeaky noses is on its way to the Virginia Information Technology Agency, data center bozos of the week!

Email or call me or visit the SwiftWater Telecom web site for RELIABLE green data center services today.

Vern


The reliable data center, BlackBerry and the failure bandwagon.


Well, here we go again. Apparently, RIM has managed to barf the BlackBerry Internet Service yet again (disclaimer: I’m writing this from my Curve). I’m fortunate enough to be using the BOLT web browser and Twibble, so the only real impact on me is the loss of email and Yahoo Messenger.

This latest in a string of high profile outages leaves me wondering about the state of data center operations today. In most businesses, it used to be that if you caused a major outage impacting customers, there would be severe consequences.

Data center operators seem to have acquired a much larger tolerance for pain. While some data center outages have been the result of simple infrastructure failure, most of the ones this year have been the result of human error: deployment of inadequately tested software, risky and unnecessary manual testing, poorly planned maintenance operations with no way to detect or back out of problems, operating storage systems without backups, and a general disregard for good system admin and engineering procedures. Couple this with lousy problem response (the BlackBerry problem has gone on for hours) and it’s a truly toxic combination.

I don’t know why data center operators have abandoned good admin, operations, and engineering practices. I do know that whatever these companies think they’re gaining from it, it wouldn’t be enough to get me to run my data center that way; I’m not tolerant enough of pain.

I don’t think I’ll be giving up my BlackBerry anytime soon, but I will be making sure I use 3rd party applications that don’t involve RIM. In the meantime, if there’s anyone out there who needs data center services that don’t require developing a high pain threshold, please let me know.

Vern, SwiftWater Telecom

The reliable data center, maintenance gone wild.


I’ve just been reading about how the recent Rackspace data center network failure had limited impact, along with more of the details on what happened. Just another poorly planned, poorly executed, and poorly responded to botched maintenance operation.

It seems like at least half of the major data center and cloud outages this year have been the result of error during maintenance procedures or unnecessarily risky maintenance operations. So, what went wrong and how do you avoid becoming the next hot Twitter topic (in a bad way)?

1. They loaded an obviously inadequately tested configuration on a critical router. Test, test, and test again! The more critical the system is, the more pain from failure, the more testing needs to be done!

2. They messed with a critical router during peak load hours rather than a normal late night maintenance window. It’s nice that their facility integration was going to take place during a late night maintenance window, but how is making a functional change to a critical router mere “preparation” that’s suitable for a high load period? If the worst can happen, do NOT aggravate your customers by blowing things out of the water during the time they need them the most.

3. The length of the outage implies that they either had no way to know the change had caused a problem in the network, or they had no recovery plan, or they chose to try to reengineer on the fly, or all of the above. Test for proper operation after every step of a maintenance plan and have an escape route to back out of changes if something unforeseen happens. It’s a bad idea to be messing around trying to figure out what went wrong while you have customers out of service; if it affects customers for more than a few minutes, that’s bad. (A rough sketch of the verify-and-back-out idea follows below.)
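To make that last point concrete, here’s a rough sketch in Python of what “check after every step and back out the moment something looks wrong” might look like as a maintenance script. The step commands (copy-config, port-enable, the example hostnames) are made-up placeholders, not real syntax for any particular router, so treat this as an illustration of the structure rather than something to paste into production.

```python
#!/usr/bin/env python3
"""Sketch of a verify-and-back-out maintenance run.

The apply/check/rollback commands below are hypothetical placeholders --
substitute whatever actually drives your own routers and switches.
"""
import subprocess
import sys

# Each step: (description, apply command, verification command, rollback command).
STEPS = [
    ("load new router config",
     "copy-config staging-router.cfg",        # hypothetical apply command
     "ping -c 5 core-gateway.example.net",    # does traffic still flow?
     "copy-config last-known-good.cfg"),      # hypothetical rollback command
    ("enable new uplink",
     "port-enable uplink2",                   # hypothetical apply command
     "ping -c 5 upstream.example.net",        # can we still reach upstream?
     "port-disable uplink2"),                 # hypothetical rollback command
]

def run(cmd):
    """Run a shell command and report whether it exited cleanly."""
    return subprocess.run(cmd, shell=True).returncode == 0

def main():
    applied = []  # steps completed so far, with their rollback commands
    for desc, apply_cmd, check_cmd, rollback_cmd in STEPS:
        print(f"applying: {desc}")
        if not run(apply_cmd) or not run(check_cmd):
            # Something broke: back out everything in reverse order
            # instead of debugging on the fly with customers down.
            print(f"FAILED at '{desc}', backing out")
            for done_desc, done_rollback in reversed(applied):
                print(f"rolling back: {done_desc}")
                run(done_rollback)
            sys.exit(1)
        applied.append((desc, rollback_cmd))
    print("maintenance complete, all checks passed")

if __name__ == "__main__":
    main()
```

The script itself isn’t the point; the point is that every step has a check and a pre-planned way back, decided on before anyone touches a critical router, not invented while the customers are out of service.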

Plan as well as possible, recognize when things go wrong, back out promptly, and schedule to minimize potential customer impact. These are the not-so-secret secrets to pain free maintenance.

2009 seems to be the year for bad examples.

Vern, SwiftWater Telecom

12 days of data center Christmas, data center IP network engineering specials

Troubleshoot your data center by a nose …


Tonight’s post comes from today’s episode in troubleshooting. Sometimes I wish I knew how to use the Force.

I’m a great believer that, even with all the fancy monitoring and diagnostic tools available in the data center, the best troubleshooting tools are the tech’s own senses: sight, smell, and hearing. Sometimes, though, even the fancy tools don’t help.

Today’s episode occurred during server work in a “micro” data center consisting of 3 cabinets equipped with a small DC power plant, a router, several Ethernet switches, a NAS, and a wild mixture of servers. I was working away at the server console when I heard the distinct snap and sizzle and detected the acrid odor of catastrophically failed electronics. This set in motion an immediate search for the source of the failure.

The unusual part of the failure is that every piece of equipment in the installation remained operational; normally this kind of fault produces a total equipment failure. Add to that the fact that the smoke disappeared quickly in the cooling air flow, and we have a challenge.

So how do we track this down? Due to time limits, I couldn’t inspect everything. I removed and inspected each module of the DC power shelf (n+1!), then inspected the rest of the power gear, the networking gear, and the critical servers. The non-critical servers were shut down until they could be opened and inspected for damage in the morning.

It’s always nice when you can spot a problem before it affects customers, so get familiar with the sights, sounds, and smells of your data center and you’ll be surprised what you can spot before the NOC does!

Vern, SwiftWater Telecom
data center, web hosting, Internet engineering