Tag Archives: catastrophe

Failure to transfer: Bumbling data center power reliability, the iWeb outage.


Vern Burke, SwiftWater Telecom
Biddeford, ME

I’ve just been reading about the recent iWeb data center power failure. Lousy power design and botched operations strike again.

Even though the specifics of iWeb’s data center power configuration weren’t revealed, we can tell a lot from what actually happened. Due to a nearby fire, the data center operators made the decision to shift the facility to emergency power (an entirely reasonable move). The transfer switch serving one of 3 generators failed to transfer, leaving one third of the data center dark when UPS batteries ran out. Where do I start on the boneheaded tricks on this one?

First, we know that the 3 generators were allocated 1 to each third of the facility. This means no generator redundancy. It sounds good to say “we have 3 generators!” until you find out that they’re not being operated in parallel with at least 1 spare (n+1). Right idea, a total swing and whiff on the execution.
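To put rough numbers on the difference, here is a minimal sketch comparing the two layouts. It assumes every generator (together with its transfer switch) fails to pick up load independently with the same probability; the 5% figure and the function names are illustrative assumptions, not anything reported about iWeb.

```python
from math import comb


def p_outage_zoned(zones: int, p_fail: float) -> float:
    """Chance that at least one zone goes dark when each zone
    depends on exactly one generator (no redundancy)."""
    return 1.0 - (1.0 - p_fail) ** zones


def p_outage_parallel(needed: int, spares: int, p_fail: float) -> float:
    """Chance that fewer than `needed` generators come up when
    `needed + spares` units share one paralleled bus."""
    total = needed + spares
    # Capacity falls short only if MORE than `spares` units fail.
    return sum(
        comb(total, k) * p_fail**k * (1.0 - p_fail) ** (total - k)
        for k in range(spares + 1, total + 1)
    )


if __name__ == "__main__":
    p = 0.05  # assumed per-unit failure probability (illustrative only)
    print(f"3 zoned generators: {p_outage_zoned(3, p):.4f}")
    print(f"3+1 in parallel:    {p_outage_parallel(3, 1, p):.4f}")
```

Under those assumptions the zoned layout loses part of the facility roughly 14% of the time, against about 1.4% for 3+1 in parallel. Real failures are not perfectly independent (paralleling gear can itself fail), so treat this as a way to compare designs, not a prediction.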

Second, it’s apparent that there was no manual bypass for the failed transfer switch. Were they expecting to have to shut down the whole 1/3 of the facility if they ever needed to work on that transfer switch? Dealing with a failed transfer switch shouldn’t be any more difficult than sending someone down to the power room to transfer the power manually.

Third, if they actually did have a manual bypass, were the data center operators informed by the monitoring systems that that section of the data center was still running from UPS and there was enough run time from battery to get someone to the power room to pull the manual bypass? This is the big problem I have with super short run time backup power such as flywheel UPS. If things don’t go absolutely perfectly in the 15 seconds of runtime you get, you don’t get a chance for a manual fix, you’re going down, period. Of course, splitting the generators into separate “zones” makes the short runtime problem far worse, since it’s much more likely that you’re going to have a total failure with a single generator.

It’s apparent from the article that a number of big name providers are doing a similarly lousy job with their backup power redundancy, judging by four transfer switch failures this year with major loss of data center services each time. It’s really a rather pathetic performance.

So, what’s the takeaway from all of this?

1. If you’re going to run multiple generators, run them in parallel and at least n+1. I don’t care how many generators you have, if you’re allocating single generators to single zones, you’re vulnerable.

2. If you’re not going to run the generators in parallel, at least give enough run time from the batteries to deal with the problems you know are going to come up. I don’t care how often you test, if you’re running single generators, failure is going to happen (with this configuration, they could easily have had this happen during a test!).

3. Make sure there’s a manual bypass for automatic transfer switches and that your operations people have the monitoring and the procedure to know when to pull it.
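Points 2 and 3 boil down to one check the monitoring should be making on every transfer: did the load actually reach the generator, and if not, is there enough battery left for someone to reach the power room? A minimal sketch of that check follows. The `TransferStatus` fields, the 1.5x safety margin and the message strings are hypothetical names and values chosen for illustration, not any vendor's monitoring API.

```python
from dataclasses import dataclass


@dataclass
class TransferStatus:
    on_generator: bool        # did the transfer switch actually transfer?
    battery_runtime_s: float  # estimated UPS runtime remaining, in seconds


def bypass_action(status: TransferStatus,
                  time_to_bypass_s: float,
                  safety_margin: float = 1.5) -> str:
    """Decide what operators should be told after a transfer attempt."""
    if status.on_generator:
        return "OK: load is on generator"
    # Require headroom beyond the bare walk-to-the-power-room time.
    if status.battery_runtime_s >= time_to_bypass_s * safety_margin:
        return "ALERT: transfer failed, dispatch operator to pull manual bypass"
    return "CRITICAL: transfer failed, not enough runtime for manual bypass"


if __name__ == "__main__":
    walk_time = 300.0  # assumed 5 minutes to reach the bypass (illustrative)
    print(bypass_action(TransferStatus(False, 15.0), walk_time))   # flywheel-class runtime
    print(bypass_action(TransferStatus(False, 900.0), walk_time))  # 15 minutes of battery
```

With 15 seconds of runtime the check can only ever report that it is too late, which is the whole argument for sizing the batteries around the manual fix.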

In a substantially sized data center, the consequences of failing to transfer are a lot worse than doing things right the first time.

iWeb, data center bozos of the week (squeaky red noses are on the way!).

Email or call me or visit the SwiftWater Telecom web site for green data center services today.


Thursday data center tidbits: Chaos at Facebook (again)


For more than 5 hours now, developers have been unable to make changes to the Facebook Integration settings for their apps. Attempts to change these settings return:

Sorry, something went wrong.

We’re working on getting this fixed as soon as we can.

This failure doesn’t seem to be affecting apps that are currently running but it has dragged a fair amount of app development right down to a total stop.

This failure comes close behind the recent major Facebook outage caused by a runaway software integrity checker.

Vern

SwiftWater Telecom

Monday data center tidbits: Unhyping the cloud, Facebook INsanity checks


First up today is a piece about cutting out the cloud computing hype. The problem isn’t so much the hype over cloud computing as it is the rampant outright misuse of the term cloud, attaching it to things that aren’t even remotely cloudy in an attempt to ride cloud computing’s coattails without actually making the effort to DO cloud computing.

Next up is the recent Facebook service fiasco. Getting your system to mark its own code as invalid and not hand the problem to a human to validate before taking radical action is especially brain dead. On top of that, now you have an error compounded on top of the original error and you have a cascading failure, all because of no sanity checking and no break for human input. Facebook gets our data center bozos of the day award for trusting too much in automation and then blowing it.
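For what a “break for human input” might look like, here is a minimal sketch of a sanity gate in front of an automated fixer. It is a generic illustration, not a description of how Facebook’s system works; the function name, the return values and the 5% threshold are all hypothetical.

```python
def decide_action(total_items: int,
                  flagged_invalid: int,
                  max_auto_fraction: float = 0.05) -> str:
    """Return 'no_action', 'auto_fix', or 'escalate_to_human'.

    If a checker suddenly flags a large share of the system as invalid,
    the checker itself is the more likely culprit, so stop and ask a person.
    """
    if total_items <= 0:
        raise ValueError("total_items must be positive")
    if flagged_invalid == 0:
        return "no_action"
    fraction = flagged_invalid / total_items
    if fraction > max_auto_fraction:
        return "escalate_to_human"
    return "auto_fix"


if __name__ == "__main__":
    print(decide_action(1000, 3))    # a few bad entries: safe to fix automatically
    print(decide_action(1000, 600))  # most of the system "invalid": hand to a human
```

The design choice is simply that automation is allowed to make small corrections on its own, while anything radical waits for a person.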


Vern


Friday data center tidbits: #datacenter failure excuses


Ok, so this one’s not so much an excuse as “bleep happens”. Northrop Grumman has apologized for the last major extended Virginia data center failure, sort of. Published accounts attribute the chaos to multiple failures of primary and secondary storage systems. From the article we get:

“In its apology, Northrop Grumman … went on to say that problems of this sort are not unusual with large technology transformation programs.”

Um, yes, yes they are, when the program is being run competently. I’ve seen this excuse used a lot recently and it’s the one thing I could think of that will NOT inspire customer confidence in you. Northrop Grumman, this week’s winner of the special “bleep happens” bozo award.

In another example of how NOT to run things, Microsoft blew up Hotmail for at least 16 hours on Thursday. The automatic response “well it only affected a ‘small number’ of users” is nearly as bad as the Northrop Grumman “bleep happens” defense. Call this one “well at least it was only small scale bleep that happened”. Of course, how “small” can a 16 hour service outage really be?


Vern


Wednesday data center tidbits: VMware tap dances, RIPE and Duke make excuses


The funniest thing I’ve read all week is the piece about VMware dissing bare metal desktop hypervisors. The sequence sort of goes like this:

1. We promised a bare metal desktop hypervisor.

2. Wow, this is harder than it looks.

3. Well, the hardware on the average PC wouldn’t be compatible anyways so we don’t REALLY need to do this.

4. If we WERE going to do it, we’d do it better than those pansies over at Citrix anyways because we’re the experts.

When you consider I can load a bare metal server class hypervisor on regular PC hardware (Xen) without a problem and Citrix already has their bare metal client hypervisor, VMware comes off sounding like a petulant child.

Next up is a follow-on to the RIPE/Duke BGP routing fiasco. It’s nice to know that it was a Cisco router bug that caused this “experiment” to go out of control but it doesn’t change the fact that NOBODY should have been feeding “experimental” BGP announcements out to the live Internet, period. Gives new meaning to the term irresponsible.


Vern


Friday data center tidbits: Virginia bozos the data center


First up today is the piece about the state of Virginia having a major data center outage. Here’s yet another catastrophe that’s the result of poorly planned and executed maintenance work (technicians were “checking for faulty storage equipment”? Do you not know whether it’s faulty BEFORE you start screwing around with it?). If there was even a chance that this could happen, it should have been done in a maintenance window well outside of business hours. On the other hand, it shouldn’t be surprising, considering this is the bunch that had rampant failures in their network because they didn’t bother to specify any redundancy in the contract.

The head of the Virginia Information Technology Agency is quoted as saying sometimes failures happen no matter what you do. Not the kind of thing you want to hear from your IT people after doing something glaringly dumb.

The shipment of multicolored wigs and red squeaky noses is on its way to the Virginia Information Technology Agency, data center bozos of the week!

Email or call me or visit the SwiftWater Telecom web site for RELIABLE green data center services today.

Vern


Thursday data center tidbits: “ventilate” the server, Microsoft bumbles the cloud


First up today is the piece about an employee getting drunk and shooting up his company’s server with a .45. I confess to having had the urge to kick a server from time to time but never to shoot one. If this isn’t a call for cloud computing, I don’t know what is.

Next up is Microsoft botching their cloud computing service for two hours. I don’t know what to add to this except, thanks for doing your part to add to the perception that cloud computing is unreliable, twits.


Vern


The top data center operations mistake.


I was just reading about the top 10 data center operations mistakes and I noticed they completely missed by far the most important one of all.

It’s certainly true that lack of proper procedures, training to properly follow the procedures, and training to make sound decisions when the procedure isn’t up to the situation are all important to smooth data center operations. The military wisdom that no plan of battle ever survives contact with the enemy frequently comes into play. The most beautifully crafted and approved procedures don’t mean a thing when it comes to an unanticipated situation and nobody involved can make a smart decision on their own.

The biggest mistake by far, in my opinion, and the one we’ve seen the most examples of in the last several years of data center failures, is failure to analyze risks sensibly. A large percentage of these outages have been the direct result of high risk, zero gain actions.

One good example of this is the irresistible urge to mess around inside of live electrical equipment. There is little to nothing you can do in a data center that is more dangerous and higher risk than working on exposed high voltage power, both to the health and safety of the person doing the working and the operation of the data center itself. The result of screwing this up can be spectacularly catastrophic (witness the transformer explosion at The Planet’s data center in 2009 due to a miswired added high voltage circuit).

Given the hazard, you would think that data centers would reserve that for only the most extreme need. However, many of those failures have come from work done for the most trivial of purposes. One of the bad ones from this year involved a “maintenance” operation to manually check phase rotation in a panel (the correct A-B-C ordering of 3 phase power). Since phase rotation is pertinent only to 3 phase electric motors, it is a ZERO issue for data center IT equipment and far from justifiable, given the risk.

It comes down to a healthy dose of common sense. If you’re going to take all of your data center generator capacity offline to do something trivial and you only have 15 second run time flywheel UPS systems, that’s probably a BAD choice. If you know you can restore generator capacity in far less time than the run time of your UPS, that makes a lot more sense.

Don’t do risky things for no gain and you’ll avoid the nightmares of the most preventable data center operations mistakes. It’s just that easy.


Vern


Extreme weather and the data center.


I’ve been sitting here this evening operating the data center under extreme weather protocols due to wild electrical storms and tornado warnings. I thought I’d take a few minutes and discuss how to protect a data center during extreme weather events.

Whether you subscribe to the idea of global warming or not, it’s apparent that this has already been a bumper year for violent weather. High winds, lightning, heavy rain, none of it is very conducive to keeping the data center up and operating. Obviously, being able to shut down is the best protection (this is where cloud computing really shines, the capability of moving services out of harm’s way), but what do you do when you can’t just shut it all down?

Here’s my weather protocol for tonight:

1. Identify critical services and the capacity needed to minimally run them. In this case, I was able to substantially reduce data center power load by shutting down redundant services and shutting down cloud computing capacity that wasn’t required to keep the critical services operating. Remember, reduced power load means extended run time on the backup power.

2. Transfer workloads to an alternate data center.

3. Reduce cooling capacity to reflect the lower data center power load (less load, more run time!). Ensure that there is no water or wind infiltration via cooling system intake vents. In my case, I change the free air cooling system to passive intake to avoid blowing in water.

4. Secure all windows and doors against high winds. If an area can’t be reasonably secured, such as an area with large, vulnerable, plate glass windows, secure inner doors to isolate the vulnerable area.

5. Reduce power equipment capacity equivalent to power load reduction. Open breakers or unjack equipment to isolate it from any damage from extreme power events, such as a close lightning hit on the AC commercial power.

6. Make sure that emergency supplies and emergency lighting are all up to par.

7. Know what to grab and take and how to secure the data center in case the situation is bad enough to require abandoning the data center.
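The “less load, more run time” arithmetic behind steps 1 and 3 can be sketched in a few lines. This is a rough linear estimate only; the battery capacity and load figures are made-up illustrative numbers, and `runtime_minutes` is a hypothetical helper, not part of any monitoring product.

```python
def runtime_minutes(usable_energy_kwh: float,
                    it_load_kw: float,
                    cooling_load_kw: float) -> float:
    """Rough backup runtime: stored energy divided by total draw."""
    total_kw = it_load_kw + cooling_load_kw
    if total_kw <= 0:
        raise ValueError("total load must be positive")
    return usable_energy_kwh / total_kw * 60.0


if __name__ == "__main__":
    battery_kwh = 20.0  # assumed usable battery energy (illustrative)
    normal = runtime_minutes(battery_kwh, it_load_kw=30.0, cooling_load_kw=10.0)
    storm = runtime_minutes(battery_kwh, it_load_kw=18.0, cooling_load_kw=7.0)
    print(f"normal load: {normal:.0f} min, storm protocol: {storm:.0f} min")
```

In this made-up case, shedding 15 kW of combined IT and cooling load stretches the estimate from about 30 minutes to about 48. Actual battery behavior is not perfectly linear, so check the result against the UPS vendor’s runtime tables.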

My previous post on dealing with a data center flood applies to this as well.

Follow these protocols or use them as a starting point for your own and you’ll find that your data center can make it through almost anything Mother Nature can throw at you intact.


Vern


Thursday data center tidbits: data center quality control?


From the phenomenally bad idea of the day file, we get this notion to put helmet cams on data center people and have marketing or finance audit them for “quality control”. You’re really going to give an organization as usually tech clueless as marketing or finance direct control over your data center people? Let me know how that works out for you.

In the wake of yesterday’s Intuit data center power failure, I saw this article come up via Twitter. What’s funny about this is the part about their new “state of the art” data center. If you have a new “state of the art” data center, how can you possibly excuse blowing your service up twice in 30 days with catastrophic power screwups? Maybe they should ask the guys from the first piece how to do data center quality control?

Email or call me or visit the SwiftWater Telecom web site for cloud computing services.

Vern