
The data center in review, top 10 bozos of the year, 2009!

My coined term from my data center analysis and commentary this year: bozosity. Bozosity is a condition brought on by an imbalance of invisible particles known as bozons. This condition causes otherwise competent and sensible people to do incomprehensibly boneheaded things.

The Winners of the 2009 Data Center Bozo Awards are:

1. Microsoft and Danger for the T-Mobile Sidekick data loss debacle. MicroDanger did not win for operating critical storage systems without backups, but for their handling of the aftermath. They immediately announced that all data was lost; by the time they did recover most of it, significant damage had been done to T-Mobile and the Sidekick, leaving everyone involved with a reputation for incompetence.

2. Fisher Plaza for knocking out major Internet services by blowing up an antiquated, obsolete, and improperly maintained electrical system in their data center building. Aluminum bus bars are evil, k?

3. IBM for blowing Air New Zealand out of the water by performing power work during the peak load period of one of Air New Zealand’s busiest travel weekends, unnecessarily running ANZ’s mainframe from a single fallible generator, and taking an inordinate amount of time to restore service.

4. IBM for allowing a state of Texas elections commission storage system in their care to fail because it wasn’t in the contract to back it up.

5. Google for their brilliant example of cascading failure by sequentially overloading every router feeding their Gmail service.

6. Research in Motion for seeing how many BlackBerry back end failures they could squeeze in before the end of the year.

7. Amazon, Rackspace, Google, and a number of other providers who managed to blacken the term cloud computing with multiple reliability problems, most of them self-inflicted. Thanks a heap, guys.

8. DreamHost for giving us a shining example of how NOT to do a major data center migration.

9. The people who operate Sweden’s top level DNS domain for turning loose an untested script and taking the entire thing down. Who knew a few missing dots could be so much trouble?

10. The Perth iX data center in Western Australia for allowing a smoldering mulch bed outside the building to shut down the entire data center because they couldn’t locate a minuscule amount of smoke that was infiltrating the building and setting off an overly sensitive detection system.

Finally, I’d like to add a “dishonorable mention” award to FedEx for turning overnight delivery of a critical part into 3 days and nearly sticking me with working in the data center overnight Christmas Eve.

Looks like we survived the year but it sure wasn’t pretty.

Vern, SwiftWater Telecom

Oversimplifying the data center ….

Today I happened to wander across this article on simplifying for a happier data center. It’s a nice sentiment but I think it misses the correct definition of simplify.

Once you get past the new age happy babble, the gist of the article is to simplify by refusing “features”. While it’s true that badly implemented features sometimes lead to headaches from buggy results, rejecting new features that may make life easier in the name of simplicity is a non-starter.

So, what is the secret to simplifying the data center? Simple: identify which items are most likely to be a headache and eliminate them or use alternatives. A highly complex system that never gives a problem is not the problem. What do I classify under this heading? Any mechanical infrastructure (chillers, legacy HVAC, generators, etc.), and any electrical infrastructure that has the potential to fail catastrophically, or where a catastrophic failure can turn the loss of a single redundant system into a single point of failure for the entire facility (consider the recent rash of transformer explosions and fires in data centers).

So, how do we simplify these things? Try augmenting legacy cooling with free air cooling (or moving to free air cooling completely). This will reduce the chances of a service impacting cooling failure dramatically, as well as reducing maintenance requirements to virtually nil. For generators, consider alternative green power sources or long run time static backup systems. My rule of thumb is, any time you introduce an internal combustion engine into the picture, reliability goes down and headaches go up (just ask IBM and Air New Zealand).
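To put a rough number on the free air cooling suggestion above, here’s a back-of-the-envelope sketch. The temperature data and the 27 °C intake limit are illustrative assumptions (loosely in the spirit of ASHRAE allowable ranges), not figures from any real facility:

```python
import math

# Hypothetical sketch: estimate what fraction of the year outside air alone
# can cool the data center, given a maximum acceptable intake temperature.
def free_cooling_fraction(hourly_temps_c, max_intake_c=27.0):
    """Return the fraction of hours where free air cooling suffices."""
    usable = sum(1 for t in hourly_temps_c if t <= max_intake_c)
    return usable / len(hourly_temps_c)

# Made-up year of hourly readings: a mild climate swinging around 12 C.
temps = [12 + 18 * math.sin(2 * math.pi * h / 8760) for h in range(8760)]
print(f"Free cooling covers {free_cooling_fraction(temps):.0%} of the year")
```

In a mild climate like this made-up one, outside air carries the load for the large majority of hours, which is why augmenting (rather than immediately replacing) legacy cooling is the low-risk first step.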

As I’ve written in previous posts, keep oil filled transformers outside and away from the facility. If you are not operating your own local generation, it’s hard to avoid them, so know in advance they ARE going to be a headache sooner or later and act to minimize it.

Of course, from the internal data center standpoint, good network design that doesn’t add a lot of superfluous equipment and the most efficient use of servers possible never go wrong either. Most of this is just good engineering for maximum reliability of the data center systems.

There you go, data center simplicity in a nutshell, and you didn’t even have to climb to a mountaintop guru to get it!

Vern, SwiftWater Telecom

data center facility engineering

IBM data center face plants for Air New Zealand, take 2

Some more details have just come out on the recent IBM data center failure that blew Air New Zealand’s systems out of the water. Believe me, it doesn’t make it look any better, but it is instructive.

Apparently IBM was running Air New Zealand’s mainframe from a single backup generator while performing maintenance on the UPS system. A malfunctioning oil pressure sensor led to a generator shutdown, and out went the lights. It’s also worth noting that IBM chose to do this during a high air traffic holiday weekend.

The first thing to wonder is: what in the world are they running that they don’t have a maintenance bypass for the UPS? And what the heck are they using for transfer switches? The smart procedure in a case like this would be either to risk a short run on unprotected main power via a maintenance bypass around the UPS, or to run from the generator with things set up to transfer to unprotected main power in case of generator failure. Instead, the only bet was on a fallible piece of machinery, and a $10 part made it a bad bet.
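The second option above amounts to a very simple priority rule, which can be sketched in a few lines. This is an illustrative model of the transfer logic only; the function and state names are hypothetical and don’t reflect any real transfer switch vendor’s interface:

```python
# Hypothetical sketch of the transfer logic described above: carry the load
# on generator during UPS maintenance, but keep an automatic transfer back
# to (unprotected) utility power armed in case the generator faults.
def select_source(generator_ok, utility_ok):
    """Pick a power source, with raw utility as the armed fallback."""
    if generator_ok:
        return "generator"
    if utility_ok:
        return "utility"   # unprotected, but the lights stay on
    return "dark"          # the Air New Zealand outcome

# Normal maintenance window: the generator carries the load.
assert select_source(generator_ok=True, utility_ok=True) == "generator"
# Oil pressure sensor trips the generator: fall back to raw utility power.
assert select_source(generator_ok=False, utility_ok=True) == "utility"
# Betting on the generator alone, with no fallback armed:
assert select_source(generator_ok=False, utility_ok=False) == "dark"
```

The point of the sketch is that the fallback path costs nothing while the generator is healthy; it only matters in exactly the failure mode IBM hit.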

The second thing was that they obviously didn’t clear the work (and the risk) with Air New Zealand. It was completely boneheaded scheduling for that sort of risk, and all parties should have been aware of the potential for service affecting problems so as to be prepared to deal with them if they cropped up.

So what do we have here? Failure to recognize the critical nature of the system at hand; failure to take all available precautions to protect it; risking everything on a single point of failure that could be expected to produce exactly this kind of event; failure to notify the customer of the maintenance; failure to get customer buy-in on the schedule and procedure; failure to restore the system to operation promptly. I can’t think of anything IBM did right in the whole process.

Handle your critical data center power this way and you too can look like a bozo.

Vern, SwiftWater Telecom


IBM data center performs face plant in New Zealand

IBM has turned in another pathetic performance, blowing Air New Zealand’s computer systems right out of the water. The inability of data center providers to keep the lights on is getting ridiculous.

According to the stories, the root cause of the outage was a generator failure during maintenance work. Now it’s one thing to have an unanticipated failure in operation, but a failure during maintenance that requires 6 hours to resolve is just poor contingency planning. Imagine how happy the customer will be when they learn they were down 6 hours because of something you specifically did.

It seems like the power competency of data center operators is getting worse and worse, given the string of high profile catastrophic failures this year. Everything from human screwups (The Planet) to disintegrating infrastructure (Fisher Plaza).

The trend that concerns me the most is the move toward cheaper, “micro run time” power backup such as flywheels. Where traditional UPS systems typically run 30 minutes, flywheel systems run less than 1 minute. They certainly work fine if everything goes right, but if anything goes wrong, there’s zero time to deal with it and you’re going down uncontrolled.

Granted, the flywheel UPS may be greener, since it doesn’t contain lead the way battery strings do, but its one-shot-or-nothing operation is seriously risky.
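The runtime gap is easy to see with simple arithmetic. The load and storage figures below are made-up illustrative numbers, not measurements of any real product:

```python
# Illustrative back-of-the-envelope ride-through comparison.
def runtime_minutes(stored_kwh, load_kw):
    """Minutes of ride-through for a given load from stored energy."""
    return stored_kwh / load_kw * 60

load_kw = 100.0       # hypothetical critical load

flywheel_kwh = 0.5    # flywheel modules store energy measured in seconds
battery_kwh = 400.0   # a large battery string sized for hours of run time

print(runtime_minutes(flywheel_kwh, load_kw))  # well under a minute
print(runtime_minutes(battery_kwh, load_kw))   # several hours
```

With numbers like these, the flywheel gives you tens of seconds to ride through a transfer that works perfectly, while the battery string buys hours to diagnose and fix whatever went wrong.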

We take a different approach. We combine the green benefits of data center DC power with battery strings big enough to provide hours of run time, not minutes or seconds. It simply isn’t worth taking the risk.

Sometimes things go so bad there isn’t anything you can do, but at least give yourself a chance. It will put you ahead of the big boys.

Vern, SwiftWater Telecom

data center, web hosting, Internet engineering