IBM data center performs face plant in New Zealand


IBM has turned in another pathetic performance, blowing Air New Zealand’s computer systems right out of the water. The inability of data center providers to keep the lights on is getting ridiculous.

According to the stories, the root cause of the outage was a generator failure during maintenance work. Now it’s one thing to have an unanticipated failure in operation, but having a failure during maintenance that requires 6 hours to resolve is just poor contingency planning. Imagine how happy the customer will be when they know they were down 6 hours because of something you specifically did.

It seems like the power competency of data center operators is getting worse and worse, given the string of high profile catastrophic failures this year, everything from human screwups (The Planet) to disintegrating infrastructure (Fisher Plaza).

The trend that concerns me the most is towards cheaper, “micro run time” power backup such as flywheels. Where traditional UPS systems typically run 30 minutes, flywheel systems run less than 1 minute. They certainly work fine if everything goes right, but if anything goes wrong, there’s zero time to deal with it and you’re going down uncontrolled.
Granted, the flywheel UPS may be more green, since it doesn’t require lead the way battery strings do, but its one-shot-or-nothing operation is seriously risky.

We take a different approach on this. We combine the green benefits of data center DC power with battery strings big enough to provide hours of run time, not minutes or seconds. It simply isn’t worth taking the risk.
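
To put rough numbers on that risk, here is a minimal Python sketch that compares how much run time is left once the generator is finally carrying the load. The names (`BackupSystem`, `margin_s`) and the timing figures are my own assumptions for illustration: a 10 second clean generator start, 10 minutes for someone to get to the power room and start it by hand, a flywheel at its upper bound of 1 minute, and 6 hours standing in for "hours of run time".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupSystem:
    name: str
    runtime_s: float  # seconds the system can carry the full load


def margin_s(system: BackupSystem, generator_start_s: float,
             manual_fix_s: float = 0.0) -> float:
    """Run time left over once the generator is carrying the load.

    A negative result means the load drops before the generator is online.
    """
    return system.runtime_s - (generator_start_s + manual_fix_s)


# Run times as described in the post; 6 hours is an assumed example of "hours".
SYSTEMS = [
    BackupSystem("flywheel", 60),
    BackupSystem("traditional UPS", 30 * 60),
    BackupSystem("DC plant with battery strings", 6 * 3600),
]

# Hypothetical timings, for illustration only.
GENERATOR_START_S = 10   # clean automatic start
MANUAL_FIX_S = 10 * 60   # walk to the power room and start the generator by hand

if __name__ == "__main__":
    for s in SYSTEMS:
        clean = margin_s(s, GENERATOR_START_S)
        faulted = margin_s(s, GENERATOR_START_S, MANUAL_FIX_S)
        verdict = "rides through" if faulted >= 0 else "goes down uncontrolled"
        print(f"{s.name}: clean start margin {clean:.0f} s, "
              f"failed auto-start margin {faulted:.0f} s ({verdict})")
```

With these assumed timings, every system survives a clean generator start, but only the battery-backed ones leave room for a human to fix a failed automatic start. Plug in your own numbers; the shape of the result is what matters.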

Sometimes things go so bad there isn’t anything you can do, but at least give yourself a chance. It will put you ahead of the big boys.

Vern, SwiftWater Telecom

data center, web hosting, Internet engineering

8 responses to “IBM data center performs face plant in New Zealand”

  1. What you are saying about flywheels contradicts:

    15 Seconds Versus 15 Minutes: Designing for High Availability

    The paradigm of requiring 15 minutes or more of battery backup for mission critical UPS system reliability is an antiquated and flawed perception. When properly integrated and maintained, standby generators can and will reliably support the critical load in 10 seconds or less. This challenges the idea lead-acid batteries and extended backup time are necessary. The growing intolerance to a “graceful shutdown” also renders 15 minutes of backup moot. The UPS system can be designed with much higher reliability and predictability by using more reliable backup energy methods and applying proper design techniques. This paper discusses the issue and methods of implementing a short ridethrough system with higher reliability and predictability than with traditional methods.

    Some engineers and power system designers feel there is a significant difference between typical applications of 25 to 30 seconds of reserve time as with a flywheel UPS system, and 15 minutes as in a conventional static UPS and battery system. There is a reliability difference, but the actual advantage may surprise many.

    Source:
    http://www.equipmentprotectionmagazine.com/images/Sept09.pdf

    • It’s absolutely true that a perfectly operating generator will start and generate fast enough to deal with only 15 seconds of hold up time. The problem is, what happens when it doesn’t? I’ve seen stories from people who had generator EPO buttons left in after maintenance by human error. Couple that with the fact that, even with the best-cared-for generator, you’re still going to have things fail like that tiny oil pressure sensor on the IBM generator. For me, running minimum hold up time on the UPS is playing Russian roulette. 99 times out of 100 it will be fine; number 100 will be pure chaos.

      Vern

  2. Roberto Cazador

    Vern, thanks for your response. I ran into a report for a battery UPS system which supplied up to 13 minutes of back up power and yet the generator failed to start during this time.

    Tata Communications London outage
    http://www.datacenterknowledge.com/archives/2009/11/30/european-data-center-revenue-may-double/

    How do I know Tata uses battery:
    File Format: Shockwave Flash
    Target of 100% Network and Power Uptime Tata Communications IDC SLA Terms … End Testing Battery Health IDC Manager UPS Room Battery Bank Source of Power …
    http://www.tatacommunications.com/map-facilities/tata_facilities.swf

    Tata has operations worldwide, so the damage must have been sky-high.

    Roberto

    • Sometimes, nothing you can do can head off the disaster, no matter how much time you have to jump on it (most of my DC data center power plants are engineered for 6 to 8 hours run time). 13 minutes should be enough time to deal with screwup-type problems. If someone can run down to the power room and pull an EPO button, or manually start a generator with failed automatic controls, and head off a disaster, I think that’s worth the expense of providing more than 15 seconds of carry over.

      Vern, SwiftWater Telecom

  3. Roberto Cazador

    Vern, I understand what you are saying, but if you look at a different source of the story, Tata was somehow looking for trouble. Poor planning and communications. Who was monitoring the data center? Heads will likely roll. Wow, 3 generators failed to start.

    Extract follows:

    >>Even more interesting is that TATA appears to have been completely unaware of the power outage until a customer contacted them at 4:45 PM to report that their monitors were showing the data center’s temperature rising. At 6:55 PM a power engineer arrived to find that the UPS batteries were depleted and the three generators failed to start.

    In other words, the data center was dead and stayed that way until systems started coming back online at 7:30 PM. TATA finally called its customers at 9:50 PM to let them know that the utility power was back but was at risk for another 8 hours until the UPS batteries were fully charged.<<
    http://seacliffpartners.com/wordpress/?p=421

    You got some good stories, so I'll keep an eye on your work.

    • That’s absolutely pathetic, not to know that you’ve got a total power outage for more than 2 hours. I don’t know what I can say about all 3 generators failing to start, just shaking my head.

      Vern, SwiftWater Telecom

  4. Roberto Cazador

    I thought I’d pass this along:

    >>Runtime & Tech Alternatives

    One of the toughest decisions data centers face when designing a UPS solution is estimating needed runtime. The norm for enterprises averages between 10 and 20 minutes, but many experts agree that five minutes, and probably even two, is sufficient. The real role of a UPS system is to carry a data center through from power loss to generator startup, and if the generator isn’t up and running in a minute or so, then having additional minutes from a UPS isn’t going to magically fix a defective generator. Organizations can burn considerable cash by maintaining excessive and unnecessary UPS runtime.

    Given that shorter runtimes may be all a data center truly needs, flywheel-based backup products could make sense, especially in larger organizations. Because flywheel UPSes use rotational inertia (kinetic energy) rather than batteries, they’re far easier to maintain, don’t require special ventilation, and have no hazardous material disposal issues. However, flywheel technology has much higher initial costs. The key advantage with flywheel is that it remedies the many tiny power losses data centers sustain over the course of a year. A loss of more than two seconds will cause a battery-based system to cycle, which reduces the longevity of the battery. This is why UPS batteries typically have to be replaced after only 24 to 36 months. During micro-outages, though, a flywheel system hardly registers a blip and won’t fall prey to admins having accidentally overlooked a faulty battery.<<
    More from:
    How To
    General Information
    December 4, 2009 • Vol.31 Issue 29
    Page(s) 21 in print issue

    Pick The Best UPS Solution
    Evaluate Your SME’s Needs
    http://www.processor.com/editorial/article.asp?article=articles/P3129/15bp29/15bp29.asp&guid=

    Best regards.

    • That’s true, having extra minutes doesn’t “magically fix” the generator, but it gives the human facilities operators time to analyze the failure and determine an appropriate response. I guess it all has to do with your tolerance for letting your facility go down uncontrolled. For me, I will NOT allow my data centers to go down in anything short of an extended total catastrophe. And yes, I’ve had backup power issues that were perfectly resolvable in the run time of the batteries.

      It’s just not worth it to me to have a mob of angry customers.

      Vern
