Tag Archives: data center power

6 “real” ways to know it’s time to renovate your data center.


I was just reading this piece about 10 ways to tell that your data center is overdue for renovation. Great idea but, unfortunately, that piece was WAY off track, so I’m going to list my 6 ways here.

1. Cooling

You don’t need a fancy, expensive airflow study to get an idea that your data center has cooling issues. A simple walk-through will make significant hot or cold spots very obvious, and significant hot or cold spots mean it’s time to rework things.

2. Space

If you wait until you can’t cram one more piece of gear in, as the article suggests, you’re going to be in a heap of trouble. Make sure all idle equipment is removed and set a reasonable action limit (such as 75% full) to address the space issue BEFORE you run up against the limit.
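Putting a number on that action limit is trivial. Here’s a minimal sketch in Python of the kind of check to run against your rack inventory every month; the rack unit counts are made up for illustration, and the 75% threshold is just the example limit above:

```python
# Minimal sketch: flag the space problem BEFORE you hit the wall.
# The rack unit counts are made-up illustration; feed in your own
# inventory numbers and adjust the action limit to suit.

ACTION_LIMIT = 0.75  # start planning when the room is 75% full

def space_check(rack_units_used: int, rack_units_total: int) -> None:
    utilization = rack_units_used / rack_units_total
    print(f"Space utilization: {utilization:.0%}")
    if utilization >= ACTION_LIMIT:
        print("Action limit reached: start the expansion or renovation plan now.")

space_check(rack_units_used=1620, rack_units_total=2000)  # 81%, over the limit
```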

3. Power

Contrary to the article, reading equipment load information is NOT a sign that your data center needs to be renovated; it’s just good practice. Nuisance trips of breakers and the need to reroute power circuits from other areas of the data center are a dead giveaway that the original power plan for the data center needs a serious overhaul.
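A hedged sketch of the kind of routine load check I mean, in Python; the circuits and readings are invented, and the 80% figure is the usual continuous-load rule of thumb, not a substitute for your electrician’s judgment:

```python
# Minimal sketch: flag circuits that are nuisance-trip candidates.
# The readings would come from metered PDUs or branch circuit monitoring;
# everything listed here is an invented example.

CONTINUOUS_LOAD_LIMIT = 0.80  # common rule of thumb for continuous loads

circuits = [
    # (circuit, breaker rating in amps, measured load in amps)
    ("Row A panel 1 ckt 3", 20, 17.5),
    ("Row A panel 1 ckt 4", 20, 9.2),
    ("Row B panel 2 ckt 1", 30, 26.0),
]

for name, rating, load in circuits:
    if load > rating * CONTINUOUS_LOAD_LIMIT:
        print(f"{name}: {load:.1f}A on a {rating}A breaker - overhaul candidate")
```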

4. Strategy

Contrary to what the article would have you believe, you can’t create an IT strategy without considering technologies. First, inventory and evaluate the existing data center, identify where it’s falling short of meeting business requirements and goals, and then consider the technology to get it back on track. Every step in its proper order.

5. Performance

When it becomes apparent that the existing data center infrastructure is going to fail on any of the first four points with anticipated changes coming up, it’s time to retire it. Don’t let the problems occur and THEN fix them.

6. Organization and documentation

If touching any part of the data center is a major crisis because of overcomplicated systems and/or inaccurate, incomplete, or just plain missing documentation, it’s a clear signal to get things revamped and under control before they cause a complete disaster.

Failure to transfer: Bumbling data center power reliability, the iWeb outage.


Vern Burke, SwiftWater Telecom
Biddeford, ME

I’ve just been reading about the recent iWeb data center power failure. Lousy power design and botched operations strike again.

Even though the specifics of iWeb’s data center power configuration weren’t revealed, we can tell a lot from what actually happened. Due to a nearby fire, the data center operators made the decision to shift the facility to emergency power (an entirely reasonable move). The transfer switch serving one of the 3 generators failed to transfer, leaving one third of the data center dark when the UPS batteries ran out. Where do I start on the boneheaded tricks on this one?

First, we know that the 3 generators were allocated 1 to each third of the facility. This means no generator redundancy. It sounds good to say “we have 3 generators!” until you find out that they’re not being operated in parallel with at least 1 spare (n+1). Right idea, a total swing and whiff on the execution.

Second, it’s apparent that there was no manual bypass for the failed transfer switch. Were they expecting to have to shut down the whole 1/3 of the facility if they ever needed to work on that transfer switch? Dealing with a failed transfer switch shouldn’t be any more difficult than sending someone down to the power room to transfer the power manually.

Third, if they actually did have a manual bypass, were the data center operators informed by the monitoring systems that that section of the data center was still running from UPS, and was there enough run time from the batteries to get someone to the power room to pull the manual bypass? This is the big problem I have with super short run time backup power such as flywheel UPS. If things don’t go absolutely perfectly in the 15 seconds of runtime you get, you don’t get a chance for a manual fix; you’re going down, period. Of course, splitting the generators into separate “zones” makes the short runtime problem far worse, since it’s much more likely that you’re going to have a total failure with a single generator.
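The question the monitoring needs to answer at that moment is simple arithmetic. A rough sketch, with made-up numbers; use the real battery run time curve and the staffing reality of your own facility:

```python
# Rough sketch: does the UPS buy enough time for a human to pull the
# manual bypass? All figures here are made up for illustration only.

def runtime_minutes(usable_battery_kwh: float, load_kw: float) -> float:
    """Crude estimate: usable battery energy divided by the present load."""
    return (usable_battery_kwh / load_kw) * 60

runtime = runtime_minutes(usable_battery_kwh=40.0, load_kw=160.0)  # ~15 minutes
time_to_bypass = 10.0  # minutes to get someone to the power room

print(f"Runtime: {runtime:.0f} min, time to manual bypass: {time_to_bypass:.0f} min")
if runtime <= time_to_bypass:
    print("No margin - that section is going dark.")
```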

It’s apparent from the article that a number of big-name providers are doing a similarly lousy job with their backup power redundancy, judging by four transfer switch failures this year, each with a major loss of data center services. It’s really a rather pathetic performance.

So, what’s the takeaway from all of this?

1. If you’re going to run multiple generators, run them in parallel and at least n+1 (see the sketch after this list). I don’t care how many generators you have; if you’re allocating single generators to single zones, you’re vulnerable.

2. If you’re not going to run the generators in parallel, at least provide enough run time from the batteries to deal with the problems you know are going to come up. I don’t care how often you test; if you’re running single generators, failure is going to happen (with this configuration, they could easily have had this happen during a test!).

3. Make sure there’s a manual bypass for automatic transfer switches and that your operations people have the monitoring and the procedure to know when to pull it.
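For takeaway 1, the n+1 test is nothing fancy: can the paralleled plant still carry the whole facility with the single largest unit out of service? A minimal sketch with illustrative capacities:

```python
# Minimal sketch of the n+1 check: a paralleled generator plant should
# carry the full facility load with its largest single unit failed.
# Capacities and load are illustrative only.

def survives_single_failure(generator_kw: list, facility_load_kw: float) -> bool:
    remaining = sum(generator_kw) - max(generator_kw)  # worst case: biggest unit fails
    return remaining >= facility_load_kw

# Three paralleled 500 kW units on a 1200 kW facility can't lose one;
# four paralleled 500 kW units can.
print(survives_single_failure([500, 500, 500], 1200))       # False
print(survives_single_failure([500, 500, 500, 500], 1200))  # True
```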

In a substantially sized data center, the consequences of failing to transfer cost a lot more than doing things right the first time would have.

iWeb, data center bozos of the week (squeaky red noses are on the way!).

Email or call me or visit the SwiftWater Telecom web site for green data center services today.


Building out the data center the right way.


Tonight I’ve been reading an article about data center building trends. There are some very good points in it and also some things that I think are very wrong. These also explain some things that have mystified me for some time.

Looking ahead 5 years for capacity planning isn’t a bad idea (except that the data center needs to be flexible enough to accommodate the changes that can happen in 5 years), but the whole decision on whether or not to build out data center infrastructure in advance hinges on the penalty for doing so. In short, there’s no penalty for building out passive infrastructure and a big penalty for building out active infrastructure.

I’ve been mystified by the idea that data center PUE (power usage effectiveness) only gets good when a data center is full. Now I understand: this is based on the idea of a data center building out (and operating) 100% of its cooling infrastructure in advance. If you’re only running 20% of your 5-year forecasted server capacity but you have to run 100% of your 5-year forecasted cooling capacity because it’s a monolithic system that’s either on or off, of course your efficiency is going to stink!
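The arithmetic makes the point. PUE is total facility power divided by IT equipment power; here’s a quick Python illustration with made-up numbers for a day-one facility running at 20% of its forecasted IT load:

```python
# PUE = total facility power / IT equipment power. Numbers are made up
# purely to illustrate the monolithic-cooling problem.

def pue(it_kw: float, cooling_kw: float, other_kw: float) -> float:
    return (it_kw + cooling_kw + other_kw) / it_kw

# Monolithic cooling: the full 400 kW cooling plant runs no matter what.
print(round(pue(it_kw=200, cooling_kw=400, other_kw=50), 2))  # 3.25

# Modular cooling scaled to the actual 200 kW IT load.
print(round(pue(it_kw=200, cooling_kw=90, other_kw=50), 2))   # 1.7
```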

The PUE numbers for that kind of arrangement would be pathetic. Of course, as you add servers with the same amount of cooling power usage, the PUE would gradually get better, but who in the world would really WANT to run something that way? (Reference last year’s story about UPS cutting their data center power by switching off 30 out of 65 HVAC units!)

Leaving plenty of extra room for the transformer yard and the generator yard is a great idea (you can’t expand them once you get things built in around them). On the other hand, it would be silly to build out and power up a yard full of transformers that were sitting there doing nothing except chewing up power.

So, what sort of data center infrastructure things can safely be built out far ahead of the actual need? The physical building is a bit of a no-brainer, as long as the active sections can be isolated from the idle (you don’t need to be cooling an entire hall that’s only 10% full).

Passive electrical is another good one. This means entrances, disconnects, breaker panels, distribution cabling, transfer switches. No UPS, no DC power plants unless you’re going to be able to shut them off and leave them off until you really need them.

Passive cooling infrastructure, such as ducts, is another. Take a lesson from UPS: do NOT build out double the HVAC you need and run it all!

Finally, build out the support structures for the transformer and generator yards. Mounting pads, conduit, cabling, so the equipment is all set to drop, hook up, and go.

Don’t kill your data center’s efficiency in return for capacity 5 years from now.

Email or call me or visit the SwiftWater Telecom web site for green data center services today.

Vern


Extreme weather and the data center.


I’ve been sitting here this evening operating the data center under extreme weather protocols due to wild electrical storms and tornado warnings. I thought I’d take a few minutes and discuss how to protect a data center during extreme weather events.

Whether you subscribe to the idea of global warming or not, it’s apparent that this has already been a bumper year for violent weather. High winds, lightning, heavy rain: none of it is very conducive to keeping the data center up and operating. Obviously, being able to shut down is the best protection (this is where cloud computing really shines, with the capability of moving services out of harm’s way), but what do you do when you can’t just shut it all down?

Here’s my weather protocol for tonight:

1. Identify critical services and the capacity needed to minimally run them. In this case, I was able to substantially reduce data center power load by shutting down redundant services and cloud computing capacity that wasn’t required to keep the critical services operating. Remember, reduced power load means extended run time on the backup power (see the sketch after this list).

2. Transfer workloads to an alternate data center.

3. Reduce cooling capacity to reflect the lower data center power load (less load, more run time!). Ensure that there is no water or wind infiltration via the cooling system intake vents. In my case, I changed the free air cooling system to passive intake to avoid blowing in water.

4. Secure all windows and doors against high winds. If an area can’t be reasonably secured, such as an area with large, vulnerable, plate glass windows, secure inner doors to isolate the vulnerable area.

5. Reduce power equipment capacity in line with the power load reduction. Open breakers or unjack equipment to isolate it from any damage from extreme power events, such as a close lightning hit on the commercial AC power.

6. Make sure that emergency supplies and emergency lighting are all up to par.

7. Know what to grab and take and how to secure the data center in case the situation is bad enough to require abandoning the data center.
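Here’s the sketch promised in step 1, showing why shedding non-critical load matters so much for backup run time; the battery capacity and loads are made-up illustrative figures:

```python
# Rough sketch: shedding non-critical load stretches backup run time.
# Battery capacity and loads are invented figures for illustration.

def runtime_hours(usable_battery_kwh: float, load_kw: float) -> float:
    return usable_battery_kwh / load_kw

BATTERY_KWH = 96.0
normal_load_kw = 24.0
storm_load_kw = 10.0  # after shutting down redundant and idle capacity

print(f"Normal load:   {runtime_hours(BATTERY_KWH, normal_load_kw):.1f} h")
print(f"Critical only: {runtime_hours(BATTERY_KWH, storm_load_kw):.1f} h")
```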

My previous post on dealing with a data center flood applies here as well.

Follow these protocols or use them as a starting point for your own and you’ll find that your data center can make it through almost anything Mother Nature can throw at you intact.

Email or call me or visit the SwiftWater Telecom web site for green data center services today.

Vern


Thursday data center tidbits: data center quality control?


From the phenomenally bad idea of the day file, we get this notion to put helmet cams on data center people and have marketing or finance audit them for “quality control”. You’re really going to give an organization as usually tech clueless as marketing or finance direct control over your data center people? Let me know how that works out for you.

In the wake of yesterday’s Intuit data center power failure, I saw this article come up via Twitter. What’s funny about this is the part about their new “state of the art” data center. If you have a new “state of the art” data center, how can you possibly excuse blowing your service up twice in 30 days with catastrophic power screwups? Maybe they should ask the guys from the first piece how to do data center quality control?

Email or call me or visit the SwiftWater Telecom web site for cloud computing services.

Vern

Wednesday data center tidbits: no power backup in the #datacenter?


First up today is a piece about the idea of building a data center with no power backup at all. This is about as boneheaded a thing as I’ve ever seen. Does it REALLY pay to not only duplicate but run extra infrastructure so you can save a bit in equipment costs by letting a data center fail? What about the cost of restoring all the downed equipment? Or the damage to equipment from wild power fluctuations that a battery-backed system (such as the 48V DC power plant in our data center) would absorb? Squeaky red noses to Yahoo on this one.

Next up is a piece about improving data center airflow. What caught my eye was this, “…flowing air from the CRAC unit, through specialized floor tiles and into the server fans…”. Attempting to tout cooling efficiency with a hopelessly obsolete raised floor is an automatic FAIL.

Email or call me or visit the SwiftWater Telecom web site for cloud computing services.

Vern

Extreme weather, #datacenter DC power, and #cloudcomputing.


Or, as the alternate title that comes to mind, “Mother Nature throws a hissy fit”. I’m going to talk in this post about how all of the above link together to affect data center and cloud computing reliability.

This year, it seems like the news is full of stories of violent weather around the country, and it only seems to be getting worse. Even areas that traditionally have had fairly stable weather have been seeing massive storms with damaging winds and flooding rains. For the first time in the 6 years I’ve been in this location (not to mention most of my life in this state), we’ve had 2 major spring/early summer storms with winds in excess of hurricane force.

So, how does this relate to data center and cloud computing reliability? The last storm materialized in only 5 minutes, produced torrential downpours and 100 mph winds, and caused large amounts of havoc with the commercial AC power supply to the data center. I’m a great proponent of green DC power in the data center, so the power distribution is primarily DC, with good quality traditional protection on the AC feeding the rest.

Unfortunately, the AC power excursion events from the severe weather were wild enough that the classic power protection turned out to be inadequate. The cloud computing servers themselves, powered by DC as they are, survived just fine. Both the primary and backup storage systems, powered from the AC, did not.

After several days of cleaning up the mess and getting the cloud restored and back on line, there are a number of takeaways from this.

1. It’s hard to go overboard engineering your data center for extreme weather, whether it’s likely to happen or not.

2. Data center DC power is a LOT more resilient than the best-protected AC power. As a result, all required AC-powered equipment is now on the DC power via inverters. This isn’t as green, of course, but it isolates the equipment much better from radical fluctuations in the data center’s AC supply.

3. In a cloud computing environment, make sure all the parts of the cloud have the same level of resiliency. There’s no point to keeping the front end alive when the back end goes down.

Finally, I’ve talked in a previous post about using DC power with a long run battery string to shift heat load in the data center. A DC power system with a long run time is also great protection against this type of event. No matter how fast or unexpected the severe weather is, simply shut down the rectifiers in minutes, run from the batteries, and you have the ultimate in isolation from AC power excursions.

Or, we could just write Mother Nature a prescription for Prozac.

Email or call me or visit the SwiftWater Telecom web site for cloud computing services.

Vern

Intuit, #datacenter, and #cloudcomputing (the three horsemen of the apocalypse)


I knew it was going to happen just the moment I read about Intuit face-planting their data center and web sites for 36 hours. The anti-cloud-computing crowd are out in force with their mantra that this “proves” that cloud computing is unreliable. What it does prove is that if people can’t come up with a good argument against something, a silly one will do in a pinch.

So, what exactly happened with Intuit? We do know that a power failure as the result of “routine maintenance” took down both primary and backup servers. I haven’t seen a detailed analysis of it yet, but a little deductive reasoning will reveal the likely chain of events.

Unless the data center power design is totally nuts, any power failure that takes out both primary and secondary systems would have to be in the high-voltage primary power coming into the data center (a la the catastrophic power failure at The Planet’s data center in 2009). “Occurred during routine maintenance” is a code phrase that roughly translates as “We were screwing around inside of live power equipment, doing something we didn’t really need to be doing, and someone messed up”. This has been the cause of many data center power failure events over the last year.

Looking at the history of these events, it’s easy to see this has no relationship whatsoever to cloud computing, nor does it reveal any inherent weakness in cloud computing. So, just what does this outage show?

First, the folly of putting all your critical services in one data center.

Second, that it takes a total “smoking hole” disaster to disrupt cloud computing (showing up the lie that cloud is less reliable than a dedicated server).

Third, that Intuit (and other cloud providers) don’t understand that the consequences of failure in a cloud are far higher than the equivalent failure of a dedicated server and their infrastructure has to be designed for that (failure of a single cloud host will take out 10x or more the service that failure of a single dedicated server will).

Fourth, that Intuit (and other cloud providers) don’t take advantage of the features of clouds to automatically restore downed services. Have a hardware failure in our cloud and, as long as any of the cloud is still running, virtual servers will be restored and running in 15 minutes or less.

Fifth, that Intuit (and other cloud providers) fail to correctly assess the risk of doing “routine maintenance” on live data center power equipment.

So, what does this leave us with? Understand that a single cloud server is far more important than a single dedicated server, segment power so that no one failure will kill everything, run backup services in a separate data center, automate cloud disaster recovery, and stop monkeying around inside of live power equipment.
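On the “automate cloud disaster recovery” point, the watchdog logic itself is not complicated. Here’s a hedged sketch of the general idea; list_hosts, list_vms, and start_vm are hypothetical placeholders, not the API of any particular cloud platform:

```python
# Hedged sketch of an automatic restore loop: if a cloud host dies,
# restart its virtual servers on whatever healthy capacity is left.
# list_hosts, list_vms, and start_vm are hypothetical placeholders.
import time

def restore_loop(list_hosts, list_vms, start_vm, poll_seconds=60):
    while True:
        healthy = [h for h in list_hosts() if h.is_up]
        for vm in list_vms():
            if vm.host is not None and not vm.host.is_up and healthy:
                # Restart the orphaned VM on the least loaded surviving host.
                target = min(healthy, key=lambda h: h.load)
                start_vm(vm, target)
        time.sleep(poll_seconds)
```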

Would you look at that, it isn’t cloud computing’s fault after all.

Email or call me or visit the SwiftWater Telecom web site for cloud computing services.

Vern

Friday data center tidbits: ghost servers, #datacenter outages, and more!


First up is a piece about what does and what doesn’t work with a green IT strategy. The thing that stood out in this for me was:

“Data center audits inevitably turn up servers with no connections to network cables that remain turned on.”

Anyone who disconnects a server from the data center network and leaves it powered up needs a good swat to the back of the head.

The next piece up is a post from James Hamilton about PUE. As much as I’ve talked here about the flaws in PUE, it’s certainly of some use as an internal metric. The big problem with PUE is trying to compare different data centers based on it, as well as giving out official superiority awards based on it (Energy Star for data centers). It’s probably the most misused metric ever invented.

From the “bozo is contagious” file, we have the recent Bluehost data center outage in Provo, UT. Kudos to Bluehost for actually having a power backup system that worked; squeaky red noses to the local telecom carrier for not only losing Bluehost’s Internet connections but phone service to the whole city as well.

Vern


Wednesday data center tidbits: HP, manure, #cloud computing, and more!


First up today is a piece about HP doing research on using methane from cow manure to power a data center. I’ve seen HP turn things brown and smelly before, but at least they’re admitting it this time. I can just see the slogan for this one: “BROWN is the new GREEN!”.

From the same piece, we get:

“I don’t think you want to ship the manure to Silicon Valley.”

I think that would be slightly redundant.

The next piece is about cloud computing providers failing to get the message across to small to medium enterprises. The telling quote here is this:

“The UK market seems to be confused by jargon and synonymous terminology and appears to have been susceptible to scaremongering by on-premise providers.”

There we go, spread enough doubt and confusion and you can scare anyone away from anything.

Finally we have the piece on building ROI from cloud computing. Under the section “What can I not put in the cloud?”, we get:

“Answers typically include UNIX systems, mainframes …”

Unix can’t go on a cloud? Considering that there are a number of open-source Unixes that are perfectly happy on a cloud, I’d call this one a miss. As for mainframes, slap an emulator on a Linux virtual machine and your 1960s mainframe app can live on a cloud too!

Email or call me or visit the SwiftWater Telecom web site for cloud computing services.

Vern
