Tag Archives: data center reliability

Jackass: data center edition


Vern Burke, SwiftWater Telecom
Biddeford, Maine

I was just reading a piece about data center temperature levels, cooling, and the ASHRAE recommendations. It’s not the ASHRAE recommendations that are the problem here.

First of all, is anyone really still running servers that can’t handle the ASHRAE recommended maximum inlet temperature of 80 degrees for 20 minutes without failing the hardware? My oldest machines in service are older Rackable Systems 2x dual core Opteron 1U boxes with 2004 BIOS dates on the Tyan motherboards. These servers are not only perfectly happy at the ASHRAE maximum recommendation, they will run rock solid for months at inlet temperatures in the high 90s in free air cooling configurations.

The next thing is, a 30 degree inlet to outlet temperature differential is “good”? Really? My old Rackables with their highly congested 1U cases produce a 13 degree differential between inlet and outlet. If you have a server gaining 30 degrees at the outlet, you have a server with a severe internal airflow problem. Of course, restricting airflow makes the server fans work harder, driving up the amount of power they require.

So, the original poster pulled a truly jackass stunt:

1. He “accepted” and commissioned a new data center without properly testing the systems.

2. He ran live servers in a data center with no power backup for the cooling systems, only the servers themselves.

3. He allowed servers to run to failure in a completely defective environment. Of course, we’re never told what the actual inlet temperature was on the servers when they failed; it could have been far higher than the ASHRAE recommended maximum.

The problem here isn’t the inlet temperature recommendation, it’s a defective data center design combined with defective operations (did anyone do a maintenance plan before running that fire alarm test?).

I guess if you can’t figure out that backup power for the servers alone, with no backup power for the cooling, isn’t adequate to run anything, then you’d better unload lots of money running your data center temperature low enough to give you time to fix a totally preventable goof.

As for the rest of us, we’ll avoid designing and operating in jackass mode.


Failure to transfer: Bumbling data center power reliability, the iWeb outage.


Vern Burke, SwiftWater Telecom
Biddeford, ME

I’ve just been reading about the recent iWeb data center power failure. Lousy power design and botched operations strike again.

Even though the specifics of iWeb’s data center power configuration weren’t revealed, we can tell a lot from what actually happened. Due to a nearby fire, the data center operators made the decision to shift the facility to emergency power (an entirely reasonable move). The transfer switch serving one of 3 generators failed to transfer, leaving one third of the data center dark when the UPS batteries ran out. Where do I start on the boneheaded tricks on this one?

First, we know that the 3 generators were allocated 1 to each third of the facility. This means no generator redundancy. It sounds good to say “we have 3 generators!” until you find out that they’re not being operated in parallel with at least 1 spare (n+1). Right idea, a total swing and whiff on the execution.

Second, it’s apparent that there was no manual bypass for the failed transfer switch. Were they expecting to have to shut down the whole 1/3 of the facility if they ever needed to work on that transfer switch? Dealing with a failed transfer switch shouldn’t be any more difficult than sending someone down to the power room to transfer the power manually.

Third, if they actually did have a manual bypass, were the data center operators informed by the monitoring systems that that section of the data center was still running from UPS, and was there enough battery run time to get someone to the power room to pull the manual bypass? This is the big problem I have with super short run time backup power such as flywheel UPS. If things don’t go absolutely perfectly in the 15 seconds of runtime you get, you don’t get a chance for a manual fix; you’re going down, period. Of course, splitting the generators into separate “zones” makes the short runtime problem far worse, since it’s much more likely that you’re going to have a total failure with a single generator.

It’s apparent from the article that a number of big name providers are doing a similarly lousy job with their backup power redundancy, judging by four transfer switch failures this year, each with a major loss of data center services. It’s really a rather pathetic performance.

So, what’s the takeaway from all of this?

1. If you’re going to run multiple generators, run them in parallel and at least n+1. I don’t care how many generators you have, if you’re allocating single generators to single zones, you’re vulnerable.

2. If you’re not going to run the generators in parallel, at least provide enough run time from the batteries to deal with the problems you know are going to come up. I don’t care how often you test, if you’re running single generators, failure is going to happen (with this configuration, they could easily have had this happen during a test!).

3. Make sure there’s a manual bypass for automatic transfer switches and that your operations people have the monitoring and the procedure to know when to pull it.

In a substantially sized data center, the consequences of failing to transfer cost a lot more than doing things right the first time would have.

iWeb, data center bozos of the week (squeaky red noses are on the way!).

Email or call me or visit the SwiftWater Telecom web site for green data center services today.


Tuesday data center tidbits: PayPal doesn’t pay, scary virtual servers


Vern Burke, SwiftWater Telecom
Biddeford, ME

First up today is the piece about the recent 3+ hour data center failure at PayPal. Yet another self-inflicted injury by a major company that should know better. Just in case the message isn’t clear, it might be a good idea to actually TEST your failover procedures BEFORE things actually fail. (Hmmmm, I wonder where you’d put the parking brake in a data center 🙂 )

Next is the piece about the “scary side” of virtual servers. Everyone is having major angst over theoretical security “problems” when all of the issues I see are failures to follow the same well established best practices that work just fine for physical machines. It’s far easier for an attacker to compromise a virtual server with bad sysadmin practices than to invent a totally new hypervisor attack. Oh, and if you can’t trust your data center admin to do his job without being physically roadblocked by other departments, you have worse problems than virtualization.

Email or call me or visit the SwiftWater Telecom web site for green data center services today.


Friday data center tidbits: data centers gone wild, Facebook behind the times


First up today is the piece about the federal government “finding” 1000 more data centers than they thought they had. First of all, how does your data center inventory get so bad that you lose track of 1000 data centers? Second, how in the world do you allow the data center sprawl to get so far out of control? That’s a total of 2100 federal data centers, an average of 42 data centers for every single state. Last but not least, who in the world would think it’s a bad idea to consolidate these?

The federal government, data center bozos of the week; the truckloads of red squeaky noses are on their way.

The next piece is about Facebook saving big by retooling its data center cooling. Really, is it big news that not mixing cold intake air with hot exhaust air is a good idea? If Facebook is pushing this as a “look how great we are” point, they’re about 5 years too late.

Finally, here’s a survey about the causes of data center downtime. Not maintaining the data center backup batteries and overloading the backup power are just plain silly, but the telling one for me is 51% accidental human error, including false activation of the EPO (emergency power off). I’ve said it before, the gains that the National Electrical Code allows the data center in exchange for the EPO are NOT worth this kind of failure. EPO=EVIL, period.

Email or call me or visit the SwiftWater Telecom web site for green data center services today.

Vern


Mucking out the data center.


I’ve just been reading this piece about data center contamination. There are certainly a few good points in there, but I think it misses the point by a mile on a lot of it.

There are certainly a number of contaminants that can play havoc with a data center. Copier and laser printer toner is definitely a bad one, very abrasive. Airborne sand and volcanic ash are terribly abrasive as well. It’s also true that gaseous contaminants can cause oxidation, but I’m not sure what the average data center could reasonably do to counteract that.

At this point, I diverge. First of all, cleaning a data center to “clean room” standards is ridiculous. Most data centers are NOT built as clean rooms, and the first person to open a door or the first puff of cooling air under a raised floor used as a plenum is going to blow that in a big hurry. Not to mention that the article recommends doing this BEFORE raised floors, servers, and cable racks are installed; what about the dirt generated by installing all of that?

Second, what data center equipment really needs or would even measurably benefit from this level of cleanliness? Data center equipment will usually go FAR beyond its useful lifespan with a lot less cleaning than this. Cleaning the data center as if the same IT equipment were going to be run for 20 years is an awful waste of money.

Sticking with hard surfaces and avoiding fabrics in the data center is certainly a good idea, as are mats to trap as much incoming dirt as possible. Proper filters for air handling equipment are also a smart idea.

Zinc whiskers are probably the only contaminant that stands a significant chance, even in tiny quantities, of causing an IT equipment failure in the data center. Zinc whiskers form on the surface of zinc plated metal, such as raised floor tiles and frames. The little whiskers break off in the air flow (such as when a raised floor is used as a cold air plenum) and then head right for vulnerable parts of IT equipment, such as power supplies.

There are a bunch of good reasons NOT to use raised floor in the modern data center (added floor weight for no benefit, expense, under floor fire suppression, etc.), but zinc whiskers and the inability to decently clean under the tiles are the biggest. Dump the raised floor and you dump nearly everything that could really cause your equipment problems, as well as a major source of dirt.

Cleanliness in the data center is certainly an important issue, but keeping known offenders like raised flooring out and doing smart, cost effective cleaning instead of trying to achieve clean room perfection is a lot more sensible.

Email or call me or visit the SwiftWater Telecom web site for green data center services today.

Vern


The top data center operations mistake.


I was just reading about the top 10 data center operations mistakes and I noticed they completely missed by far the most important one of all.

It’s certainly true that lack of proper procedures, training to properly follow the procedures, and training to make sound decisions when the procedure isn’t up to the situation are all important to smooth data center operations. The military wisdom that no plan of battle ever survives contact with the enemy frequently comes into play. The most beautifully crafted and approved procedures don’t mean a thing when it comes to an unanticipated situation and nobody involved can make a smart decision on their own.

The biggest mistake by far, in my opinion, and the one we’ve seen the most examples of in the last several years of data center failures, is failure to analyze risks sensibly. A large percentage of these outages have been the direct result of high risk, zero gain actions.

One good example of this is the irresistible urge to mess around inside of live electrical equipment. There is little to nothing you can do in a data center that is more dangerous or higher risk than working on exposed high voltage power, both to the health and safety of the person doing the work and to the operation of the data center itself. The result of screwing this up can be spectacularly catastrophic (witness the transformer explosion at The Planet’s data center in 2008 due to a miswired added high voltage circuit).

Given the hazard, you would think that data centers would reserve that for only the most extreme need; however, many of those failures have been for the most trivial of purposes. One of the bad ones from this year involved a “maintenance” operation to manually check phase rotation in a panel (the correct A-B-C ordering of 3 phase power). Since phase rotation is pertinent only to 3 phase electric motors, it is a ZERO issue for data center IT equipment and far from justifiable, given the risk.

It comes down to a healthy dose of common sense. If you’re going to take all of your data center generator capacity offline to do something trivial and you only have 15 second run time flywheel UPS systems, that’s probably a BAD choice. If you know you can restore generator capacity in far less time than the run time of your UPS, that makes a lot more sense.

Don’t do risky things for no gain and you’ll avoid the nightmares of the most preventable data center operations mistakes. It’s just that easy.

Email or call me or visit the SwiftWater Telecom web site for green data center services today.

Vern


Friday data center tidbits: the cloud industrial revolution and more.


First up today is a piece likening the move to cloud computing to the change from steam power to electricity. It sure is, complete with the same recurring bunch of Luddites trying to derail it. Funny how history repeats itself.

Next up is the piece about saving money by building multiple data center tiers in the same facility. First off, I don’t know how this qualifies as “greening” the data center. Second, is it really news that you should avoid under-building or over-building the data center relative to the needs of what’s going in it?

Finally, there’s the speculation that Twitter’s new data center will put paid to the Fail Whale. I doubt it, but now there won’t be any guesswork as to who’s REALLY at fault when it blows up.

Email or call me or visit the SwiftWater Telecom web site for cloud computing services.

Vern