Tag Archives: redundancy

The green data center: no trade-off required for reliability.


Today I was reading about the “trade off between reliability and efficiency” in the data center. The idea that you have to give up one for the other is far from the truth.

Part of the problem that causes this kind of misconception is obsolete classification systems, such as the Uptime Institute’s tiers (I’ve written before about my problems with that particular classification system). In the example given in the article, the data center operator in question had to maintain 1-to-1 hot standby servers for every operating server to achieve that particular tier rating, as if reliability couldn’t be achieved by anything less than an exact duplicate of every piece of gear in the data center. Of course, the 2N approach ignores one possibility: what if you spend all that money and energy running 1-to-1 hot standbys, the primary fails, and then the secondary immediately fails anyway?

Of course, the Uptime Institute’s response to this is to announce yet ANOTHER data center efficiency metric.

This also spotlights the weakness of using PUE as anything but an internal engineering aid. It sounds great that you have a 1.8 PUE, but since PUE makes no reference to the amount of work being accomplished, you could still be wasting half of the energy consumed on equipment producing no useful work. The cost of operating this way, plus the likely upcoming carbon penalties, will melt your credit card.
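
To see why, here’s a quick back-of-the-envelope sketch in Python (the numbers are made up for illustration, not taken from the article): two facilities with an identical 1.8 PUE, one of which burns half its IT load on idle 1:1 standbys. The PUE is the same, but the energy spent per unit of useful work doubles.

```python
# Illustrative numbers only. PUE = total facility power / IT power,
# so it says nothing about whether the IT power is doing useful work.

def pue(total_kw, it_kw):
    return total_kw / it_kw

it_kw = 500.0              # IT load at the meter, both sites
total_kw = it_kw * 1.8     # 900 kW total for a 1.8 PUE facility

useful_it_a = 500.0        # site A: every server is doing real work
useful_it_b = 250.0        # site B: half the fleet is idle 1:1 hot standby

print("PUE (both sites):        ", pue(total_kw, it_kw))    # 1.8
print("kW per useful kW, site A:", total_kw / useful_it_a)  # 1.8
print("kW per useful kW, site B:", total_kw / useful_it_b)  # 3.6
```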

So, how do you combine green data center techniques with high reliability? Here’s my recipe:

Add a green modular DC power plant with enough modules to provide n+1 (or 2 or 3) capacity. Split AC power feeds for the modules between 2 or more AC sources.

Add 2 parallel battery strings.

Add in 1 cloud computing cluster, such as the excellent Xen Cloud Platform we use. Provision enough cloud hosts for n+1.

Split the cloud hosts across multiple DC power feeds.

Add in redundant storage servers.

Add in a load balancing system capable of automatically restarting virtual machines if a cloud host fails.

Add in good site design practices.

(Note: this is exactly the way we run. A rough sizing sketch for the N+1 pieces follows below.)
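
Here’s that sizing sketch. The load figures and the module and host capacities below are made-up examples, not our actual plant numbers.

```python
import math

def n_plus(load, unit_capacity, spares=1):
    """Units needed to carry the load, plus the requested number of spares."""
    n = math.ceil(load / unit_capacity)
    return n + spares

# DC power plant: 18 kW of -48V load served by 2 kW rectifier modules, N+1.
print("rectifier modules (N+1):", n_plus(18, 2.0, spares=1))   # 10

# Cloud hosts: 120 VMs at 16 VMs per host, N+1 so one host can fail and
# the load balancer can restart its VMs on the spare capacity.
print("cloud hosts (N+1):      ", n_plus(120, 16, spares=1))   # 9

# For comparison, a 2N plant doubles the module count for the same load.
print("rectifier modules (2N): ", 2 * math.ceil(18 / 2.0))     # 18
```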

The fact is that nothing short of a smoking-hole disaster is likely to interrupt the service provided by this configuration for longer than the time required to automatically restart a virtual machine. If the time to restart is an issue, specify a backup virtual machine on a different host in the cloud. Protect against a data-center-wide disaster with a duplicate configuration on a cloud in a second data center.
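
For the curious, here’s a sketch of that restart decision. This is just the placement logic, not the actual Xen Cloud Platform API, and the host names, capacities, and VM names are hypothetical.

```python
# Hypothetical hosts: host-a has failed, host-b and host-c survive with
# some free capacity for restarted virtual machines.
hosts = {
    "host-a": {"up": False, "free_vm_slots": 0},
    "host-b": {"up": True,  "free_vm_slots": 3},
    "host-c": {"up": True,  "free_vm_slots": 5},
}

def restart_target(failed_host, hosts):
    """Pick the surviving host with the most headroom for the next restart."""
    candidates = [
        (name, h["free_vm_slots"])
        for name, h in hosts.items()
        if name != failed_host and h["up"] and h["free_vm_slots"] > 0
    ]
    if not candidates:
        return None  # nowhere left to restart; time for the second data center
    return max(candidates, key=lambda c: c[1])[0]

for vm in ["web-1", "web-2", "db-1"]:
    target = restart_target("host-a", hosts)
    print(f"restarting {vm} on {target}")
    hosts[target]["free_vm_slots"] -= 1
```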

It’s really not that hard to achieve high reliability with green data center techniques (even if Microsoft, Amazon, Google, and Rackspace make it look like it is). Deep-six the antiquated thinking, and your wallet and the planet will thank you for it.

High reliability services available!

Vern


Is your data center the hero or the goat?


The guys from IBM can’t seem to catch a break. First it was a spectacular crash and burn in New Zealand; now it’s the State of Texas dropping IBM.

Thirteen days of outage for a government agency of any sort is painful enough. IBM, however, made it worse by blaming the outage and data loss on an old piece of SAN equipment it took over from Texas as part of its contract to upgrade and replace the systems in question. IBM knew the equipment was suspect and simply didn’t bother to put even a temporary backup in place until it was replaced.

Self-inflicted catastrophes are the worst. It doesn’t matter whose equipment it was originally; once you take that kind of responsibility for it, you take the lumps when it cracks up. I’m sure Texas wasn’t paying for temporary backup space for this thing, but providing it anyway would have been an easy way to look like the hero, and far cheaper than what was lost. Instead, IBM ends up as the goat, not to mention taking the credibility whack for knowing the equipment was an extreme risk and allowing the failure to happen.

How much is it worth to your data center to be the hero and not the goat?

Vern, SwiftWater Telecom

data center facility engineering

Adventures in the data center: The cascade effect …


This weekend’s adventure in the data center came courtesy of an expensive (and now defunct) APC rackmount power strip. What was interesting was not the failure of the unit itself (although I’ve never seen a catastrophic power strip failure in my 20+ years of doing this) but the cascading effects.

What are cascading effects? Cascading effects are widespread problems that begin from a small starting point and snowball into a much larger problem. In this case, the APC failure caused a fair amount of equipment other than what was directly connected to it to fail (leading to a near meltdown of the BlackBerry).

Cascades can make it very difficult to diagnose the fault. Large amounts of “collateral damage” can lead to information overload (yes, there IS such a thing as too much info!). This can easily lead to a diagnosis that takes far too long or comes out wrong (which can be awful when a tech drives 3 hours with the wrong part).

The first key to preventing the dreaded cascade is segmentation. This provides natural “breaks” that block the spread of a cascade, limiting the size and effect of the failure. Remember, the smaller the scope, the more likely a fast and accurate diagnosis is. It may seem easier and faster to just tie everything into a flat, unified architecture, but that’s just opening the door to exactly this kind of cascade.
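
As a toy illustration of what segmentation buys you (the device counts are invented), compare the blast radius of a single distribution failure when everything hangs off one point versus one power strip per cabinet.

```python
# Invented counts: 120 devices spread over 10 cabinets.
devices_total = 120
cabinets = 10

# Flat design: one shared distribution point feeds everything,
# so one failure can touch every device.
flat_blast_radius = devices_total

# Segmented design: one strip per cabinet, so a strip failure
# is contained to that cabinet's devices.
segmented_blast_radius = devices_total // cabinets

print("devices at risk, flat design:     ", flat_blast_radius)       # 120
print("devices at risk, segmented design:", segmented_blast_radius)  # 12
```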

The second key is to program equipment intelligently: if not to stop the cascade, at least not to contribute to it. A great example of how to do this wrong was one of the recent Google outages. As an overloaded router responded slower and slower, it recognized that it wasn’t performing to par and simply dumped its load on the rest of the routers in the pool. This caused more routers to overload, dumping their loads onto the ever-shrinking pool, until, finally, there weren’t any routers left to serve any traffic at all. Now you have the interesting conundrum of how to restore service, since any router you managed to clear and restore to service would almost immediately overload again!
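
Here’s a toy simulation of that failure mode (the load and capacity figures are invented, not Google’s): every time a router sees more than it can handle, it drops out and dumps its entire load on the survivors instead of shedding the excess, and the pool shrinks to nothing.

```python
def cascade(total_load, capacity_per_router, routers):
    """Simulate routers that dump their whole load instead of shedding excess."""
    in_service = routers
    while in_service:
        load_each = total_load / in_service
        if load_each <= capacity_per_router:
            print(f"{in_service} routers stable at {load_each:.0f} units each")
            return
        print(f"{in_service} routers at {load_each:.0f} units each, one gives up")
        in_service -= 1  # the overloaded router bails out, load redistributes
    print("0 routers left: total outage, and any router restored overloads again")

# 10 routers, each good for 100 units, facing 1,050 units of traffic:
# shedding 50 units would have saved the pool, dumping loses everything.
cascade(total_load=1050, capacity_per_router=100, routers=10)
```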

The final key is not to overload yourself with information about the failure. Identify dependencies and set your monitoring system up accordingly (device A depends on device B to work, so don’t alarm on A if B is down). Remember, concise and accurate information gets the diagnosis and the fix the fastest!
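
Here’s a minimal sketch of that dependency rule, with a hypothetical device map: alarm only on the device closest to the root cause and suppress everything downstream of it.

```python
# Hypothetical dependency map: each device lists the upstream device it
# needs in order to be reachable (None marks the top of the chain).
dependencies = {
    "server-1": "switch-1",
    "server-2": "switch-1",
    "switch-1": "router-1",
    "router-1": None,
}

def should_alarm(device, down_devices, dependencies):
    """Alarm only if no upstream dependency is also down."""
    parent = dependencies.get(device)
    while parent is not None:
        if parent in down_devices:
            return False  # suppress: the root cause is upstream
        parent = dependencies.get(parent)
    return True

down = {"switch-1", "server-1", "server-2"}
for device in sorted(down):
    verdict = "ALARM" if should_alarm(device, down, dependencies) else "suppressed"
    print(device, "->", verdict)   # only switch-1 should alarm
```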

That’s a brief rundown on keeping the evil cascades out of your data center. Now I’m going to take this APC unit out in the parking lot and run over it with a truck.

Vern, SwiftWater Telecom
data center, web hosting, Internet engineering

Who’s the fool? Data center redundancy.


Tonight’s post comes courtesy of the article “Only fools do N+1.” I’m not a fool, but I’ll examine where this concept goes wrong.

First of all, just what is N? N is the amount of equipment required to support operation. So what is N+1? It’s a redundancy formula that calls for the amount of equipment needed to support operation plus one spare unit, which allows the failure of one unit without disrupting service. For example, if the load needs 4 power modules, N = 4, N+1 means running 5, and 2N means running 8.

The source article claims that anything less than 2N, completely separate duplicate sets of equipment, is unacceptable. That kind of redundancy is great for a nuclear power plant or a hospital, but it’s massive overkill and a waste of money for the rest of us.

The first myth is that a failure in N+1 typically cascades into more than one failure. By definition, this doesn’t happen, since N is enough to handle the load by itself. The correct decision here is based on the probability of another failure versus the time required to repair. If the repair time is long and there is real potential for another failure, go with N+2 or more to ensure coverage.
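
Here’s a rough way to put numbers on that decision (the MTBF and repair times below are illustrative assumptions, not measured values): estimate the chance that one of the surviving units fails while the first failed unit is still being repaired.

```python
import math

def p_second_failure(n_units, mtbf_hours, repair_hours):
    """Probability that at least one of the remaining N units fails during
    the repair window, using a simple exponential failure model."""
    rate = n_units / mtbf_hours        # combined failure rate of the survivors
    return 1 - math.exp(-rate * repair_hours)

# N = 8 modules carrying the load, 200,000 hour MTBF per module.
print("24 hour repair:", p_second_failure(8, 200_000, 24))    # about 0.1%
print("2 week repair: ", p_second_failure(8, 200_000, 336))   # about 1.3%
# If the repair window stretches into weeks, N+2 starts to look worthwhile.
```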

The second argument is that an N+1 system has at least one single point of failure in it. That may be true in some configurations, but consider the single point and how likely it is to fail.

As an example, I use a DC power plant in the data center. This is a modular unit, so I have one extra module plugged in for redundancy. Is there a single thing in this unit that could interrupt service? The bus bars. I consider the odds of a bus bar failing to be too small to justify the massive expense of a complete second system. No other failure in the installation will interrupt service, not even a failure of the single AC power feed.
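
For a sense of the money involved, here’s an illustrative comparison (the load and module price are assumptions, not our actual figures): N+1 adds one module, while 2N at minimum doubles the module count before you even price a second set of bus bars and distribution.

```python
# Assumed figures for illustration only.
load_kw = 12.0
module_kw = 2.0
module_cost = 1500                      # assumed price per rectifier module

n = int(load_kw / module_kw)            # 6 modules to carry the load
n_plus_1_cost = (n + 1) * module_cost   # one spare module on top of N
two_n_cost = 2 * n * module_cost        # module cost alone for a duplicate plant

print("N+1 plant cost:", n_plus_1_cost)  # 10500
print("2N plant cost: ", two_n_cost)     # 18000
```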

And I did it without wasting a ton of cash on a full duplicate system.

Vern, SwiftWater Telecom
Data center, web hosting, Internet engineering