Category Archives: data center disaster recovery

High availability, bulletproofing the cloud.

This year has seen some very high profile failures of major cloud computing providers. What stands out for me in these incidents is an almost total inability to restore running customer workloads without major manual intervention, usually by the customers themselves.

Silly human-caused outages aside, data center infrastructure is almost guaranteed to suffer outages, despite the best efforts to the contrary. Physical equipment is fallible, all the way from the data center backup power generators to the fanciest server. Cloud computing magnifies this considerably, since the same infrastructure supports 10 to 20 times the number of customers.

What’s giving cloud computing a bad name for reliability isn’t the failures, it’s the lousy response to them. Taking hours or even days to restore customer workloads when the cloud provider still has operating capacity (partial failure) is purely ridiculous. Expecting customers to monitor their virtuals themselves and deal with manually restarting them after a failure is guaranteed to make people even unhappier. This doesn’t even take into account the irritation at the service going down in the first place. I think there’s a WAY better way to handle this.

For quite a while now, our open source project, Xen Cloud Control System for the excellent Xen Cloud Platform, has featured the ability to automatically recover from most cloud failures. Your virtual machine stops running? XCCS automatically restarts it. A physical cloud host fails? XCCS restarts everything that was running on it on the other hosts. No muss, no fuss.
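The host-failure recovery described above boils down to a scheduling step: when a physical host dies, put its virtual machines somewhere else and restart them. Here's a minimal Python sketch of that idea (illustrative only, not XCCS's actual code; the function, policy, and data names are all mine):

```python
# Sketch of host-failure recovery: redistribute the failed host's VMs
# across the surviving hosts, least-loaded first. The placement dict
# maps host name -> list of VMs running there. Names are illustrative.

def recover_host_failure(placement, failed_host):
    """Return a new placement with the failed host's VMs moved to survivors."""
    survivors = {h: list(vms) for h, vms in placement.items() if h != failed_host}
    if not survivors:
        raise RuntimeError("total cloud failure: no hosts left to restart on")
    for vm in placement.get(failed_host, []):
        # pick the surviving host currently running the fewest VMs
        target = min(survivors, key=lambda h: len(survivors[h]))
        survivors[target].append(vm)
    return survivors

placement = {"host1": ["web1", "db1"], "host2": ["web2"], "host3": []}
print(recover_host_failure(placement, "host1"))
# → {'host2': ['web2', 'db1'], 'host3': ['web1']}
```

XCCS works against the actual Xen Cloud Platform pool state rather than a dict, of course; the sketch just shows one reasonable placement policy (least-loaded first).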

With the release of XCCS version 0.5.4 today, we’ve introduced the ultimate automated feature to make sure the customer’s service stays up and running, no matter what. The new “unfriend” feature ensures that two virtual servers that are unfriended will never be running on the same physical host server. This means that a partial failure of the cloud or data center infrastructure will NOT take out two redundant virtual servers. Combine this with the automatic restart facility of XCCS and the customer’s service doesn’t even sneeze at the failure.
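In scheduler terms, unfriending is an anti-affinity constraint on placement: a candidate host is rejected if it already runs any VM unfriended with the one being placed. A hedged Python sketch of the rule (the names and data structures here are mine, not the XCCS implementation):

```python
# Anti-affinity ("unfriend") placement check, illustrative names only.
# hosts: {host: set of VMs running there}
# unfriends: set of frozenset pairs that must never share a host

def allowed_hosts(vm, hosts, unfriends):
    """Return the hosts where vm may start without violating an unfriend pair."""
    return [h for h, running in hosts.items()
            if not any(frozenset((vm, other)) in unfriends for other in running)]

hosts = {"hostA": {"lb1"}, "hostB": {"web1"}, "hostC": set()}
unfriends = {frozenset(("lb1", "lb2"))}
# hostA is excluded because it already runs lb1, lb2's unfriended partner
print(allowed_hosts("lb2", hosts, unfriends))  # → ['hostB', 'hostC']
```

Using an unordered pair (`frozenset`) makes the constraint symmetric: unfriending lb1 from lb2 automatically keeps lb2 away from lb1 as well.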

Want the ultimate in bulletproof web servers? Take two virtual server load balancers, set up heartbeat for failover, and then unfriend them. Add two (or more) virtual web servers and unfriend them. Now you have a self-healing web service that will NOT go down for anything short of a total smoking-hole disaster. Completely automatic disaster recovery, courtesy of the RELIABLE cloud. This is the way cloud computing should be.

Call me at 207-399-7108 or email me today for bulletproof virtual servers that you can count on to stay up and running through anything!




Wednesday data center tidbits: Digg lowers the bar

First up today is the piece about the ongoing problems with Digg’s new web site. From the article, we get:

“Our service was falling over and crashing non-stop,” Rose said in the podcast. “It’s still crashing.”

“Even up until days before the launch there were bugs with our datastore,” Rose said. “The plan was to get this live. We knew there would probably be bugs under load. So we launch the site, and it falls over.”

It’s bad enough to have the unexpected happen and whack things or not to test things adequately, but what kind of dimwits would throw something into service that they knew full well in advance had serious stability problems? There aren’t enough red squeaky noses in the world to cover this one.

Email or call me or visit the SwiftWater Telecom web site for green data center services today.



Thursday data center tidbits: “ventilate” the server, Microsoft bumbles the cloud

First up today is the piece about an employee getting drunk and shooting up his company’s server with a .45. I confess to having had the urge to kick a server from time to time, but never to shoot one. If this isn’t a call for cloud computing, I don’t know what is.

Next up is Microsoft botching their cloud computing service for two hours. I don’t know what to add to this except, thanks for doing your part to add to the perception that cloud computing is unreliable, twits.

Email or call me or visit the SwiftWater Telecom web site for green data center services today.



Extreme weather and the data center.

I’ve been sitting here this evening operating the data center under extreme weather protocols due to wild electrical storms and tornado warnings. I thought I’d take a few minutes and discuss how to protect a data center during extreme weather events.

Whether you subscribe to the idea of global warming or not, it’s apparent that this has already been a bumper year for violent weather. High winds, lightning, heavy rain, none of it is very conducive to keeping the data center up and operating. Obviously, being able to shut down is the best protection (this is where cloud computing really shines, the capability of moving services out of harm’s way), but what do you do when you can’t just shut it all down?

Here’s my weather protocol for tonight:

1. Identify critical services and the capacity needed to minimally run them. In this case, I was able to substantially reduce data center power load by shutting down redundant services and shutting down cloud computing capacity that wasn’t required to keep the critical services operating. Remember, reduced power load means extended run time on the backup power.

2. Transfer workloads to an alternate data center.

3. Reduce cooling capacity to reflect the lower data center power load (less load, more run time!). Ensure that there is no water or wind infiltration via the cooling system intake vents. In my case, I switch the free air cooling system to passive intake to avoid blowing in water.

4. Secure all windows and doors against high winds. If an area can’t be reasonably secured, such as an area with large, vulnerable, plate glass windows, secure inner doors to isolate the vulnerable area.

5. Reduce power equipment capacity to match the power load reduction. Open breakers or unjack equipment to isolate it from any damage from extreme power events, such as a close lightning hit on the commercial AC power.

6. Make sure that emergency supplies and emergency lighting are all up to par.

7. Know what to grab and take and how to secure the data center in case the situation is bad enough to require abandoning the data center.

My previous post on dealing with a data center flood applies here as well.

Follow these protocols or use them as a starting point for your own and you’ll find that your data center can make it through almost anything Mother Nature can throw at you intact.
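The run-time payoff of load shedding in step 1 is easy to put in rough numbers. The sketch below uses simple amp-hour arithmetic with made-up figures and an assumed 48 V DC bus, and it ignores the Peukert effect and conversion losses, so treat it as a back-of-the-envelope estimate only:

```python
# Rough battery run-time estimate: capacity (Ah) divided by load current (A).
# All figures here are illustrative; real run time varies with battery
# chemistry, age, temperature, and discharge rate (Peukert effect).

def runtime_hours(capacity_ah, load_watts, bus_volts=48.0):
    """Estimated hours of battery run time at a constant load."""
    return capacity_ah / (load_watts / bus_volts)

full = runtime_hours(800, 9600)   # full data center load
shed = runtime_hours(800, 4800)   # half the load shut down per step 1
print(full, shed)  # → 4.0 8.0
```

The relationship is linear in this simple model: halve the load and you double the run time, which is exactly why shedding redundant services is the first step in the protocol.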

Email or call me or visit the SwiftWater Telecom web site for green data center services today.



How to deal with a data center flood.

Next to fire, flooding is one of the most catastrophic things that can happen to a data center. In this post, I’ll give some tips on how to minimize the impact of data center flooding, before, during, and after the event.

BF (Before Flood)

1. Place all electrical entrances, transfer switches, UPS, generators, and batteries above grade. This pretty much goes without saying. The basement looks like a convenient place for infrastructure gear, but it’s also a dead-center bullseye for flooding.

2. Don’t bury transformers in below grade vaults.

3. No basement drains, sinks, or toilets.

4. Thoroughly seal any below grade utility entrances, such as fiber.

5. Ensure that all cooling air intakes and generator air intakes and exhausts are well above any expected flood level. Suck contaminated flood water into a running generator and you’ve got a major problem.

6. Stage portable plank barriers to temporarily raise door sill levels.

DF (During Flood)

So, the flood turns out to be bad enough that you have to abandon the data center. How do you prepare the facility for the worst?

1. Shut down, lock off, and de-energize all generator controls.

2. Lock off all battery strings and remove the intercell connectors. This lowers the chances of the battery discharging through the contaminated flood water.

3. Top off all flooded-cell batteries with water. This helps prevent contaminated water from getting in.

4. Shut down the data center and pull the main AC power. Water and electricity do NOT mix.

5. Move the most vulnerable and portable things to a safe location, possibly on an upper floor. This includes spare rectifier modules, other spare equipment, paper records, backup media, and server hard drives.

AF (After Flood)

Now the worst is over, how do you mop up?

1. Pump out all contaminated flood water immediately.

2. Bring in temporary power sources, restore the ventilation systems, and get building and contents dried out as quickly as possible. Do not restore any equipment or attempt to operate any generator that has been under water until it’s thoroughly dried, inspected, and (in the case of a generator) rehabbed.

3. Disconnect all loads from the power equipment. Restore the rectifiers or UPS and the battery string.

4. Restore equipment in small steps, starting with the network connections and working back toward the servers. Inspect all restored equipment for catastrophic failures when power is reapplied (this is why you just don’t restore the whole thing at once).

5. Inspect equipment that has been under water on an ongoing basis for corrosion problems and replace as required.

Hopefully you’ve built your data center in a location that’s unlikely to flood, but, if the worst happens, you’re now prepared to deal with it.

Email or call me or visit the SwiftWater Telecom web site for green data center services today.



Extreme weather, #datacenter DC power, and #cloudcomputing.

Or, as the alternate title that comes to mind, “Mother Nature throws a hissy fit”. I’m going to talk in this post about how all of the above link together to affect data center and cloud computing reliability.

This year, it seems like the news is full of stories of violent weather around the country, and it only seems to be getting worse. Even areas that traditionally have had fairly stable weather have been seeing massive storms with damaging winds and flooding rains. For the first time in the 6 years I’ve been in this location (not to mention most of my life in this state), we’ve had two major spring/early summer storms with winds in excess of hurricane force.

So, how does this relate to data center and cloud computing reliability? The last storm materialized in only 5 minutes, produced torrential downpours and 100 mph winds, and caused large amounts of havoc with the commercial AC power supply to the data center. I’m a great proponent of green DC power in the data center, so the power distribution is primarily DC, with good quality traditional protection on the AC feeds for the rest of the equipment.

Unfortunately, the AC power excursion events from the severe weather were wild enough that the classic power protection turned out to be inadequate. The cloud computing servers themselves, powered by DC as they are, survived just fine. Both the primary and backup storage systems, powered from the AC, did not.

After several days of cleaning up the mess and getting the cloud restored and back on line, there are a number of takeaways from this.

1. It’s hard to go overboard engineering your data center for extreme weather, whether it’s likely to happen or not.

2. Data center DC power is a LOT more resilient than the best protected AC power. As a result of this, all required AC powered equipment is now on the DC power via inverters. This isn’t as green of course, but it isolates the equipment much better from radical power fluctuations in the data center AC supply.

3. In a cloud computing environment, make sure all the parts of the cloud have the same level of resiliency. There’s no point to keeping the front end alive when the back end goes down.

Finally, I’ve talked in a previous post about using DC power with a long run battery string to shift heat load in the data center. A DC power system with a long run time is also great protection against this type of event. No matter how fast or unexpected the severe weather is, simply shut down the rectifiers in minutes, run from the batteries, and you have the ultimate in isolation from AC power excursions.

Or, we could just write Mother Nature a prescription for Prozac.

Email or call me or visit the SwiftWater Telecom web site for cloud computing services.


Intuit, #datacenter, and #cloudcomputing (the three horsemen of the apocalypse)

I knew it was going to happen just the moment I read about Intuit face-planting their data center and web sites for 36 hours. The anti-cloud-computing crowd are out in force with their mantra that this “proves” that cloud computing is unreliable. What it does prove is that if people can’t come up with a good argument against something, a silly one will do in a pinch.

So, what exactly happened with Intuit? We do know that a power failure as the result of “routine maintenance” took down both primary and backup servers. I haven’t seen a detailed analysis of it yet, but a little deductive reasoning will reveal the likely chain of events.

Unless the data center power design is totally nuts, any power failure that takes out both primary and secondary systems would have to be in the high voltage primary power coming into the data center (à la the catastrophic power failure at The Planet’s data center in 2008). “Occurred during routine maintenance” is a code phrase that roughly translates as “We were screwing around inside of live power equipment doing something we didn’t really need to be doing and someone messed up”. This has been the cause of many data center power failure events over the last year.

Looking at the history of these events, it’s easy to see this has no relationship whatsoever to cloud computing, nor does it reveal any inherent weakness in cloud computing. So, just what does this outage show?

First, the folly of putting all your critical services in one data center.

Second, that it takes a total “smoking hole” disaster to disrupt cloud computing (showing up the lie that cloud is less reliable than a dedicated server).

Third, that Intuit (and other cloud providers) don’t understand that the consequences of failure in a cloud are far higher than the equivalent failure of a dedicated server and their infrastructure has to be designed for that (failure of a single cloud host will take out 10x or more the service that failure of a single dedicated server will).

Fourth, that Intuit (and other cloud providers) don’t take advantage of the features of clouds to automatically restore downed services. Have a hardware failure in our cloud and, as long as any of the cloud is still running, virtual servers will be restored and running in 15 minutes or less.

Fifth, that Intuit (and other cloud providers) fail to correctly assess the risk of doing “routine maintenance” on live data center power equipment.

So, what does this leave us with? Understand that a single cloud server is far more important than a single dedicated server, segment power so that no one failure will kill everything, run backup services in a separate data center, automate cloud disaster recovery, and stop monkeying around inside of live power equipment.

Would you look at that, it isn’t cloud computing’s fault after all.

Email or call me or visit the SwiftWater Telecom web site for cloud computing services.