Tag Archives: disaster recovery

What you REALLY should be asking your cloud computing provider.


Vern Burke, SwiftWater Telecom
Biddeford, ME

I’ve been reading quite a bit of back and forth over cloud computing provider Service Level Agreements and endless checklists that purport to be everything a customer should ask a potential cloud computing provider. There’s a lot of aimless noise and sniping back and forth that is missing a very critical point.

Let me start off by saying, as an approach to ensuring service uptime, classic percentage based SLAs are worthless. The SLA tells you nothing about whether you can expect your cloud computing provider to keep your service running or not, only the penalty if they fail. Your goal as a cloud computing customer isn’t to extract the maximum compensation for the failure of your cloud service, it’s to ensure that your cloud service doesn’t fail in the first place!
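To put a number on how little the percentage tells you, here is a quick back-of-the-envelope sketch. The percentages are illustrative round figures, not any particular provider's terms, and the function name is made up for this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(sla_percent: float) -> float:
    """Hours of downtime per year that still satisfy an uptime percentage."""
    return HOURS_PER_YEAR * (1 - sla_percent / 100)


# Illustrative SLA levels only.
for pct in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(pct)
    print(f"{pct}% uptime still allows {hours:.2f} hours down per year")
```

A "three nines" SLA still leaves room for most of a working day of outage every year, and it says nothing about how the provider gets your workload running again.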

Failures of data center hardware are an inevitable fact of life, even with the most conscientious engineering and operation. The data center failures of the past year have shown that even the big data centers fall well short of the “most conscientious engineering and operation” goal. Given this fact of life, here are the things you should REALLY be asking your cloud computing provider.

1. Do they have the ability to automatically restart workloads on the remaining running physical cloud capacity if part of the capacity fails?

This is something that has stood out like a sore thumb with a lot of the big cloud providers. Failures of underlying physical hardware on Amazon’s AWS service kill the workloads that were running on that hardware, even though the vast majority of the capacity is still up and running in the cloud. If your cloud provider can’t automatically restart your workload in case of failure, run away.

In the SwiftWater Telecom cloud (using the Xen Cloud Control System) failed workloads are restarted automatically on remaining cloud capacity in minutes, not hours.

2. Do they offer a way to ensure that redundant virtual servers never end up on the same physical server?

It doesn’t do much good to have redundant virtual servers in the cloud if they all die when one single physical host dies.

In the SwiftWater Telecom cloud, we offer what we call the Ultra Server. The Ultra Server combines redundant virtual servers with the “unfriend” feature of XCCS. Unfriending ensures that the redundant servers never end up on the same physical server.

This means that nothing but a “smoking hole disaster” would ever cause a complete outage of an Ultra Server. Combine that with the automatic virtual server restoration of XCCS and the option to set up the Ultra Server between our separate data centers, and you have a cloud powered virtual server that is the next best thing to indestructible.

Stop concentrating on the penalties for failure. Ask the right questions to find the cloud computing provider who has the right tools to keep your service from failing in the first place.


High availability, bulletproofing the cloud.


This year has seen some very high profile failures of major cloud computing providers. One of the things that stands out for me in these is an almost total inability to restore running customer workloads without major amounts of manual intervention, usually by the customer themselves.

Silly human caused outages aside, data center infrastructure is almost guaranteed to suffer outages, despite the best efforts to the contrary. Physical equipment is fallible, all the way from the data center backup power generators to the fanciest server. This is magnified a lot by cloud computing, since the same infrastructure supports 10-20 times the number of customers.

What’s giving cloud computing a bad name for reliability isn’t the failures, it’s the lousy response to them. Taking hours or even days to restore customer workloads when the cloud provider still has operating capacity (partial failure) is purely ridiculous. Expecting customers to monitor their virtuals themselves and deal with manually restarting them after a failure is guaranteed to make people even unhappier. This doesn’t even take into account the irritation at the service going down in the first place. I think there’s a WAY better way to handle this.

For quite a while now, our open source project, Xen Cloud Control System for the excellent Xen Cloud Platform, has featured the ability to automatically recover from most cloud failures. Your virtual machine stops running? XCCS automatically restarts it. A physical cloud host fails? XCCS restarts everything that was running on it on the other hosts. No muss, no fuss.

With the release of XCCS ver 0.5.4 today, we’ve introduced the ultimate automated feature to make sure the customer service stays up and running, no matter what. The new “unfriend” feature ensures that two virtual servers that are unfriended will never be running on the same physical host server. This means that a partial failure of the cloud or data center infrastructure will NOT take out two redundant virtual servers. Combine this with the automatic restart facility of XCCS and the customer’s service doesn’t even sneeze at the failure.
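For readers who like to see the idea in code, here is a minimal sketch of an "unfriend" style placement rule. This is not the XCCS source; the function name, host names, and VM names are all hypothetical, and a real scheduler would also weigh memory, CPU, and storage.

```python
from typing import Dict, FrozenSet, List, Optional, Set


def pick_host(
    vm: str,
    hosts: List[str],
    placement: Dict[str, str],
    unfriended: Set[FrozenSet[str]],
) -> Optional[str]:
    """Return the first host running no VM that is unfriended with `vm`."""
    for host in hosts:
        residents = [v for v, h in placement.items() if h == host]
        if all(frozenset((vm, other)) not in unfriended for other in residents):
            return host
    return None  # no host satisfies the unfriend rule


# Hypothetical example: two redundant web servers that must stay apart.
hosts = ["host-a", "host-b"]
unfriended = {frozenset(("web1", "web2"))}
placement: Dict[str, str] = {}

for vm in ("web1", "web2", "db1"):
    host = pick_host(vm, hosts, placement, unfriended)
    if host is None:
        raise RuntimeError(f"no host available for {vm}")
    placement[vm] = host

print(placement)
```

In this toy run web1 and web2 land on different hosts, while db1 is free to share a host with either of them. If only one host is left standing, the rule refuses to put the second unfriended server beside the first rather than quietly breaking the redundancy.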

Want the ultimate in bulletproof web servers? Take two virtual server load balancers, set up heartbeat for failover, and then unfriend them. Add two (or more) virtual web servers and unfriend them. Now you have a self healing web server that will NOT go down for anything short of a total smoking hole disaster. Completely automatic disaster recovery, courtesy of the RELIABLE cloud. This is the way cloud computing should be.

Call me at 207-399-7108 or email me today for bulletproof virtual servers that you can count on to stay up and running through anything!

Vern


Extreme weather and the data center.


I’ve been sitting here this evening operating the data center under extreme weather protocols due to wild electrical storms and tornado warnings. I thought I’d take a few minutes and discuss how to protect a data center during extreme weather events.

Whether you subscribe to the idea of global warming or not, it’s apparent that this has already been a bumper year for violent weather. High winds, lightning, heavy rain, none of it is very conducive to keeping the data center up and operating. Obviously, being able to shut down is the best protection (this is where cloud computing really shines, the capability of moving services out of harm’s way), but what do you do when you can’t just shut it all down?

Here’s my weather protocol for tonight:

1. Identify critical services and the capacity needed to minimally run them. In this case, I was able to substantially reduce data center power load by shutting down redundant services and shutting down cloud computing capacity that wasn’t required to keep the critical services operating. Remember, reduced power load means extended run time on the backup power.

2. Transfer workloads to an alternate data center.

3. Reduce cooling capacity to reflect the lower data center power load (less load, more run time!). Ensure that there is no water or wind infiltration via cooling system intake vents. In my case, I change the free air cooling system to passive intake to avoid blowing in water.

4. Secure all windows and doors against high winds. If an area can’t be reasonably secured, such as an area with large, vulnerable, plate glass windows, secure inner doors to isolate the vulnerable area.

5. Reduce power equipment capacity equivalent to power load reduction. Open breakers or unjack equipment to isolate it from any damage from extreme power events, such as a close lightning hit on the AC commercial power.

6. Make sure that emergency supplies and emergency lighting are all up to par.

7. Know what to grab and take and how to secure the data center in case the situation is bad enough to require abandoning the data center.
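The "less load, more run time" point in steps 1 and 3 is easy to put in rough numbers. This is a minimal sketch that assumes an idealized backup plant where run time is simply usable stored energy divided by load; the energy and load figures are made up for illustration and are not our actual plant.

```python
def runtime_hours(usable_energy_wh: float, load_watts: float) -> float:
    """Idealized backup run time: usable stored energy divided by load."""
    if load_watts <= 0:
        raise ValueError("load must be positive")
    return usable_energy_wh / load_watts


USABLE_WH = 20_000      # hypothetical usable battery energy, in watt-hours
NORMAL_LOAD_W = 5_000   # hypothetical normal data center load
STORM_LOAD_W = 2_000    # hypothetical load after shedding non-critical gear

print(f"normal load:  {runtime_hours(USABLE_WH, NORMAL_LOAD_W):.1f} hours")
print(f"reduced load: {runtime_hours(USABLE_WH, STORM_LOAD_W):.1f} hours")
```

Real batteries don't behave this linearly, so treat the result as a rough planning estimate rather than a promise.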

My previous post on dealing with a data center flood also applies here.

Follow these protocols or use them as a starting point for your own and you’ll find that your data center can make it through almost anything Mother Nature can throw at you intact.

Email or call me or visit the SwiftWater Telecom web site for green data center services today.

Vern


#datacenter failures and restoring a cloud #cloudcomputing


I’ve been reading lately about the multiple recent failures that have plagued Amazon’s EC2 cloud computing service (4 in one week!). I’m not going to address the ongoing power reliability problems (I was amused to see a recent YouTube video touting how reliable Amazon is) but I’m going to talk about cloud failure recovery.

The failure modes in a cloud computing system can include inadvertent virtual machine shutdown, partial failure of connectivity or operation of the cloud (likely due to data center failures such as Amazon’s power failures), or total failure of the entire cloud (the smoking hole disaster). Failures will always happen but taking 7 hours to restore customer virtual machines (as Amazon has done) is pathetic.

In our own cloud, based on the excellent Xen Cloud Platform, we’ve created a fully automated front end package called Xen Cloud Control System. As part of XCCS, we’ve included 3 pieces that automate the response to these types of cloud failures.

The virtual machine watchdog finds and restarts virtual machines that get unintentionally shut down, automatically, without any need for a report from the customer. Remember, detecting and resolving a failure before the customer squawks is the best policy.

The host watchdog detects and clears failed host servers so that the virtual machine watchdog can restart the virtual machines. This gets the virtual machines restarted on another running host, instead of waiting for the failed host to be repaired as Amazon seems to require.

The final piece of the puzzle is the load balancer. The load balancer automatically readjusts the cloud, compensating for failures.

These 3 pieces not only keep the cloud running as smoothly as possible, they ensure the maximum time to restore customer virtual machines is 15 minutes as long as any of the cloud is still running. This is also all done automatically, no operator intervention required.
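To make the division of labor between the three pieces concrete, here is a toy sketch of one watchdog pass. It is not the XCCS code: the class, function names, and host and VM names are hypothetical, and picking the least loaded running host stands in for the real load balancer.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Cloud:
    host_up: Dict[str, bool]            # host name -> is it running?
    vm_host: Dict[str, Optional[str]]   # VM name -> host, or None if stopped


def host_watchdog(cloud: Cloud) -> List[str]:
    """Clear VMs off failed hosts so they become eligible for restart."""
    orphaned = []
    for vm, host in list(cloud.vm_host.items()):
        if host is not None and not cloud.host_up[host]:
            cloud.vm_host[vm] = None
            orphaned.append(vm)
    return orphaned


def least_loaded_host(cloud: Cloud) -> Optional[str]:
    """Stand-in for the load balancer: the running host with the fewest VMs."""
    running = [h for h, up in cloud.host_up.items() if up]
    if not running:
        return None
    load = {h: 0 for h in running}
    for host in cloud.vm_host.values():
        if host in load:
            load[host] += 1
    return min(running, key=lambda h: load[h])


def vm_watchdog(cloud: Cloud) -> List[str]:
    """Restart every stopped VM on whatever capacity is still running."""
    restarted = []
    for vm, host in list(cloud.vm_host.items()):
        if host is None:
            target = least_loaded_host(cloud)
            if target is None:
                break  # total failure: nothing left to restart on
            cloud.vm_host[vm] = target
            restarted.append(vm)
    return restarted


# Hypothetical partial failure: host h3 has died with two VMs on it.
cloud = Cloud(
    host_up={"h1": True, "h2": True, "h3": False},
    vm_host={"a": "h1", "b": "h3", "c": "h3"},
)
orphaned = host_watchdog(cloud)
restarted = vm_watchdog(cloud)
print("cleared from failed host:", orphaned)
print("restarted:", restarted, "->", cloud.vm_host)
```

The point of the split is that the virtual machine watchdog only needs to know "this VM should be running and isn't". The host watchdog turns a dead host into exactly that condition, so no one has to wait for the failed hardware to be repaired.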

I have no idea what the heck Amazon is doing, but if the worst happens, 15 minutes to restore service on a cloud beats the stuffing out of 7 hours.

Vern

Thursday data center tidbits.


This hasn’t been a good week for major Internet services. First up was Wikipedia going splat yesterday due to an overheating situation in the data center and then a failure of their failover procedure. I won’t start on the failure of their data center provider to detect and head off whatever infrastructure failure caused the overheating before it reached catastrophe point, but discovering your failover plan is garbage during said catastrophe is just pathetic.

Update: Yes, I will start on them. DCK reports that the EU data center that Wikipedia is in uses free air cooling. How in HECK do you have a major cooling failure when Mother Nature is doing the work?

Next up is the failure of YouTube’s main web site for approximately 90 minutes this morning. No word on the cause of this one yet but I think it must be something in the water.

Email or call me or visit the SwiftWater Telecom web site for cloud computing services and reliable green data center services (Mother Nature doesn’t go on strike on us!).

Vern


Tuesday data center tidbits.


Best cloud computing quote of the day, Tim O’Reilly on cloud computing:

“One of the things we really need to watch out for is that we don’t hold (cloud deployment) back because we have some storyline about how ‘that’s too dangerous.’ I’m getting old enough to remember when that’s what they said about giving people PCs”

Couldn’t have said it better myself (yes, I’m getting that old too!).

Next up is the story about Curtin University in Australia shutting down their computer facilities during a severe storm. Everyone wants to keep their computers running 24×7 but sometimes discretion is the better part of valor. We just went through that in the face of hurricane force winds, flooding rain, and major power failure (after migrating critical cloud computing services to our backup data center out of the storm’s path 🙂 ).

Email or call me or visit the SwiftWater Telecom web site for cloud computing services.

Vern


Avoiding data center disaster with cloud computing.


I’ve written previously on this blog about the theory of leveraging cloud computing to move virtual machines and critical computing services out of the way of natural disasters. Today I’m going to talk about doing it in real life.

Over Monday night of this week, 02/22/10, the location of our south data center began to see the effects of the recent winter storm that left 1 million people in the Northeast without power when all was said and done. Heavy rain was causing local flooding (nothing for the data center to worry about) and winds were gusting over 40mph. By Thursday, the continuous downpour was bringing extensive flooding and the wind was steadily increasing.

At 9:30pm on Thursday night, I got an alarm from the data center and immediately headed in. At this point, the downpour was sheeting sideways in a steady 60mph wind, gusting to hurricane strength at 90+mph. As I neared the data center, I could see that the entire industrial/commercial complex we’re in, right in the heart of the city, was totally dark. The power was in a severe brownout, with the 120VAC legs down to 35VAC (no wonder the BlackBerry was hollering at me!).

In light of the extreme weather conditions, and to prevent equipment damage from the electrical brownout, I decided to evacuate and shut down the south data center. The order of business was to transfer all critical cloud services to the backup cloud in our north data center (which was on the other side of the snow line and didn’t get the extreme winds): DNS first, then email, then critical customers, and finally backup email service for the customers that there simply wasn’t enough time to move across. After that, it was a nice orderly shutdown, which finished with the emergency lights running out of battery and leaving me in the dark.
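The order of business that night amounts to a priority list worked against a clock. Here is a small sketch of that idea; the service names, priorities, and move times are invented for illustration, and the real decisions were made by hand, not by a script.

```python
from typing import List, Tuple

# (name, priority, minutes to move); a lower priority number moves first.
Service = Tuple[str, int, int]


def plan_evacuation(
    services: List[Service], minutes_available: int
) -> Tuple[List[str], List[str]]:
    """Move services in priority order until the time budget runs out."""
    moved, left_behind = [], []
    remaining = minutes_available
    for name, _priority, minutes in sorted(services, key=lambda s: s[1]):
        if minutes <= remaining:
            moved.append(name)
            remaining -= minutes
        else:
            # Not enough time left: this one falls back to backup service.
            left_behind.append(name)
    return moved, left_behind


# Hypothetical inventory and time budget.
services = [
    ("customer-vm-17", 3, 20),
    ("dns", 1, 5),
    ("email", 2, 10),
    ("customer-vm-42", 3, 20),
]
moved, left_behind = plan_evacuation(services, minutes_available=40)
print("moved:", moved)
print("needs backup service:", left_behind)
```

Whatever lands in the second list is what needs a fallback such as backup email, which is exactly how the customers that couldn't be moved in time were covered.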

Friday was sort out the chaos time. Within 10 minutes of main power being restored, uplinks and internal network were running. Within 30 minutes, cloud servers were operational, critical services were being transferred back, and customer virtual machines were being restarted. Within 2 hours, all virtual machines and all co-located servers were fully restored and operating.

While I wasn’t able to keep all services running through the night, all critical data center services, email and backup email, and critical customer services continued to run, all without a hiccup, thanks to the ability to easily move virtual machines from one data center to the other.

The best kind of disaster is the one you can avoid!

Call or email me or visit the SwiftWater Telecom web site for hurricane resistant green data center and cloud computing services.

Vern
