Tag Archives: disaster recovery

What you REALLY should be asking your cloud computing provider.

Vern Burke, SwiftWater Telecom
Biddeford, ME

I’ve been reading quite a bit of back and forth over cloud computing provider Service Level Agreements and endless checklists that purport to be everything a customer should ask a potential cloud computing provider. There’s a lot of aimless noise and sniping back and forth that is missing a very critical point.

Let me start off by saying that, as an approach to ensuring service uptime, classic percentage-based SLAs are worthless. The SLA tells you nothing about whether you can expect your cloud computing provider to keep your service running or not, only the penalty if they fail. Your goal as a cloud computing customer isn’t to extract the maximum compensation for the failure of your cloud service, it’s to ensure that your cloud service doesn’t fail in the first place!

Failures of data center hardware are an inevitable fact of life, even with the most conscientious engineering and operation. The data center failures of the past year have shown that even the big data centers fall well short of the “most conscientious engineering and operation” goal. Given this fact of life, here are the things you should REALLY be asking your cloud computing provider.

1. Do they have the ability to automatically restart workloads on the remaining running physical cloud capacity if part of the capacity fails?

This is something that has stood out like a sore thumb with a lot of the big cloud providers. Failures of underlying physical hardware on Amazon’s AWS service kill the workloads that were running on that hardware, even though the vast majority of the capacity is still up and running in the cloud. If your cloud provider can’t automatically restart your workload in case of failure, run away.

In the SwiftWater Telecom cloud (using the Xen Cloud Control System) failed workloads are restarted automatically on remaining cloud capacity in minutes, not hours.

2. Do they offer a way to ensure that redundant virtual servers never end up on the same physical server?

It doesn’t do much good to have redundant virtual servers in the cloud if they all die when one single physical host dies.

In the SwiftWater Telecom cloud, we offer what we call the Ultra Server. The Ultra Server combines redundant virtual servers with the “unfriend” feature of XCCS. Unfriending ensures that the redundant servers never end up on the same physical server.

This means that nothing but a “smoking hole disaster” would ever cause a complete outage of an Ultra Server. Combine that with the automatic virtual server restoration of XCCS and the option to set up the Ultra Server between our separate data centers, and you have a cloud powered virtual server that is the next best thing to indestructible.
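To make the placement rule concrete, here is a minimal sketch of an anti-affinity ("unfriend") host picker. It is not the XCCS code; the function name `pick_host`, the host and VM names, and the least-loaded tie-break are all assumptions made for illustration.

```python
from typing import Dict, Optional, Set


def pick_host(vm: str,
              hosts: Dict[str, Set[str]],
              unfriends: Dict[str, Set[str]]) -> Optional[str]:
    """Return a host that runs none of vm's unfriended peers, or None.

    hosts maps a host name to the set of VM names currently on it.
    unfriends maps a VM name to the set of VMs it must never share a host with.
    """
    blocked = unfriends.get(vm, set())
    # Keep only hosts whose running VMs do not intersect the blocked set
    candidates = [h for h, running in hosts.items() if not (running & blocked)]
    if not candidates:
        return None
    # Assumption: prefer the least-loaded host, break ties by name
    return min(candidates, key=lambda h: (len(hosts[h]), h))


# Hypothetical three-host cloud with web1 already running on host-a
hosts = {"host-a": {"web1"}, "host-b": {"db1", "mail1"}, "host-c": set()}
unfriends = {"web1": {"web2"}, "web2": {"web1"}}
chosen = pick_host("web2", hosts, unfriends)
```

With these made-up inputs, web2 can never land on host-a. If every host already runs an unfriended peer, the sketch returns None rather than breaking the rule.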

Stop concentrating on the penalties for failure. Ask the right questions to find the cloud computing provider who has the right tools to keep your service from failing in the first place.

High availability, bulletproofing the cloud.

This year has seen some very high profile failures of major cloud computing providers. One of the things that stands out for me in these is an almost total inability to restore running customer workloads without major amounts of manual intervention, usually by the customer themselves.

Silly human caused outages aside, data center infrastructure is almost guaranteed to suffer outages, despite the best efforts to the contrary. Physical equipment is fallible, all the way from the data center backup power generators to the fanciest server. This is magnified a lot by cloud computing, since the same infrastructure supports 10-20 times the number of customers.

What’s giving cloud computing a bad name for reliability isn’t the failures, it’s the lousy response to them. Taking hours or even days to restore customer workloads when the cloud provider still has operating capacity (partial failure) is purely ridiculous. Expecting customers to monitor their virtuals themselves and deal with manually restarting them after a failure is guaranteed to make people even unhappier. This doesn’t even take into account the irritation at the service going down in the first place. I think there’s a WAY better way to handle this.

For quite a while now, our open source project, Xen Cloud Control System for the excellent Xen Cloud Platform, has featured the ability to automatically recover from most cloud failures. Your virtual machine stops running? XCCS automatically restarts it. A physical cloud host fails? XCCS restarts everything that was running on it on the other hosts. No muss, no fuss.

With the release of XCCS ver 0.5.4 today, we’ve introduced the ultimate automated feature to make sure the customer service stays up and running, no matter what. The new “unfriend” feature ensures that two virtual servers that are unfriended will never be running on the same physical host server. This means that a partial failure of the cloud or data center infrastructure will NOT take out two redundant virtual servers. Combine this with the automatic restart facility of XCCS and the customer’s service doesn’t even sneeze at the failure.

Want the ultimate in bulletproof web servers? Take two virtual server load balancers, set up heartbeat for failover, and then unfriend them. Add two (or more) virtual web servers and unfriend them. Now you have a self healing web server that will NOT go down for anything short of a total smoking hole disaster. Completely automatic disaster recovery, courtesy of the RELIABLE cloud. This is the way cloud computing should be.
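One way to sanity check a layout like this is to ask whether any single physical host failure can wipe out a whole tier. The sketch below is only an illustration; the function name, the tier names, and the host names are hypothetical and not part of XCCS.

```python
from typing import Dict, List


def survives_any_single_host_failure(placement: Dict[str, str],
                                     tiers: Dict[str, List[str]]) -> bool:
    """True if no single host failure takes down every member of any tier.

    placement maps a VM name to the host it runs on.
    tiers maps a tier name (e.g. "lb", "web") to its redundant VMs.
    """
    for failed in set(placement.values()):
        for members in tiers.values():
            # A tier is lost if all of its VMs sit on the failed host
            if all(placement[vm] == failed for vm in members):
                return False
    return True


tiers = {"lb": ["lb1", "lb2"], "web": ["web1", "web2"]}
# Unfriended pairs split across hosts
good = {"lb1": "host-a", "lb2": "host-b", "web1": "host-a", "web2": "host-b"}
# Both load balancers on one host: a single failure kills the front end
bad = {"lb1": "host-a", "lb2": "host-a", "web1": "host-a", "web2": "host-b"}
```

The first layout passes the check and the second fails it, which is exactly the situation unfriending is meant to rule out.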

Call me at 207-399-7108 or email me today for bulletproof virtual servers that you can count on to stay up and running through anything!

Vern

swiftwater telecom rcs cloud computing logo

Extreme weather and the data center.

I’ve been sitting here this evening operating the data center under extreme weather protocols due to wild electrical storms and tornado warnings. I thought I’d take a few minutes and discuss how to protect a data center during extreme weather events.

Whether you subscribe to the idea of global warming or not, it’s apparent that this has already been a bumper year for violent weather. High winds, lightning, heavy rain, none of it is very conducive to keeping the data center up and operating. Obviously, being able to shut down is the best protection (this is where cloud computing really shines, the capability of moving services out of harm’s way), but what do you do when you can’t just shut it all down?

Here’s my weather protocol for tonight:

1. Identify critical services and the capacity needed to minimally run them. In this case, I was able to substantially reduce data center power load by shutting down redundant services and shutting down cloud computing capacity that wasn’t required to keep the critical services operating. Remember, reduced power load means extended run time on the backup power.

2. Transfer workloads to an alternate data center.

3. Reduce cooling capacity to reflect the lower data center power load (less load, more run time!). Ensure that there is no water or wind infiltration via cooling system intake vents. In my case, I change the free air cooling system to passive intake to avoid blowing in water.

4. Secure all windows and doors against high winds. If an area can’t be reasonably secured, such as an area with large, vulnerable, plate glass windows, secure inner doors to isolate the vulnerable area.

5. Reduce power equipment capacity equivalent to power load reduction. Open breakers or unjack equipment to isolate it from any damage from extreme power events, such as a close lightning hit on the AC commercial power.

6. Make sure that emergency supplies and emergency lighting are all up to par.

7. Know what to grab and take and how to secure the data center in case the situation is bad enough to require abandoning the data center.
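The "less load, more run time" point from steps 1 and 3 can be put into rough numbers. This is a back-of-the-envelope sketch only: the battery size, the loads, and the 90% conversion efficiency are invented for illustration, and real batteries deliver less usable energy at high discharge rates than this linear estimate suggests.

```python
def runtime_hours(battery_wh: float, load_w: float, efficiency: float = 0.9) -> float:
    """Linear estimate of backup run time: usable energy divided by load."""
    if load_w <= 0:
        raise ValueError("load must be positive")
    return battery_wh * efficiency / load_w


# Hypothetical 20 kWh battery plant
full_load = runtime_hours(20_000, 8_000)     # everything running
storm_load = runtime_hours(20_000, 3_000)    # redundant services shut down
```

With these assumed figures, cutting the load from 8 kW to 3 kW stretches the estimate from roughly 2.25 hours to roughly 6 hours.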

My previous post on dealing with a data center flood applies to this as well.

Follow these protocols or use them as a starting point for your own and you’ll find that your data center can make it through almost anything Mother Nature can throw at you intact.

Email or call me or visit the SwiftWater Telecom web site for green data center services today.

Vern


#datacenter failures and restoring a cloud #cloudcomputing

I’ve been reading lately about the multiple recent failures that have plagued Amazon’s EC2 cloud computing service (4 in one week!). I’m not going to address the ongoing power reliability problems (I was amused to see a recent YouTube video touting how reliable Amazon is) but I’m going to talk about cloud failure recovery.

The failure modes in a cloud computing system can include inadvertent virtual machine shutdown, partial failure of connectivity or operation of the cloud (likely due to data center failures such as Amazon’s power failures), or total failure of the entire cloud (the smoking hole disaster). Failures will always happen but taking 7 hours to restore customer virtual machines (as Amazon has done) is pathetic.

In our own cloud, based on the excellent Xen Cloud Platform, we’ve created a fully automated front end package called Xen Cloud Control System. As part of XCCS, we’ve included 3 pieces that automate the response to these types of cloud failures.

The virtual machine watchdog finds and restarts virtual machines that get unintentionally shut down, automatically, without any need for a report from the customer. Remember, detecting and resolving a failure before the customer squawks is the best policy.

The host watchdog detects and clears failed host servers so that the virtual machine watchdog can restart the virtual machines. This gets the virtual machines restarted on another running host, instead of waiting for the failed host to be repaired as Amazon seems to require.

The final piece of the puzzle is the load balancer. The load balancer automatically readjusts the cloud, compensating for failures.

These 3 pieces not only keep the cloud running as smoothly as possible, they ensure the maximum time to restore customer virtual machines is 15 minutes as long as any of the cloud is still running. This is also all done automatically, no operator intervention required.
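The way the three pieces fit together can be sketched as a single sweep over the cloud. This is a simplified illustration and not the XCCS implementation: the `VM` class, the `should_run` flag (my assumption for telling an unintended shutdown from a deliberate one), and the count-based balancing are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class VM:
    name: str
    host: Optional[str]       # None means not placed on any host
    running: bool
    should_run: bool = True   # assumed flag: False if shut down on purpose


def watchdog_pass(vms: List[VM], host_alive: Dict[str, bool]) -> List[str]:
    """One sweep: clear failed hosts, then restart VMs that should be running."""
    actions = []
    # Host watchdog: detach VMs from hosts that are no longer alive
    for vm in vms:
        if vm.host is not None and not host_alive.get(vm.host, False):
            actions.append(f"cleared {vm.name} from failed host {vm.host}")
            vm.host, vm.running = None, False
    live = [h for h, ok in host_alive.items() if ok]
    # VM watchdog: restart anything that should be up but is not
    for vm in vms:
        if not vm.should_run or vm.running:
            continue
        if not live:
            actions.append(f"no capacity for {vm.name}")
            continue
        if vm.host is None:
            # Load balancing (assumed policy): fewest running VMs wins
            load = {h: sum(1 for v in vms if v.host == h and v.running)
                    for h in live}
            vm.host = min(live, key=lambda h: (load[h], h))
        vm.running = True
        actions.append(f"restarted {vm.name} on {vm.host}")
    return actions


# Hypothetical cloud: host h1 has died, and VM c stopped unexpectedly on h2
demo = [VM("a", "h1", True), VM("b", "h1", True), VM("c", "h2", False)]
log = watchdog_pass(demo, {"h1": False, "h2": True, "h3": True})
```

Run on a timer, a sweep like this bounds the restore time by the sweep interval plus the VM boot time, with no customer report and no operator needed.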

I have no idea what the heck Amazon is doing, but if the worst happens, 15 minutes to restore service on a cloud beats the stuffing out of 7 hours.

Vern

Thursday data center tidbits.

This hasn’t been a good week for major Internet services. First up was Wikipedia going splat yesterday due to an overheating situation in the data center and then a failure of their failover procedure. I won’t start on the failure of their data center provider to detect and head off whatever infrastructure failure caused the overheating before it reached catastrophe point, but discovering your failover plan is garbage during said catastrophe is just pathetic.

Update: Yes, I will start on them. DCK reports that the EU data center that Wikipedia is in uses free air cooling. How in HECK do you have a major cooling failure when Mother Nature is doing the work?

Next up is the failure of YouTube’s main web site for approximately 90 minutes this morning. No word on the cause of this one yet but I think it must be something in the water.

Email or call me or visit the SwiftWater Telecom web site for cloud computing services and reliable green data center services (Mother Nature doesn’t go on strike on us!).

Vern


Tuesday data center tidbits.

Best cloud computing quote of the day, Tim O’Reilly on cloud computing:

“One of the things we really need to watch out for is that we don’t hold (cloud deployment) back because we have some storyline about how ‘that’s too dangerous.’ I’m getting old enough to remember when that’s what they said about giving people PCs”

Couldn’t have said it better myself (yes, I’m getting that old too!).

Next up is the story about Curtin University in Australia shutting down their computer facilities during a severe storm. Everyone wants to keep their computers running 24×7 but sometimes discretion is the better part of valor. We just went through that in the face of hurricane force winds, flooding rain, and major power failure (after migrating critical cloud computing services to our backup data center out of the storm’s path :) ).

Email or call me or visit the SwiftWater Telecom web site for cloud computing services.

Vern


Avoiding data center disaster with cloud computing.

I’ve written previously on this blog about the theory of leveraging cloud computing to move virtual machines and critical computing services out of the way of natural disasters. Today I’m going to talk about doing it in real life.

Over Monday night of this week, 02/22/10, the location of our south data center began to see the effects of the recent winter storm that left 1 million people in the Northeast without power when all was said and done. Heavy rain was causing local flooding (nothing for the data center to worry about) and the wind was gusting over 40mph. By Thursday, the continuous downpour was bringing extensive flooding and the wind was steadily increasing.

At 9:30pm on Thursday night, I got an alarm from the data center and immediately headed in. At this point, the downpour was sheeting sideways in steady 60mph wind, gusting to hurricane strength 90+mph. As I neared the data center, I could see the entire industrial/commercial complex that we’re in was totally dark, right in the heart of the city, total power brownout (120VAC legs were down to 35VAC) (no wonder the BlackBerry was hollering at me!).

In light of the extreme weather conditions and to prevent equipment damage from the electrical brownout, I decided to evacuate and shut down the south data center. The order of business was to transfer all critical cloud services to the backup cloud in our north data center (which was on the other side of the snow line and didn’t get the extreme winds): DNS, email, critical customers, and then backup email service for the customers that there simply wasn’t enough time to move across. After that, it was a nice orderly shutdown, which finished with the emergency lights running out of battery and leaving me in the dark.
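That order of business amounts to a priority-ordered move list worked against a time budget. The sketch below shows the idea only; the service names, priorities, move times, and the 35 minute budget are invented and are not figures from that night.

```python
from typing import List, Tuple


def plan_evacuation(services: List[Tuple[str, int, float]],
                    budget_min: float) -> Tuple[List[str], List[str]]:
    """Split services into (moved, left_behind) within a time budget.

    services holds (name, priority, estimated_minutes_to_move);
    a lower priority number moves first.
    """
    moved, left = [], []
    remaining = budget_min
    for name, _priority, minutes in sorted(services, key=lambda s: s[1]):
        if minutes <= remaining:
            moved.append(name)
            remaining -= minutes
        else:
            # Not enough time: this one falls back to backup services
            left.append(name)
    return moved, left


# Hypothetical inventory
services = [
    ("customer-vm-42", 3, 20),
    ("dns", 0, 5),
    ("email", 1, 10),
    ("critical-customer", 2, 15),
]
moved, left = plan_evacuation(services, budget_min=35)
```

Anything that ends up in the left-behind list is a candidate for the backup email treatment described above. Note that this greedy pass will still move a small low-priority service after skipping a large one, which may or may not be what you want.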

Friday was sort out the chaos time. Within 10 minutes of main power being restored, uplinks and internal network were running. Within 30 minutes, cloud servers were operational, critical services were being transferred back, and customer virtual machines restarted. Within 2 hours, all virtual machines, all co-located servers were fully restored and operating.

While I wasn’t able to keep all services running through the night, all critical data center services, email and backup email, and critical customer services continued to run, all without a hiccup, thanks to the ability to easily move virtual machines from one data center to the other.

The best kind of disaster is the one you can avoid!

Call or email me or visit the SwiftWater Telecom web site for hurricane resistant green data center and cloud computing services.

Vern


Announcement: Making reliable cloud computing MORE reliable!

Have you ever accidentally shut down a virtual or co-located server in a distant data center and then had to wait on hold listening to bad muzak for someone to turn it back up? Or maybe you’re one of the unfortunate ones that suffered through recent large scale outages at major cloud computing providers and then had to wait for hours or longer to get your virtual machine back up and running?

Well you don’t have to suffer with that anymore! SwiftWater Telecom is proud to announce the addition of the virtual machine watchdog service to our Aurora Resilient Cloud Service (powered by Xen Cloud Platform)!

Accidental shutdown? The watchdog restarts your virtual server.

Server failure? The watchdog restarts your virtual server.

Facility outage? When the facility is back, your virtual server is back, in minutes, not endless hours.

When the unexpected happens, depend on SwiftWater Telecom to minimize the impact to you, your business, and your customers. All that and no muzak!

Call or email me or visit the SwiftWater Telecom web site for green data center and cloud computing services minus the hype.

Vern


5 steps to unfreezing the frozen data center.

Tonight I was reading about DC area data center restoration after the recent major snow storm. Would you know how to unfreeze your data center?

So, you finally manage to make it through the drifts with snowshoes and sled dogs, only to find a data center that’s dark, silent, and cold as a meat locker. How do you get going again without damaging anything worse than it already is?

1. Disconnect all power to the facility. Try to restore too much at once and you’ll be blowing breakers or damaging equipment if the commercial power is still unstable.

2. Restore major transformers first, then UPS or DC power plants, these are the major draw on the commercial power. Do not restore anything else until the UPS batteries or DC power plant battery string are fully charged. Remember, there are 3 parts to the data center power facilities: commercial AC, battery backup, and generators. Don’t start anything unless 2 of the 3 are available.

3. Before warming the data center up, inspect all weather exposed equipment (cooling, generators) to make sure there’s no snow infiltration and all vents, intakes, and exhausts are clear. Snow will clear a LOT easier when it’s powdery and dry, let it turn to water and you’re going to have a real problem. Apply power to a piece of mechanical equipment that has moving parts frozen or obstructed by snow and you’re going to have a worse problem. It’s also a good idea to avoid sucking snow inside via the cooling system.

4. Get the heat going. Make sure mechanical equipment is up to a reasonable temperature before energizing. Make sure generator starting batteries are warm and charged. Once the generators will run, then you can start restoring the IT equipment.

5. Start restoring equipment from the edge in (uplink equipment, edge routers first). Take the equipment one step at a time, shut down and come back to or diagnose anything that doesn’t start immediately. Never leave a piece of malfunctioning equipment powered up, you could end up with more heat than you expect when it toasts. Gradual restoration will also take it easy on the power systems.
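Two of the rules above are easy to turn into checks you could put in a restoration runbook: the 2-of-3 power rule from step 2 and the edge-in ordering from step 5. This is a minimal sketch; the function names, the equipment names, and the "hops from the edge" number used for ordering are assumptions for illustration.

```python
from typing import List, Tuple


def safe_to_start(commercial_ac: bool, battery_charged: bool,
                  generator_ready: bool) -> bool:
    """Step 2's rule of thumb: at least 2 of the 3 power sources available."""
    return sum([commercial_ac, battery_charged, generator_ready]) >= 2


def restore_order(equipment: List[Tuple[str, int]]) -> List[str]:
    """Order equipment from the edge in.

    equipment holds (name, hops_from_edge); 0 means uplink or edge gear.
    """
    return [name for name, _hops in sorted(equipment, key=lambda e: (e[1], e[0]))]


# Hypothetical inventory
gear = [("db-server", 3), ("edge-router", 0), ("core-switch", 1), ("uplink-mux", 0)]
order = restore_order(gear)
```

In this made-up inventory the edge router and uplink gear come up first and the database server last. The power check only encodes the 2-of-3 rule; the battery charge and generator warm-up conditions in steps 2 and 4 still apply on top of it.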

Obviously this isn’t an exhaustive list of everything that would need attention in a frozen data center. Frozen pipes are a whole other chapter (it’s a good idea to shut the water off to frozen pipes, you don’t need a flood in the data center when they thaw with breaks in them).

Here’s hoping you never need to do this, but if you do, these 5 steps will help you get thawed out without making anything worse.

Call or email me today for data center engineering, operations, DC power plant engineering, or operations services for your data center!

Vern


The virtual data center, dealing with a cloud overload.

This morning I was reading about gracefully degrading Web 2.0 applications in response to server overload issues. This is an important point to maintain your customer’s user experience, but it also ties in to the ability of the dedicated or cloud computing virtual servers to adapt to unforeseen conditions.

There are any number of unpredictable things that can negatively impact the experience of customers using your web site, web app, or other Internet based service. Servers can fail, network links can fail, demand can spike drastically due to totally unpredictable influences, and, if you’re not careful, small failures can cascade into large ones. The important thing is to make the service resilient enough to either continue to operate through the failure or demand spike or to be flexible enough to be rearranged to adapt to the situation.

In general, I believe that having things operating in a degraded state is a far better response than having things unavailable at all. Customers hate uncertainty and an unavailable service is the height of uncertainty. Is the service down? Is it coming back? Is it a problem on the customer end?

One method of dealing with this is the way Google handles the routers that feed their Gmail service. When a router becomes slower than a set benchmark, it sheds its load to the other routers and marks the path “do not use”. This is a dangerous response, since shedding the load to other facilities can result in a runaway failure cascade, taking down path after path at increasing speed, until every path is offline from being overloaded as happened to Google in recent failures of its Gmail service. Now you not only have a broken service, you have a nightmare to restore the service without triggering the same cascade all over again.
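A toy model shows how little it takes to tip this kind of load shedding into a runaway cascade. This is my own simplified model, not a description of Google's actual routing: it assumes the total load is split evenly across the paths still online, and that a path drops out for good once its share exceeds its capacity. All the numbers are made up.

```python
from typing import List


def simulate_cascade(capacities: List[float], total_load: float) -> int:
    """Return how many paths are still online once shedding settles."""
    online = list(capacities)
    while online:
        share = total_load / len(online)
        # Paths whose share exceeds capacity shed their load and go offline
        survivors = [c for c in online if c >= share]
        if len(survivors) == len(online):
            break
        online = survivors
    return len(online)


# Four healthy paths plus one degraded path (hypothetical units)
paths = [100, 100, 100, 100, 60]
stable = simulate_cascade(paths, 380)
collapsed = simulate_cascade(paths, 420)
```

At a load of 380 the weak path drops out and the other four absorb its share. At 420, only about 10% more, the weak path drops out, the remaining four are each pushed over their limit, and nothing is left online.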

The first response to overload should be to shed non-critical loads. It’s better to have individual features unavailable (with an appropriate message to the customer so they know what’s going on) while protecting the operation of the core services. This can even be done automatically on the server, based on error logs and load measurements.
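Shedding non-critical load automatically can be as simple as a table of features and the load level at which each one is switched off. The sketch below is an illustration under assumptions: the feature names and thresholds are hypothetical, and "load" is taken to be a single number between 0 and 1 from whatever measurement you trust.

```python
from typing import Dict, List


def features_to_disable(load: float, thresholds: Dict[str, float]) -> List[str]:
    """Return the non-critical features to switch off at the given load.

    thresholds maps a feature name to the load above which it is shed.
    Core services are simply never listed, so they are never shed.
    """
    return sorted(name for name, limit in thresholds.items() if load > limit)


def customer_notice(disabled: List[str]) -> str:
    """Build the message that tells customers what is going on."""
    if not disabled:
        return ""
    return "Temporarily unavailable due to high demand: " + ", ".join(disabled)


# Hypothetical feature table
thresholds = {"recommendations": 0.70, "search-suggest": 0.80, "avatars": 0.90}
disabled = features_to_disable(0.85, thresholds)
notice = customer_notice(disabled)
```

At a measured load of 0.85, this table sheds recommendations and search suggestions while avatars and the core service keep running, and the customer gets a plain explanation instead of a mystery.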

The second response is where the flexibility of cloud computing shines. Starting new virtual servers on a cloud is easy and fast, and overloads can be diverted to them in minutes. The capability of cloud computing to absorb and quickly recover from failures also dramatically reduces the potential impact to the site, especially if that site is built to take advantage of the flexibility of cloud computing with our high reliability web site package.

Remember, keeping some of the lights on is better than leaving your customers in the dark!

Cloud computing services

Vern
