Tag Archives: reliability

What you REALLY should be asking your cloud computing provider.


Vern Burke, SwiftWater Telecom
Biddeford, ME

I’ve been reading quite a bit of back and forth over cloud computing provider Service Level Agreements, and endless checklists that purport to cover everything a customer should ask a potential cloud computing provider. There’s a lot of aimless noise and sniping in that debate that misses a very critical point.

Let me start off by saying that, as an approach to ensuring service uptime, classic percentage-based SLAs are worthless. The SLA tells you nothing about whether you can expect your cloud computing provider to keep your service running, only the penalty if they fail. Your goal as a cloud computing customer isn’t to extract the maximum compensation for the failure of your cloud service, it’s to ensure that your cloud service doesn’t fail in the first place!

Failures of data center hardware are an inevitable fact of life, even with the most conscientious engineering and operation. The data center failures of the past year have shown that even the big data centers fall well short of the “most conscientious engineering and operation” goal. Given this fact of life, here are the things you should REALLY be asking your cloud computing provider.

1. Do they have the ability to automatically restart workloads on the remaining running physical cloud capacity if part of the capacity fails?

This is something that has stood out like a sore thumb with a lot of the big cloud providers. Failures of underlying physical hardware on Amazon’s AWS service kill the workloads that were running on that hardware, even though the vast majority of capacity is still up and running in the cloud. If your cloud provider can’t automatically restart your workload in case of failure, run away.

In the SwiftWater Telecom cloud (using the Xen Cloud Control System), failed workloads are restarted automatically on the remaining cloud capacity in minutes, not hours.
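
As a rough illustration, the recovery logic boils down to something like the sketch below. This is hypothetical Python for illustration only, not the actual XCCS implementation; the data structures, helper callbacks, and polling approach are all assumptions.

```python
"""Hypothetical sketch of an automatic VM failover loop.

NOT the actual XCCS implementation; the data structures, helper
callbacks, and polling approach are illustrative assumptions only.
"""

import time
from typing import Callable, Dict, Set


def failover_loop(
    hosts: Set[str],
    placements: Dict[str, str],              # VM name -> host it was last running on
    host_is_alive: Callable[[str], bool],    # health check supplied by the controller
    restart_vm: Callable[[str, str], None],  # (vm, target_host) -> restart VM there
    poll_seconds: int = 30,
) -> None:
    while True:
        alive = {h for h in hosts if host_is_alive(h)}
        for vm, host in list(placements.items()):
            if host in alive or not alive:
                continue  # the VM's host is healthy, or there's nothing left to fail over to
            # Restart the orphaned VM on the least loaded surviving host.
            target = min(alive, key=lambda h: list(placements.values()).count(h))
            restart_vm(vm, target)
            placements[vm] = target
        time.sleep(poll_seconds)  # a production controller would be event driven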

2. Do they offer a way to ensure that redundant virtual servers never end up on the same physical server?

It doesn’t do much good to have redundant virtual servers in the cloud if they all die when one single physical host dies.

In the SwiftWater Telecom cloud, we offer what we call the Ultra Server. The Ultra Server combines redundant virtual servers with the “unfriend” feature of XCCS. Unfriending ensures that the redundant servers never end up on the same physical server.

This means that nothing but a “smoking hole disaster” would ever cause a complete outage of an Ultra Server. Combine that with the automatic virtual server restoration of XCCS and the option to set up the Ultra Server across our separate data centers, and you have a cloud-powered virtual server that is the next best thing to indestructible.
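
For anyone curious how an “unfriend” rule plays out at placement time, here’s a minimal sketch. This is illustrative Python only, not the actual XCCS implementation; the VM names and data structures are assumptions.

```python
"""Rough sketch of an anti-affinity ("unfriend") check at placement time.

Not the actual XCCS implementation; the data structures and names are assumptions.
"""

from typing import Dict, List, Set


def eligible_hosts(
    vm: str,
    hosts: List[str],
    running_on: Dict[str, Set[str]],  # host -> VMs currently running on it
    unfriends: Dict[str, Set[str]],   # VM -> VMs it must never share a host with
) -> List[str]:
    """Return only the hosts that don't already run a VM unfriended with `vm`."""
    banned = unfriends.get(vm, set())
    return [h for h in hosts if not (running_on.get(h, set()) & banned)]


# Example: web1 and web2 are unfriended, so web2 may only be placed on host-b.
unfriends = {"web1": {"web2"}, "web2": {"web1"}}
running = {"host-a": {"web1"}, "host-b": set()}
print(eligible_hosts("web2", ["host-a", "host-b"], running, unfriends))  # ['host-b']
```

The same check has to apply when VMs are restarted after a host failure, which is why unfriending and automatic restart work together rather than separately.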

Stop concentrating on the penalties for failure; ask the right questions to find the cloud computing provider who has the right tools to keep your service from failing in the first place.

High availability, bulletproofing the cloud.


This year has seen some very high profile failures of major cloud computing providers. One of the things that stands out for me in these is an almost total inability to restore running customer workloads without major amounts of manual intervention, usually by the customer themselves.

Silly human-caused outages aside, data center infrastructure is almost guaranteed to suffer outages, despite the best efforts to the contrary. Physical equipment is fallible, all the way from the data center backup power generators to the fanciest server. This is magnified a lot by cloud computing, since the same infrastructure supports 10-20 times the number of customers.

What’s giving cloud computing a bad name for reliability isn’t the failures, it’s the lousy response to them. Taking hours or even days to restore customer workloads when the cloud provider still has operating capacity (a partial failure) is purely ridiculous. Expecting customers to monitor their virtual servers themselves and deal with manually restarting them after a failure is guaranteed to make people even unhappier. This doesn’t even take into account the irritation at the service going down in the first place. I think there’s a WAY better way to handle this.

For quite a while now, our open source project, Xen Cloud Control System for the excellent Xen Cloud Platform, has featured the ability to automatically recover from most cloud failures. Your virtual machine stops running? XCCS automatically restarts it. A physical cloud host fails? XCCS restarts everything that was running on it on the other hosts. No muss, no fuss.

With the release of XCCS version 0.5.4 today, we’ve introduced the ultimate automated feature to make sure the customer’s service stays up and running, no matter what. The new “unfriend” feature ensures that two virtual servers that are unfriended will never be running on the same physical host server. This means that a partial failure of the cloud or data center infrastructure will NOT take out two redundant virtual servers. Combine this with the automatic restart facility of XCCS and the customer’s service doesn’t even sneeze at the failure.

Want the ultimate in bulletproof web servers? Take two virtual server load balancers, set up heartbeat for failover, and then unfriend them. Add two (or more) virtual web servers and unfriend them as well. Now you have a self-healing web server that will NOT go down for anything short of a total smoking hole disaster. Completely automatic disaster recovery, courtesy of the RELIABLE cloud. This is the way cloud computing should be.
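
To make the idea concrete, here’s a minimal sketch of what the standby half of that load balancer pair has to do. It’s a stand-in for a real heartbeat/failover tool, not a recipe we ship; the addresses, interface name, and takeover command are illustrative assumptions.

```python
"""Minimal sketch of the standby half of a redundant load balancer pair.

A stand-in for a real heartbeat/failover tool; the addresses, interface
name, and takeover command are illustrative assumptions only.
"""

import socket
import subprocess
import time

PRIMARY = "10.0.0.10"      # primary load balancer (assumed address)
CHECK_PORT = 80            # port the primary should be answering on
SERVICE_IP = "10.0.0.100"  # shared address the web service lives behind
FAILURE_LIMIT = 3          # consecutive failed checks before taking over


def primary_is_up(timeout: float = 2.0) -> bool:
    try:
        with socket.create_connection((PRIMARY, CHECK_PORT), timeout=timeout):
            return True
    except OSError:
        return False


def take_over() -> None:
    # Claim the shared service IP on this standby node (illustrative command).
    subprocess.run(["ip", "addr", "add", f"{SERVICE_IP}/24", "dev", "eth0"], check=False)


def main() -> None:
    failures = 0
    while True:
        failures = 0 if primary_is_up() else failures + 1
        if failures >= FAILURE_LIMIT:
            take_over()
            break
        time.sleep(5)


if __name__ == "__main__":
    main()
```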

Call me at 207-399-7108 or email me today for bulletproof virtual servers that you can count on to stay up and running through anything!

Vern


The data center in review, top 10 bozos of the year, 2009!


My coined term from my data center analysis and commentary this year: bozosity. Bozosity is a condition brought on by an imbalance of invisible particles known as bozons. This condition causes otherwise competent and sensible people to do incomprehensibly boneheaded things.

The Winners of the 2009 Data Center Bozo Awards are:

1. Microsoft and Danger for the T-Mobile Sidekick data loss debacle. MicroDanger did not win for operating critical storage systems without backups, but for the handling of the aftermath. MicroDanger immediately announced that all data was lost; by the time they did recover most of the data, significant damage had been done to T-Mobile and the Sidekick, leaving everyone involved with a reputation for incompetence.

2. Fisher Plaza for knocking out major Internet services by blowing up an antiquated, obsolete, and improperly maintained electrical system in their data center building. Aluminum bus bars are evil, k?

3. IBM for blowing Air New Zealand out of the water by performing power work during the peak load period of one of Air New Zealand’s busiest travel weekends, unnecessarily running ANZ’s mainframe from a fallible generator alone, and taking an inordinate amount of time to restore service.

4. IBM for allowing a state of Texas elections commission storage system in their care to fail because it wasn’t in the contract to back it up.

5. Google for their brilliant example of cascading failure by sequentially overloading every router feeding their Gmail service.

6. Research in Motion for seeing how many BlackBerry back end failures they could squeeze in before the end of the year.

7. Amazon, Rackspace, Google, and a number of other providers who managed to blacken the term cloud computing with multiple reliability problems, most of which were self-inflicted. Thanks a heap, guys.

8. DreamHost for giving us a shining example of how NOT to do a major data center migration.

9. The people who operate Sweden’s top level DNS domain for turning loose an untested script and taking the entire thing down. Who knew a few missing dots could be so much trouble?

10. The Perth iX data center in Western Australia for allowing a smoldering mulch bed outside the building to shut down the entire data center because they couldn’t locate the minuscule amount of smoke that was infiltrating the building and setting off an overly sensitive detection system.

Finally, I’d like to add a “dishonorable mention” award to FedEx for turning overnight delivery of a critical part into 3 days and nearly sticking me with working in the data center overnight on Christmas Eve.

Looks like we survived the year but it sure wasn’t pretty.

Vern, SwiftWater Telecom

The reliable data center cloud.


Today’s post comes from reading an article about cloud computing and data center efficiency. I’m going to talk about one of the issues it touched on: reliability.

The cloud implementations of some of the early major players have had large and public failures, in some cases repeated. The distributed nature of the data center cloud lends itself to a number of fairly simple steps to ensure reliability.

So, how do we help prevent data center cloud burps? The first thing is to segment the infrastructure the cloud depends on. Split power feeds between fairly small segments of the cloud to reduce the impact of power problems.

Segment the cloud network and provide redundant network paths, especially in the storage network. The storage network is the heart of the cloud and it has to be as rock solid and fault tolerant as possible.

Cooling is another area to segment. Is there a pattern showing here? The secret to data center cloud reliability is to segment so that any infrastructure failure impacts as small a piece of the cloud as possible. Minimizing the impact makes it far easier to restore the cloud to full operation after a failure. Operating degraded is also far better than not operating at all!

The other advantage of segmenting is to reduce the possibility of cascading failures. Since the cloud is so interconnected, any failure that ripples through it could be catastrophic. Now you’ve gone from an almost unnoticeable impact to a major impact and an extensive restore time.
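
To put rough numbers on it (all of them made up purely for illustration), the payoff of segmentation is easy to see: the smaller the slice of the cloud behind any one power feed, the smaller the worst-case hit when that feed fails.

```python
"""Back-of-the-envelope illustration of segmentation and blast radius.
Every number here is made up for the sake of the example."""

TOTAL_HOSTS = 200  # assumed size of the cloud

for segments in (1, 4, 20):
    hosts_lost = TOTAL_HOSTS / segments
    impact_pct = 100.0 * hosts_lost / TOTAL_HOSTS
    print(f"{segments:>2} power segments: losing one feed takes out "
          f"{hosts_lost:.0f} hosts ({impact_pct:.0f}% of the cloud)")
```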

The nature of the data center cloud not only increases energy efficiency but also “should” benefit reliability. Individual failures will still happen, but the overall effect on the cloud will be limited.

So, why the high profile cloud failures? Lack of storage redundancy, designs that invite cascading failures (multiple Google outages), and inviting human error into things like unsegmented power (short a PDU and half the data center chokes).

Sometimes, the best even the big boys can do is serve as a bad example.

Vern, SwiftWater Telecom

IBM data center performs face plant in New Zealand


IBM has turned in another pathetic performance, blowing Air New Zealand’s computer systems right out of the water. The inability of data center providers to keep the lights on is getting ridiculous.

According to the stories, the root cause of the outage was a generator failure during maintenance work. Now, it’s one thing to have an unanticipated failure in operation, but having a failure during maintenance that requires 6 hours to resolve is just poor contingency planning. Imagine how happy the customer will be when they know they were down 6 hours because of something you specifically did.

It seems like the power competency of data center operators is getting worse and worse, given the string of high profile catastrophic failures this year, everything from human screwups (The Planet) to disintegrating infrastructure (Fisher Plaza).

The trend that concerns me the most is towards cheaper, “micro run time” power backup such as flywheels. Where traditional UPS systems typically run 30 minutes, flywheel systems run less than 1 minute. They certainly work fine if everything goes right, but if anything goes wrong, there’s zero time to deal with it and you’re going down uncontrolled. Granted, a flywheel UPS may be greener, since it doesn’t require lead the way battery strings do, but their one-shot-or-nothing operation is seriously risky.

We take a different approach to this. We combine the green benefits of data center DC power with battery strings big enough to provide hours of run time, not minutes or seconds. It simply isn’t worth taking the risk.
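
As a rough worked example (the figures are assumptions for illustration, not a description of our actual plant), the runtime difference is dramatic:

```python
"""Rough battery runtime arithmetic. The figures are illustrative assumptions,
not a description of any particular plant."""


def runtime_hours(amp_hours: float, string_voltage: float, load_watts: float,
                  usable_fraction: float = 0.8) -> float:
    """Approximate runtime, ignoring Peukert effect and conversion losses."""
    usable_wh = amp_hours * string_voltage * usable_fraction
    return usable_wh / load_watts


# A -48 VDC string of 1000 Ah cells carrying a 10 kW load:
print(f"battery string: {runtime_hours(1000, 48, 10_000):.1f} hours")  # ~3.8 hours
# Versus a flywheel holding on the order of 15 seconds at the same load:
print(f"flywheel:       {15 / 3600 * 60:.2f} minutes")                 # 0.25 minutes
```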

Sometimes things go so bad there isn’t anything you can do, but at least give yourself a chance. It will put you ahead of the big boys.

Vern, SwiftWater Telecom
