Tag Archives: EC2

Does your cloud soar like an eagle? #cloudcomputing

From what I can see in this article, in which an IBM analysis of Amazon finds that cloud computing support forums are lacking, it’s more like limping along like a ruptured duck.

First is the idea that it can take 10 hours or more just to get a response to a problem (unless you shell out for premium support) and possibly days to resolve it. I don’t know any of my customers who would stand for this kind of turtle-paced response when they have a problem. I don’t know if EC2 customers are just running exceptionally low priority work or if the cloud has become commoditized to the point where people accept this as normal. My customers would be lined up at the door with torches and pitchforks.

Next we get this wonderful quote:

“… it typically does not provide guarantees on individual instance availability. Users should expect that the provided virtual resources may become unavailable at times, thus requiring users to restart their application instances on a new server.”

In other words, any particular part of the cloud may disappear without warning at any time and it’s up to the end customer to monitor and restart their own virtual machines on other running facilities. It’s bad enough to tick off customers by crashing the infrastructure, but to make them responsible for their own manual disaster recovery is rather pathetic, especially considering the opportunities for automation in the cloud, which apparently Amazon doesn’t take advantage of.

Then there’s the note that 166 administrators were involved in problem resolution but that most of the problems were resolved by just 10 administrators. That means that 94% of the administrators apparently weren’t pulling their weight. Anyone else see the problem with that?

So, how do you fix this, just what should cloud computing customers really expect? Well, beyond the obvious of decent response and resolution times, cloud automation holds the key.

Our own cloud service is based on the excellent Xen Cloud Platform, on top of which we added our own Xen Cloud Control System operations software. As part of this package, we have 2 watchdogs (the virtual machine watchdog and the host watchdog) plus the load balancer.

Accidentally shut down a virtual machine and the VM watchdog restarts it automatically. Lose a host server and the host watchdog detaches the down host, and the VM watchdog restarts all the virtual machines that were running on it. The load balancer automatically adjusts the cloud workloads for all of these changes. The maximum time to restore any virtual machine is 15 minutes, and it happens without any manual intervention from the customer or a cloud operator.
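The post above describes what the VM watchdog does, not how; a minimal sketch of its core loop might look like the following. The names, states, and start_vm hook are illustrative stand-ins, not actual XCCS code:

```python
# Hypothetical sketch of a VM watchdog pass. "desired" records which
# VMs should be running; "actual" is what a scan of the cloud found;
# start_vm is whatever hook the platform provides to boot a VM.

def vm_watchdog_pass(desired, actual, start_vm):
    """Restart any VM that should be running but was found down."""
    restarted = []
    for vm, wanted in desired.items():
        if wanted == "running" and actual.get(vm) != "running":
            start_vm(vm)          # e.g. an API call to the hypervisor
            restarted.append(vm)
    return restarted
```

A pass like this, run on a short timer, is enough to catch an accidental shutdown without waiting for a customer report.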

This is what the customer should expect their cloud to be, not a ruptured duck (or a turkey).

Email or call me or visit the SwiftWater Telecom web site for cloud computing services.


swiftwater telecom rcs cloud computing logo


Keeping the cloud flying. #cloudcomputing

I was just reading an article by David Linthicum about combating cloud outages. While not really wrong, I think it misses the point about what it takes to really keep a cloud up and flying.

The “core issue” with cloud computing failures is NOT overcapacity. A quick look at major failures over this year and last shows everything from human screwups (Google’s Gmail failure from botched maintenance work in 2009) to endless lists of power related problems (4 failures of Amazon’s EC2 service in one week) to, yes, overcapacity issues (Google App Engine, repeated Twitter failures).

The human-caused cloud failures have been especially confounding. Failures of untested software, power equipment installed without anyone bothering to check its configuration (one of the recent Amazon EC2 failures); the list of incompetent engineering and operations incidents is astonishing.

So what is the real core issue with cloud computing failures? Aside from the obvious screw ups and foul ups, the real issue is the magnifying effect of the cloud. The increased “population density” on the same hardware magnifies the effect of any failure.

Power fail one dedicated server back in the P.C. (pre-cloud) days and you took out one customer. Power fail a single cloud server and now you’ve knocked out 10 (or far more) customers. The failure modes aren’t significantly different in a cloud; the magnitude of the effect is.
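To put the magnifying effect in numbers (the consolidation ratio here is the post’s illustrative 10 VMs per host, not a measured figure):

```python
# Back-of-the-envelope blast radius of a host failure, assuming
# roughly 10 customer VMs per physical host (the post's example).

def customers_hit(failed_hosts, vms_per_host=10):
    """Customers knocked out when this many physical hosts fail."""
    return failed_hosts * vms_per_host

print(customers_hit(1, vms_per_host=1))  # pre-cloud dedicated server: 1
print(customers_hit(1))                  # one cloud host: 10 customers
```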

So what is the solution?

1. Meticulous attention to detail in constructing, engineering, and operating the cloud. Take the human goofs out of the equation.

2. Never ever ever load any software into the cloud itself that hasn’t been tested thoroughly. This should be obvious but for some reason it isn’t (this is why we operate a “stunt cloud” in addition to the production cloud).

3. Segment the cloud infrastructure (power). No attention to detail is ever going to be perfect so minimize the amount of the cloud a failure can take out.

4. Automate, automate, automate. Rebalance workloads to compensate for down infrastructure and detach down hosts and restart their workloads on running hosts, automatically.
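Point 4 is the easiest to make concrete. Here is a rough sketch of the detach-and-restart idea under the simplifying assumption that load is just a VM count (a real system would use measured load; all names here are illustrative, not our production code):

```python
# Illustrative sketch of point 4: when a host dies, detach it and
# restart its workloads on the surviving hosts, least loaded first.

def recover_host_failure(hosts, dead_host):
    """hosts: {host_name: set of vm_names}. Returns a migration plan
    mapping each displaced VM to the host it was restarted on."""
    victims = hosts.pop(dead_host, set())   # detach the down host
    plan = {}
    for vm in sorted(victims):
        # pick the surviving host currently carrying the fewest VMs
        target = min(hosts, key=lambda h: len(hosts[h]))
        hosts[target].add(vm)
        plan[vm] = target
    return plan
```

The point of automating this is exactly the one the list makes: no operator has to notice the failure before recovery starts.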

On our cloud computing service, anything short of a smoking-hole disaster automatically starts restoring workloads in 5 minutes, with an absolute maximum of 15 minutes until everything is restored. Compare this to the 7+ hour restore times for Amazon EC2 outages.

Notice I didn’t say anything about capacity here. Adding capacity to a cloud is one of the easiest and fastest things to do (we go from bare server to loaded and operating in the cloud in 10 minutes or less).

The real key to keeping the cloud flying is to minimize the goofs, limit the effect of the goofs, and automate a lightning fast response to the goofs that make it by despite the best intentions.

Cloud failures happen, it’s the response to them that makes the difference.

Email or call me or visit the SwiftWater Telecom web site for cloud computing services.



#datacenter failures and restoring a cloud #cloudcomputing

I’ve been reading lately about the multiple recent failures that have plagued Amazon’s EC2 cloud computing service (4 in one week!). I’m not going to address the ongoing power reliability problems (I was amused to see a recent YouTube video touting how reliable Amazon is), but I am going to talk about cloud failure recovery.

The failure modes in a cloud computing system can include inadvertent virtual machine shutdown, partial failure of connectivity or operation of the cloud (likely due to data center failures such as Amazon’s power failures), or total failure of the entire cloud (the smoking-hole disaster). Failures will always happen, but taking 7 hours to restore customer virtual machines (as Amazon has done) is pathetic.

In our own cloud, based on the excellent Xen Cloud Platform, we’ve created a fully automated front end package called Xen Cloud Control System. As part of XCCS, we’ve included 3 pieces that automate the response to these types of cloud failures.

The virtual machine watchdog finds and restarts virtual machines that get unintentionally shut down, automatically, without any need for a report from the customer. Remember, detecting and resolving a failure before the customer squawks is the best policy.

The host watchdog detects and clears failed host servers so that the virtual machine watchdog can restart the virtual machines. This gets the virtual machines restarted on another running host, instead of waiting for the failed host to be repaired as Amazon seems to require.

The final piece of the puzzle is the load balancer. The load balancer automatically readjusts the cloud, compensating for failures.
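As a rough illustration of the rebalancing idea, counting VMs stands in for real load measurement below; this is a hypothetical sketch, not the actual XCCS balancer:

```python
# Illustrative rebalancing step: shift VMs from the busiest host to
# the idlest until the spread is within one VM. A real balancer
# would work from measured load, not raw VM counts.

def rebalance(hosts):
    """hosts: {host_name: list of vm_names}. Evens out VM counts in
    place and returns the list of (vm, from_host, to_host) moves."""
    moves = []
    while True:
        busiest = max(hosts, key=lambda h: len(hosts[h]))
        idlest = min(hosts, key=lambda h: len(hosts[h]))
        if len(hosts[busiest]) - len(hosts[idlest]) <= 1:
            return moves
        vm = hosts[busiest].pop()
        hosts[idlest].append(vm)
        moves.append((vm, busiest, idlest))
```

Run after the watchdogs do their work, a step like this absorbs the displaced workloads without anyone touching a console.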

These 3 pieces not only keep the cloud running as smoothly as possible, they ensure the maximum time to restore customer virtual machines is 15 minutes, as long as any part of the cloud is still running. This is all done automatically, no operator intervention required.

I have no idea what the heck Amazon is doing, but if the worst happens, 15 minutes to restore service on a cloud beats the stuffing out of 7 hours.


Tuesday data center tidbits: “random” Amazon data center failures and more

I’ve just been reading an article about whether the recent data center power failures of Amazon’s EC2 cloud computing service are really “random”. Yes, it’s true that every data center, no matter how good, can have service affecting failures. However, there’s a difference between having a bolt of lightning come down, knock out the commercial power, and fry your generator all at the same time, versus having a transfer switch fail because nobody bothered to verify its configuration when it was put in. Rank carelessness is rank carelessness and you can’t excuse it by noting that other providers have problems too.

Today’s “duh” comment of the day from a piece about data center cooling:

“It actually takes more energy to make the air 55 degrees than it does to make it 65 degrees.”

Um, if you didn’t know that it takes more energy to cool the air more, run, don’t walk, to the nearest co-location provider and hand your servers over to someone who knows what they’re doing.

Email or call me or visit the SwiftWater Telecom web site for cloud computing services.



Friday data center tidbits: Amazon bozos #cloudcomputing (again)(and again and again)

Today I’ve been reading more detail into yet another recent data center power failure of the Amazon EC2 cloud computing service. There isn’t much they could do about having a car crash into a utility pole, but the rest of it is purely ridiculous.

1. The transfer switch that failed came with a different default configuration than their regular ones. What kind of a bonehead installs a critical piece of gear like a transfer switch (or any other piece of data center infrastructure equipment) without checking the configuration? Operate like this and you deserve what you get.

2. This is a perfect example of why trying to save money with the absolute minimum of backup power runtime is a dumb idea. Sure, if everything goes perfectly, 15 seconds to get a generator started is fine. Gambling that you aren’t going to need enough runtime to deal with a problem is a bad risk.

3. Excusing this by saying only some customers were affected does not inspire confidence. Just how many more of these misconfigured transfer switches are out there waiting to screw up?
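A quick back-of-the-envelope on point 2, with numbers that are mine rather than Amazon’s: if a stuck transfer switch takes an operator ten minutes to bypass, battery runtime has to be sized for that failure case, not the 15-second happy path:

```python
# Hypothetical numbers illustrating why minimum-runtime backup is a
# gamble: size the batteries for the failure case, not the best case.

generator_start_s = 15        # normal automatic generator start
manual_bypass_s = 10 * 60     # operator working around a bad switch
safety_factor = 2             # engineering margin (my assumption)

needed_runtime_s = manual_bypass_s * safety_factor
print(needed_runtime_s // 60, "minutes of battery vs a 15 second plan")
```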

This is the first time I’ve ever had anyone double up on data center bozo of the week. I’ll take suggestions from the floor on what would be an appropriate award for this dubious honor.

Email or call me or visit the SwiftWater Telecom web site for green and reliable data center DC power, bozo free!


Monday data center tidbits: Amazon bozos cloud computing (again)

I’ve just been reading about the latest failure of Amazon’s EC2 cloud computing service. This is screwed up on so many levels.

1. Would you REALLY run operating servers from just a generator alone? Did Amazon learn nothing from the IBM debacle that blew Air New Zealand out of the water?

2. Power segmentation in the data center serving cloud computing hosts is critical. The same failure that used to take down one customer now takes down 10 (or more).

3. Taking 7 hours to restore virtual machines (instances) is ridiculous.

It appears to me that Amazon instances can’t be restarted on anything except the physical hosts they were running on. This flies in the face of the very idea of cloud computing.

In our cloud computing service, virtual machine and host watchdogs monitor for failures of virtual machines and physical hosts. Virtual machines that fail are automatically restarted, and physical cloud hosts that fail are automatically cleared, with their virtual machines restarted on a running host. The maximum time to restore customer virtual machines (short of a total meltdown) is 15 minutes. 7 hours is pathetic.

Data center bozos of the week goes out to Amazon (your multicolored wigs and squeaky noses are in the mail, guys!).

Email or call me or visit the SwiftWater Telecom web site for cloud computing services, bozo free!


Cloud computing not vaporizing anytime soon.

I just read a provocative piece about why cloud computing will vaporize. I think the oracle is out to lunch on this one.

First of all, we get an apples-to-oranges comparison to a technology that couldn’t deliver on its hype. Overpriced and not solving any problem is certainly a recipe for failure, that’s for sure.

On the other hand, we find that cloud computing is pretty much delivering on everything promised. Energy savings (making it a critical component of any green data center plan), cost savings, high flexibility, easy deployment of virtual servers, high reliability (if done right), disaster recovery and avoidance; there’s a ton of benefits to cloud computing, all being proven in real action.

The next assertion is that you can’t build a sustainable business selling capacity unless you have a “distinct advantage”. I’m sure all the data center co-location, hosting, and dedicated server providers out there are surprised to know this. The infrastructure isn’t just going to appear magically and most of the people using it are NOT going to be building their own. I don’t believe the market for cloud capacity is just going to settle out on the 900lb gorillas of giant corporations, any more than the traditional data center market did. There’s certainly plenty of little guys making it out there.

In general, the largest cloud computing providers are the easiest to differentiate against. Amazon, Google, Microsoft, Rackspace, IBM, and others have all had repeated major self-inflicted cloud outages in 2009, with major impact to their customers and just awful customer service response. The easiest differentiator is doing it well when the competition is screwing it up royally. It’s actually easy to remove the big boys’ differentiators, since I can easily build my own cloud with Amazon’s EC2/S3 interface (Eucalyptus) or Google’s App Engine interface (AppScale).

As for there being a “zillion” cloud computing providers out there, I’ll believe it when I see proof of the number. The ones who do it well will survive and the ones who do it poorly will not, just the way it’s always been. I don’t believe that only the people who build the apps will survive. I do believe you have to educate people on how to apply cloud computing, and that’s what I do with our cloud.

In light of the incredible demand, I’m not giving up on our cloud anytime soon. I think the original author needs something besides goat entrails before making his next prediction.

