Tag Archives: Amazon

The cloud computing service contract red herring.


I’ve just been reading about “legal issues” surrounding cloud computing provider service contracts. There’s a great deal of common sense missing from this wildly overhyped issue.

The first wail that comes from these people is “but they won’t negotiate the liability in the contract!”. This means (to them) that you should run away as fast as possible. This is ridiculous. Any provider of services sets its contract terms against its price as part of the service. Any renegotiation of terms is going to mean a change in price, so you can’t expect someone to assume high liability for no money.

The second point is that some providers simply may not want to offer those kinds of terms because that’s not the business they’re in. This isn’t a sign that there’s something wrong with the provider or the product; it’s a sign that the service is the wrong fit for the customer’s needs. The provider shouldn’t be ripped for declining to modify its product and business model for a customer who is just a bad fit.

Next up is the idea that the cloud computing provider should take all or most of the liability for security in “the cloud”. The problem is that this assumes all variants of cloud computing are the same. In IaaS (infrastructure as a service), the cloud provider sells the customer a virtual machine on the cloud. The customer has full control over the virtual machine and the cloud provider has virtually none. In this case, the cloud provider could be expected to be responsible for the integrity of the underlying cloud and storage platforms, but nothing else. You wouldn’t rent an unmanaged dedicated server and expect the data center provider to be responsible if you mess up administering it, so why should IaaS be any different?

As you go up the scale, cloud provider responsibility grows. With PaaS (platform as a service), the cloud provider is responsible for the integrity of the underlying infrastructure and the platform itself, but not for what the customer does with it. With SaaS (software as a service), the cloud provider is responsible for everything right up to the customer-facing application itself.
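The sliding scale above can be sketched as a simple lookup table. The layer names and boundaries below are illustrative assumptions for the sketch, not any provider’s actual contract terms:

```python
# A sketch of the responsibility split described above. Layer names and
# boundaries are illustrative, not any provider's actual contract terms.
PROVIDER_RESPONSIBLE = {
    "IaaS": {"facility power", "underlying cloud", "storage platform"},
    "PaaS": {"facility power", "underlying cloud", "storage platform",
             "platform runtime"},
    "SaaS": {"facility power", "underlying cloud", "storage platform",
             "platform runtime", "customer-facing application"},
}

def provider_owns(model, layer):
    """True if, under this sketch, the provider is liable for the layer."""
    return layer in PROVIDER_RESPONSIBLE[model]
```

The point of the table is the one the post makes: under IaaS the provider owns nothing above the storage platform, while under SaaS it owns everything.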

This, and the understanding that assuming more risk means more money, should make this a less complicated issue to resolve.

Oh, by the way, if you can’t find anyone that will give you the terms you want at the price you want, this should be a clue that what you’re trying to do isn’t reasonable.

Email or call me or visit the SwiftWater Telecom web site for cloud computing services.

Vern

Does your cloud soar like an eagle? #cloudcomputing


Judging from this article on the IBM analysis of Amazon, which finds that cloud computing support forums are lacking, it’s more like limping along like a ruptured duck.

First is the idea that it can take 10 hours or more just to get a response to a problem (unless you shell out for premium support) and possibly days to resolve it. I don’t know any of my customers who would stand for this kind of turtle response when they have a problem. I don’t know if EC2 customers are just running exceptionally low priority work or if the cloud has become commoditized to the point where people accept this as normal. My customers would be lined up at the door with torches and pitchforks.

Next we get this wonderful quote:

“… it typically does not provide guarantees on individual instance availability. Users should expect that the provided virtual resources may become unavailable at times, thus requiring users to restart their application instances on a new server.”

In other words, any particular part of the cloud may disappear without warning at any time and it’s up to the end customer to monitor and restart their own virtual machines on other running facilities. It’s bad enough to tick off customers by crashing the infrastructure, but making them responsible for their own manual disaster recovery is rather pathetic, especially considering the opportunities for automation in the cloud, which Amazon apparently doesn’t take advantage of.

Then there’s the note that 166 administrators were involved in problem resolution but that most of the problems were resolved by just 10 administrators. That means that 94% of the administrators apparently weren’t pulling their weight. Anyone else see the problem with that?
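The arithmetic behind that 94% figure, as a quick sanity check:

```python
# Reproducing the figure above: 166 administrators involved in problem
# resolution, but most problems resolved by just 10 of them.
total_admins = 166
resolving_admins = 10
idle_pct = round((total_admins - resolving_admins) / total_admins * 100)
print(idle_pct)  # → 94
```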

So, how do you fix this, just what should cloud computing customers really expect? Well, beyond the obvious of decent response and resolution times, cloud automation holds the key.

Our own cloud service is based on the excellent Xen Cloud Platform, on top of which we added our own Xen Cloud Control System operations software. As part of this package, we have 2 watchdogs (the virtual machine watchdog and the host watchdog) plus the load balancer.

Accidentally shut down a virtual machine and the VM watchdog restarts it automatically. Lose a host server and the host watchdog detaches the down host, and the VM watchdog restarts all virtual machines that were running on it. The load balancer automatically adjusts cloud workloads for all of these changes. Fifteen minutes maximum to restore any virtual machine, and it happens without any manual intervention from the customer or a cloud operator.
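The two-watchdog pattern described above can be sketched with a toy in-memory model of the cloud. All the names here (`Cloud`, `Host`, the watchdog methods) are hypothetical illustrations, not the actual Xen Cloud Control System code:

```python
# A toy model of the watchdog pattern above. All names here are
# hypothetical illustrations, not the real Xen Cloud Control System.
from dataclasses import dataclass, field

@dataclass
class Host:
    name: str
    alive: bool = True
    vms: set = field(default_factory=set)

class Cloud:
    def __init__(self, hosts):
        self.hosts = {h.name: h for h in hosts}

    def host_watchdog(self):
        """Detach down hosts; return the VMs they were running."""
        orphans = set()
        for host in list(self.hosts.values()):
            if not host.alive:
                orphans |= host.vms
                del self.hosts[host.name]  # detach the down host
        return orphans

    def vm_watchdog(self, orphans):
        """Restart each orphaned VM on the least-loaded running host."""
        for vm in orphans:
            target = min(self.hosts.values(), key=lambda h: len(h.vms))
            target.vms.add(vm)

# Lose host "a" and its VMs land on host "b" with no manual intervention.
cloud = Cloud([Host("a", vms={"vm1", "vm2"}), Host("b", vms={"vm3"})])
cloud.hosts["a"].alive = False
cloud.vm_watchdog(cloud.host_watchdog())
```

The design point is that recovery is driven by the watchdogs detecting the failure, not by a customer trouble ticket.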

This is what the customer should expect their cloud to be, not a ruptured duck (or a turkey).

Email or call me or visit the SwiftWater Telecom web site for cloud computing services.

Vern

swiftwater telecom rcs cloud computing logo

Keeping the cloud flying. #cloudcomputing


I was just reading an article by David Linthicum about combating cloud outages. While not really wrong, I think it misses the point about what it takes to really keep a cloud up and flying.

The “core issue” with cloud computing failures is NOT overcapacity. A quick look at major failures over this year and last shows everything from human screwups (the Google Gmail failure from botched maintenance work in 2009) to endless lists of power related problems (4 failures of Amazon’s EC2 service in one week) to, yes, overcapacity issues (Google AppEngine, repeated Twitter failures).

The human-caused cloud failures have been especially confounding. Failures of untested software, power equipment installed without anyone bothering to check its configuration (one of the recent Amazon EC2 failures): the list of incompetent engineering and operations incidents is astonishing.

So what is the real core issue with cloud computing failures? Aside from the obvious screw ups and foul ups, the real issue is the magnifying effect of the cloud. The increased “population density” on the same hardware magnifies the effect of any failure.

Power fail one dedicated server back in the days P.C. (pre-cloud) and you took out one customer. Power fail a single cloud server and now you’ve knocked out 10 (or far more) customers. The failure modes aren’t significantly different in a cloud; the magnitude of the effect is.
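The magnification effect in numbers; the density figure is the post’s own illustrative “10 (or far more)”, not measured data:

```python
# The magnification effect: the same single-server power failure, before
# and after consolidation onto a cloud host. Figures are illustrative.
vms_per_cloud_host = 10
customers_hit_dedicated = 1                      # one box, one customer
customers_hit_cloud = customers_hit_dedicated * vms_per_cloud_host
print(customers_hit_cloud)                       # → 10
```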

So what is the solution?

1. Meticulous attention to detail in constructing, engineering, and operating the cloud. Take the human goofs out of the equation.

2. Never ever ever load any software into the cloud itself that hasn’t been tested thoroughly. This should be obvious but for some reason it isn’t (this is why we operate a “stunt cloud” in addition to the production cloud).

3. Segment the cloud infrastructure (power). No attention to detail is ever going to be perfect so minimize the amount of the cloud a failure can take out.

4. Automate, automate, automate. Rebalance workloads to compensate for down infrastructure and detach down hosts and restart their workloads on running hosts, automatically.
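Point 3 can be quantified with a quick sketch: the worst-case “blast radius” of a single power failure shrinks in direct proportion to how finely the power is segmented. Host and segment counts below are illustrative assumptions:

```python
# A sketch of point 3: segment the power so one failure can't take out
# the whole cloud. Host and segment counts are illustrative assumptions.
def blast_radius(total_hosts, power_segments):
    """Worst-case fraction of the cloud lost when one power segment fails."""
    hosts_per_segment = total_hosts / power_segments
    return hosts_per_segment / total_hosts  # equivalent to 1 / power_segments

print(blast_radius(40, 1))  # → 1.0 (unsegmented: one failure downs everything)
print(blast_radius(40, 8))  # → 0.125 (one eighth of the cloud, worst case)
```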

On our cloud computing service, anything short of a smoking hole disaster automatically starts restoring workloads within 5 minutes, with an absolute maximum of 15 minutes until everything is restored. Compare this to the 7+ hour restore times for Amazon EC2 outages.

Notice I didn’t say anything about capacity here. Adding capacity to a cloud is one of the easiest and fastest things to do (we go from bare server to loaded and operating in the cloud in 10 minutes or less).

The real key to keeping the cloud flying is to minimize the goofs, limit the effect of the goofs, and automate a lightning fast response to the goofs that make it by despite the best intentions.

Cloud failures happen, it’s the response to them that makes the difference.

Email or call me or visit the SwiftWater Telecom web site for cloud computing services.

Vern


#datacenter failures and restoring a cloud #cloudcomputing


I’ve been reading lately about the multiple recent failures that have plagued Amazon’s EC2 cloud computing service (4 in one week!). I’m not going to address the ongoing power reliability problems (I was amused to see a recent YouTube video touting how reliable Amazon is), but I am going to talk about cloud failure recovery.

The failure modes in a cloud computing system can include inadvertent virtual machine shutdown, partial failure of connectivity or operation of the cloud (likely due to data center failures such as Amazon’s power failures), or total failure of the entire cloud (the smoking hole disaster). Failures will always happen, but taking 7 hours to restore customer virtual machines (as Amazon has done) is pathetic.

In our own cloud, based on the excellent Xen Cloud Platform, we’ve created a fully automated front end package called Xen Cloud Control System. As part of XCCS, we’ve included 3 pieces that automate the response to these types of cloud failures.

The virtual machine watchdog finds and restarts virtual machines that get unintentionally shut down, automatically, without any need for a report from the customer. Remember, detecting and resolving a failure before the customer squawks is the best policy.

The host watchdog detects and clears failed host servers so that the virtual machine watchdog can restart the virtual machines. This gets the virtual machines restarted on another running host, instead of waiting for the failed host to be repaired as Amazon seems to require.

The final piece of the puzzle is the load balancer. The load balancer automatically readjusts the cloud, compensating for failures.
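The rebalancing idea can be sketched with a simple greedy placement. This is an illustration of the concept only, not the actual XCCS load balancer:

```python
# An illustrative greedy rebalancer: spread workloads across whichever
# hosts are still running. Not the actual XCCS load balancer.
def rebalance(workloads, hosts):
    """Place each workload (name, cost) on the currently least-loaded host."""
    placement = {h: [] for h in hosts}
    for wl in sorted(workloads, key=lambda w: w[1], reverse=True):
        target = min(placement, key=lambda h: sum(c for _, c in placement[h]))
        placement[target].append(wl)
    return placement

# After a host failure, call rebalance() again with only the survivors.
vms = [("vm1", 4), ("vm2", 2), ("vm3", 2), ("vm4", 4)]
placement = rebalance(vms, ["host-b", "host-c"])
```

Placing the largest workloads first is a standard greedy heuristic that keeps the surviving hosts evenly loaded after a failure.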

These 3 pieces not only keep the cloud running as smoothly as possible, they ensure the maximum time to restore customer virtual machines is 15 minutes, as long as any of the cloud is still running. This is all done automatically, with no operator intervention required.

I have no idea what the heck Amazon is doing, but if the worst happens, 15 minutes to restore service on a cloud beats the stuffing out of 7 hours.

Vern

Tuesday data center tidbits: “random” Amazon data center failures and more


I’ve just been reading an article about whether the recent data center power failures of Amazon’s EC2 cloud computing service are really “random”. Yes, it’s true that every data center, no matter how good, can have service-affecting failures. However, there’s a difference between having a bolt of lightning come down, knock out the commercial power, and fry your generator all at the same time, versus having a transfer switch fail because nobody bothered to verify its configuration when it was put in. Rank carelessness is rank carelessness, and you can’t excuse it by noting that other providers have problems too.

Today’s “duh” comment of the day from a piece about data center cooling:

“It actually takes more energy to make the air 55 degrees than it does to make it 65 degrees.”

Um, if you didn’t know that it takes more energy to cool the air more, run, don’t walk, to the nearest co-location provider and hand your servers over to someone who knows what they’re doing.

Email or call me or visit the SwiftWater Telecom web site for cloud computing services.

Vern


Friday data center tidbits: Amazon bozos #cloudcomputing (again)(and again and again)


Today I’ve been reading more detail into yet another recent data center power failure of the Amazon EC2 cloud computing service. There isn’t much they could do about having a car crash into a utility pole, but the rest of it is purely ridiculous.

1. The transfer switch that failed came with a different default configuration than their regular ones. What kind of a bonehead installs a critical piece of gear like a transfer switch (or any other piece of data center infrastructure equipment) without checking the configuration? Operate like this and you deserve what you get.

2. This is a perfect example of why trying to save money with an absolute minimum of backup power run time is a dumb idea. Sure, if everything goes perfectly, 15 seconds to get a generator started is fine. Gambling that you won’t need enough run time to deal with a problem is a bad risk.

3. Excusing this by saying only some customers were affected does not inspire confidence. Just how many more of these misconfigured transfer switches are out there waiting to screw up?

This is the first time I’ve ever had anyone double up on data center bozo of the week. I’ll take suggestions from the floor on what would be an appropriate award for this dubious honor.

Email or call me or visit the SwiftWater Telecom web site for green and reliable data center DC power, bozo free!

Vern

Monday data center tidbits: Amazon bozos cloud computing (again)


I’ve just been reading about the latest failure of Amazon’s EC2 cloud computing service. This is screwed up on so many levels.

1. Would you REALLY run operating servers from just a generator alone? Did Amazon learn nothing from the IBM debacle that blew Air New Zealand out of the water?

2. Power segmentation in the data center serving cloud computing hosts is critical. The same failure that used to take down one customer now takes down 10 (or more).

3. Taking 7 hours to restore virtual machines (instances) is ridiculous.

It appears to me that Amazon instances can’t be restarted on anything except the physical hosts they were running on. This flies in the face of the very idea of cloud computing.

In our cloud computing service, virtual machine and host watchdogs monitor for failures of virtual machines and physical hosts. Virtual machines that fail are automatically restarted; physical cloud hosts that fail are automatically cleared and their virtual machines restarted on a running host. The maximum time to restore customer virtual machines (short of a total meltdown) is 15 minutes. 7 hours is pathetic.

Data center bozos of the week goes out to Amazon (your multicolored wigs and squeaky noses are in the mail, guys!).

Email or call me or visit the SwiftWater Telecom web site for cloud computing services, bozo free!

Vern