Tag Archives: Amazon

The cloud computing service contract red herring.


I’ve just been reading about “legal issues” surrounding cloud computing provider service contracts. There’s a great deal of common sense missing from this wildly overhyped issue.

The first wail that comes from these people is “but they won’t negotiate the liability in the contract!” This means (to them) that you should run away as fast as possible. This is ridiculous. Any provider of services sets their contract terms against their price as part of the service. Any renegotiation of terms is going to mean a change in price, so you can’t expect someone to assume high liability for no money.

The second point is that some providers simply may not want to offer those kinds of terms because that’s not the business they’re in. This isn’t a sign that there’s something wrong with the provider or the product; it’s a sign that the service is the wrong fit for the customer’s needs. The provider shouldn’t be ripped for declining to modify its product and business model for a customer who is just a bad fit.

Next up is the idea that the cloud computing provider should take all or most of the liability for security in “the cloud”. The problem is that this assumes all variants of cloud computing are the same. In IaaS (infrastructure as a service), the cloud provider sells the customer a virtual machine on the cloud. The customer has full control over the virtual machine and the cloud provider has virtually none. In this case, the cloud provider could be expected to be responsible for the integrity of the underlying cloud and storage platforms, but nothing else. You wouldn’t rent an unmanaged dedicated server and expect the data center provider to be responsible if you mess up administering it, so why should IaaS be any different?

As you go up the scale, cloud provider responsibility grows. With PaaS (platform as a service), the cloud provider is responsible for the integrity of the underlying infrastructure and the platform itself, but not for what the customer does with it. With SaaS (software as a service), the cloud provider is responsible for everything right up to the customer-facing application itself.

This, plus the understanding that assuming more risk means more money, should make this a less complicated issue to resolve.

Oh, by the way, if you can’t find anyone that will give you the terms you want at the price you want, this should be a clue that what you’re trying to do isn’t reasonable.

Email or call me or visit the SwiftWater Telecom web site for cloud computing services.

Vern

Does your cloud soar like an eagle? #cloudcomputing


From what I can see in this article about an IBM analysis of Amazon finding the cloud computing support forums lacking, Amazon’s cloud is less soaring eagle and more ruptured duck limping along.

First is the idea that it can take 10 hours or more just to get a response to a problem (unless you shell out for premium support) and possibly days to resolve it. I don’t know of any of my customers who would stand for that kind of turtle-speed response when they have a problem. I don’t know if EC2 customers are just running exceptionally low priority work or if the cloud has become commoditized to the point where people accept this as normal. My customers would be lined up at the door with torches and pitchforks.

Next we get this wonderful quote:

“… it typically does not provide guarantees on individual instance availability. Users should expect that the provided virtual resources may become unavailable at times, thus requiring users to restart their application instances on a new server.”

In other words, any particular part of the cloud may disappear without warning at any time, and it’s up to the end customer to monitor and restart their own virtual machines on other running infrastructure. It’s bad enough to tick off customers by crashing the infrastructure, but making them responsible for their own manual disaster recovery is rather pathetic, especially considering the opportunities for automation in the cloud, which apparently Amazon doesn’t take advantage of.
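To give a sense of what that “do it yourself” monitoring burden looks like, here’s a minimal sketch of a customer-side watchdog written against today’s boto3 SDK. This is my own illustration, not anything from the article; the instance IDs and polling interval are made-up placeholders.

```python
# Hypothetical sketch: customer-side watchdog for EC2 instances using the
# boto3 SDK. The instance IDs and polling interval are placeholders.
import time
import boto3

WATCHED_INSTANCES = ["i-0123456789abcdef0"]  # hypothetical instance IDs
POLL_SECONDS = 60

ec2 = boto3.client("ec2")

def check_and_restart():
    status = ec2.describe_instance_status(
        InstanceIds=WATCHED_INSTANCES,
        IncludeAllInstances=True,  # include instances that are not running
    )
    for inst in status["InstanceStatuses"]:
        state = inst["InstanceState"]["Name"]
        if state == "stopped":
            # The instance has gone away underneath us; bring it back up.
            print(f"restarting {inst['InstanceId']} (state was {state})")
            ec2.start_instances(InstanceIds=[inst["InstanceId"]])

if __name__ == "__main__":
    while True:
        check_and_restart()
        time.sleep(POLL_SECONDS)
```

That’s work the customer has to write, run, and babysit themselves, which is exactly the point.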

Then there’s the note that 166 administrators were involved in problem resolution but that most of the problems were resolved by just 10 of them. That means roughly 94% of the administrators (156 of 166) apparently weren’t pulling their weight. Anyone else see the problem with that?

So, how do you fix this, just what should cloud computing customers really expect? Well, beyond the obvious of decent response and resolution times, cloud automation holds the key.

Our own cloud service is based on the excellent Xen Cloud Platform, on top of which we added our own Xen Cloud Control System operations software. As part of this package, we have 2 watchdogs (the virtual machine watchdog and the host watchdog) plus the load balancer.

Accidentally shut down a virtual machine and the VM watchdog restarts it automatically. Lose a host server and the host watchdog detaches the down host, and the VM watchdog restarts all the virtual machines that were running on it. The load balancer automatically adjusts the cloud workloads for all of these changes. The maximum time to restore any virtual machine is 15 minutes, and it happens without any manual intervention from the customer or a cloud operator.
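For the curious, here’s a rough sketch of the kind of logic a VM watchdog implements, expressed against the standard Xen Cloud Platform xe CLI. This is an illustration only, not the actual XCCS source, and a real watchdog would also check whether a shutdown was intentional before restarting anything.

```python
# Hypothetical sketch of a VM watchdog on the Xen Cloud Platform, driven by
# the standard xe CLI: find halted VMs and start them again.
import subprocess
import time

def xe(*args):
    """Run an xe command and return its stdout, stripped."""
    return subprocess.run(["xe", *args], capture_output=True,
                          text=True, check=True).stdout.strip()

def watchdog_pass():
    # List the UUIDs of all halted VMs as one comma-separated line.
    halted = xe("vm-list", "power-state=halted", "params=uuid", "--minimal")
    for uuid in filter(None, halted.split(",")):
        # A real watchdog would first check whether the shutdown was
        # intentional; here we simply restart everything that is halted.
        print(f"restarting halted VM {uuid}")
        xe("vm-start", f"uuid={uuid}")

if __name__ == "__main__":
    while True:
        watchdog_pass()
        time.sleep(30)  # polling interval chosen arbitrarily for the sketch
```

The host watchdog and load balancer follow the same pattern: detect, decide, and act, without waiting for a customer trouble ticket.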

This is what the customer should expect their cloud to be, not a ruptured duck (or a turkey).

Email or call me or visit the SwiftWater Telecom web site for cloud computing services.

Vern


Keeping the cloud flying. #cloudcomputing


I was just reading an article by David Linthicum about combating cloud outages. While not really wrong, I think it misses the point about what it takes to really keep a cloud up and flying.

The “core issue” with cloud computing failures is NOT running out of capacity. A quick look at the major failures over this year and last shows everything from human screwups (the 2009 Google Gmail failure from botched maintenance work) to endless lists of power related problems (4 failures of Amazon’s EC2 service in one week) to, yes, capacity problems (Google App Engine, repeated Twitter failures).

The human-caused cloud failures have been especially confounding. Failures of untested software, power equipment installed without anyone bothering to check its configuration (one of the recent Amazon EC2 failures), the list of incompetent engineering and operations incidents is astonishing.

So what is the real core issue with cloud computing failures? Aside from the obvious screw ups and foul ups, the real issue is the magnifying effect of the cloud. The increased “population density” on the same hardware magnifies the effect of any failure.

Power fail one dedicated server back in the days P.C. (pre-cloud) and you took out one customer. Power fail a single cloud server and now you’ve knocked out 10 (or far more) customers. The failure modes aren’t significantly different in a cloud; the magnitude of the effect is.

So what is the solution?

1. Meticulous attention to detail in constructing, engineering, and operating the cloud. Take the human goofs out of the equation.

2. Never ever ever load any software into the cloud itself that hasn’t been tested thoroughly. This should be obvious but for some reason it isn’t (this is why we operate a “stunt cloud” in addition to the production cloud).

3. Segment the cloud infrastructure (power). No attention to detail is ever going to be perfect so minimize the amount of the cloud a failure can take out.

4. Automate, automate, automate. Rebalance workloads to compensate for down infrastructure and detach down hosts and restart their workloads on running hosts, automatically.

On our cloud computing service, anything short of a smoking-hole disaster automatically starts restoring workloads within 5 minutes, with an absolute maximum of 15 minutes until everything is restored. Compare this to the 7+ hour restore times for Amazon EC2 outages.

Notice I didn’t say anything about capacity here. Adding capacity to a cloud is one of the easiest and fastest things to do (we go from bare server to loaded and operating in the cloud in 10 minutes or less).

The real key to keeping the cloud flying is to minimize the goofs, limit the effect of the goofs, and automate a lightning fast response to the goofs that make it by despite the best intentions.

Cloud failures happen; it’s the response to them that makes the difference.

Email or call me or visit the SwiftWater Telecom web site for cloud computing services.

Vern


#datacenter failures and restoring a cloud #cloudcomputing


I’ve been reading lately about the multiple recent failures that have plagued Amazon’s EC2 cloud computing service (4 in one week!). I’m not going to address the ongoing power reliability problems (I was amused to see a recent YouTube video touting how reliable Amazon is); instead I’m going to talk about cloud failure recovery.

The failure modes in a cloud computing system can include inadvertent virtual machine shutdown, partial failure of connectivity or operation of the cloud (likely due to data center failures such as Amazon’s power failures), or total failure of the entire cloud (the smoking-hole disaster). Failures will always happen, but taking 7 hours to restore customer virtual machines (as Amazon has done) is pathetic.

In our own cloud, based on the excellent Xen Cloud Platform, we’ve created a fully automated front end package called Xen Cloud Control System. As part of XCCS, we’ve included 3 pieces that automate the response to these type of cloud failures.

The virtual machine watchdog finds and restarts virtual machines that get unintentionally shut down, automatically, without any need for a report from the customer. Remember, detecting and resolving a failure before the customer squawks is the best policy.

The host watchdog detects and clears failed host servers so that the virtual machine watchdog can restart the virtual machines. This gets the virtual machines restarted on another running host, instead of waiting for the failed host to be repaired as Amazon seems to require.

The final piece of the puzzle is the load balancer. The load balancer automatically readjusts the cloud, compensating for failures.

These three pieces not only keep the cloud running as smoothly as possible, they ensure the maximum time to restore customer virtual machines is 15 minutes, as long as any part of the cloud is still running. This is all done automatically, with no operator intervention required.
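As a rough illustration of the host watchdog / restart-elsewhere flow, here’s a sketch against the xe CLI. Again, this is not the actual XCCS code; the health check, host identifiers, and exact invocations are my own placeholders.

```python
# Hypothetical sketch of the host-watchdog flow: if a host stops responding,
# clear the power state of the VMs that were resident on it and restart them
# on a host that is still up. Illustration only, not the actual XCCS code.
import subprocess

def xe(*args):
    """Run an xe command and return its stdout, stripped."""
    return subprocess.run(["xe", *args], capture_output=True,
                          text=True, check=True).stdout.strip()

def host_is_alive(address):
    # Placeholder reachability check (a single ICMP ping); a real watchdog
    # would also probe the management API and a heartbeat store.
    return subprocess.run(["ping", "-c", "1", "-W", "2", address],
                          capture_output=True).returncode == 0

def clear_failed_host(host_uuid, spare_host_label):
    # VMs still marked as resident on the failed host.
    vms = xe("vm-list", f"resident-on={host_uuid}", "params=uuid", "--minimal")
    for uuid in filter(None, vms.split(",")):
        # Tell the pool the VM is no longer running on the dead host...
        xe("vm-reset-powerstate", f"uuid={uuid}", "--force")
        # ...then restart it on a host that is still up.
        xe("vm-start", f"uuid={uuid}", f"on={spare_host_label}")

# Example use (hypothetical host identifiers):
# if not host_is_alive("10.0.0.21"):
#     clear_failed_host("<failed-host-uuid>", "cloud-host-02")
```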

I have no idea what the heck Amazon is doing, but if the worst happens, 15 minutes to restore service on a cloud beats the stuffing out of 7 hours.

Vern

Tuesday data center tidbits: “random” Amazon data center failures and more


I’ve just been reading an article about whether the recent data center power failures of Amazon’s EC2 cloud computing service are really “random”. Yes, it’s true that every data center, no matter how good, can have service-affecting failures. However, there’s a difference between having a bolt of lightning come down, knock out the commercial power, and fry your generator all at once, versus having a transfer switch fail because nobody bothered to verify its configuration when it was installed. Rank carelessness is rank carelessness, and you can’t excuse it by noting that other providers have problems too.

Today’s “duh” comment of the day from a piece about data center cooling:

“It actually takes more energy to make the air 55 degrees than it does to make it 65 degrees.”

Um, if you didn’t know that it takes more energy to cool the air more, run, don’t walk, to the nearest co-location provider and hand your servers over to someone who knows what they’re doing.

Email or call me or visit the SwiftWater Telecom web site for cloud computing services.

Vern


Friday data center tidbits: Amazon bozos #cloudcomputing (again)(and again and again)


Today I’ve been reading more detail into yet another recent data center power failure of the Amazon EC2 cloud computing service. There isn’t much they could do about having a car crash into a utility pole, but the rest of it is purely ridiculous.

1. The transfer switch that failed came with a different default configuration than their regular ones. What kind of a bonehead installs a critical piece of gear like a transfer switch (or any other piece of data center infrastructure equipment) without checking the configuration? Operate like this and you deserve what you get.

2. This is a perfect example of why trying to save money with the absolute minimum of backup power run time is a dumb idea. Sure, if everything goes perfectly, 15 seconds to get a generator started is fine. Gambling that you won’t need enough run time to deal with a problem is a bad risk.

3. Excusing this by saying only some customers were affected does not inspire confidence. Just how many more of these misconfigured transfer switches are out there waiting to screw up?

This is the first time I’ve ever had anyone double up on data center bozo of the week. I’ll take suggestions from the floor on what would be an appropriate award for this dubious honor.

Email or call me or visit the SwiftWater Telecom web site for green and reliable data center DC power, bozo free!

Vern

Monday data center tidbits: Amazon bozos cloud computing (again)


I’ve just been reading about the latest failure of Amazon’s EC2 cloud computing service. This is screwed up on so many levels.

1. Would you REALLY run operating servers from just a generator alone? Did Amazon learn nothing from the IBM debacle that blew Air New Zealand out of the water?

2. Power segmentation in the data center serving cloud computing hosts is critical. The same failure that used to take down one customer now takes down 10 (or more).

3. Taking 7 hours to restore virtual machines (instances) is ridiculous.

It appears to me that Amazon instances can’t be restarted on anything except the physical hosts they were running on. This flies in the face of the very idea of cloud computing.

In our cloud computing service, virtual machine and host watchdogs monitor for failures of virtual machines and physical hosts. Virtual machines that fail are automatically restarted, and physical cloud hosts that fail are automatically cleared, with their virtual machines restarted on a running host. The maximum time to restore customer virtual machines (short of a total meltdown) is 15 minutes. 7 hours is pathetic.

Data center bozos of the week goes out to Amazon (your multicolored wigs and squeaky noses are in the mail, guys!).

Email or call me or visit the SwiftWater Telecom web site for cloud computing services, bozo free!

Vern

Cloud computing not vaporizing anytime soon.


I just read a provocative piece about why cloud computing will vaporize. I think the oracle is out to lunch on this one.

First of all, we get an apples-to-oranges comparison to a technology that couldn’t deliver on its hype. Overpriced and not solving any problem is certainly a recipe for failure, that’s for sure.

On the other hand, we find that cloud computing is pretty much delivering on everything promised. Energy savings (making it a critical component of any green data center plan), cost savings, high flexibility, easy deployment of virtual servers, high reliability (if done right), disaster recovery and avoidance; there’s a ton of benefits to cloud computing, all being proven in real world use.

The next assertion is that you can’t build a sustainable business selling capacity unless you have a “distinct advantage”. I’m sure all the data center co-location, hosting, and dedicated server providers out there are surprised to hear this. The infrastructure isn’t just going to appear magically, and most of the people using it are NOT going to be building their own. I don’t believe the market for cloud capacity is going to settle out on just the 900 lb gorillas of giant corporations, any more than the traditional data center market did. There are certainly plenty of little guys making it out there.

In general, the largest cloud computing providers are the easiest to differentiate against. Amazon, Google, Microsoft, Rackspace, IBM, and others have all had repeated major self-inflicted cloud outages in 2009, with major impact to their customers and just awful customer service responses to them. The easiest differentiator is doing it well when the competition is screwing it up royally. It’s actually easy to remove the big boys’ differentiators, since I can easily build my own cloud with Amazon’s EC2/S3 interface (Eucalyptus) or Google’s App Engine interface (AppScale).

As for there being a “zillion” cloud computing providers out there, I’ll believe it when I see proof of the number. The ones who do it well will survive, and the ones who do it poorly will not, just the way it’s always been. I don’t believe that only providers who also build the apps will survive. I do believe you have to educate people on how to apply cloud computing, and that’s what I do with our cloud.

In light of the incredible demand, I’m not giving up on our cloud anytime soon. I think the original author needs something besides goat entrails before making his next prediction.

Vern


The virtual data center, whose “cloud” is it anyway?


I’ve been reading this morning about “opening the cloud”. I don’t think I’ve ever seen such total misunderstanding about what “the cloud” really is.

First of all, there is no “the cloud” (I cringe every time I hear that said). Google has a cloud, Amazon has a cloud, SwiftWater Telecom has a cloud (that’s us 🙂 ). There is no giant, amorphous, entity identifiable as “the cloud” (it’s almost as bad as hearing people talk about “the Google”).

Next is the idea of wresting “the cloud” away from corporate and government control, as if there’s something inherently wrong with “cloud capitalism”. Companies that provide cloud based services are not just “organizing” things on “the cloud”. The money for the servers, facilities, and software for a cloud doesn’t magically appear from nowhere. If you want a free cloud with no corporate attachment, ante up and build one yourself; then you can give it away for free to whoever you like.

The idea that Google’s fight with China, which has devolved to censorship of Google’s search engine results, has anything to do with fighting for control of “the cloud” is ridiculous. It has everything to do with censorship, not with any legitimate security concern of China. And yes, Google owns their cloud that they provide Gmail from. It’s Google’s right to operate their facilities the way they see fit, it’s their property.

I don’t really know what to say about the idea that Google is creating a “cloud” of 10 million books (all I can picture is getting rained on by encyclopedias). I’m not commenting on Google’s plan to put books online, but I’m not sure what the connection is between digitizing books and using cloud powered services to store or disseminate them. Calling Wikipedia a “cloud” is just as incomprehensible.

I don’t believe there’s any lack of competition in cloud computing, as asserted by the article. There’s no way that “the Google” can stop me from building a cloud, acquiring customers for my cloud, or controlling the content or services of my cloud (I haven’t gotten any cease-and-desist letters from them yet). I don’t know how any corporate entity could possibly manage to control “the cloud”; you might as well try to take over the entire Internet. (What are we going to do tonight, Brain? The same thing we do every night, Pinky. Try to take over the world!)

The fact is, cloud computing advances all the social agendas listed. You can use someone’s cloud powered apps (SaaS, software as a service), you can use a cloud to develop and run your own apps (PaaS, platform as a service), or you can have a cloud powered virtual server of your very own (IaaS, infrastructure as a service). All this comes with far lower cost and lower barrier to entry. People who could never have afforded a data center dedicated or co-located server can now have their own virtual server, completely under their control. Doesn’t sound much like evil corporations trying to rule the cloud, does it?

It’s time to embrace cloud computing and stop babbling about mythical corporate overlords. This isn’t Skynet and the Terminators aren’t on their way.

Vern


Saturday data center tidbits.


First up is reading about the performance issues with Amazon’s EC2 data center cloud system. You can argue back and forth about whether it’s a capacity problem or not, but 1000 ms of latency is a serious problem. If you’re running interactive services such as virtual workstations, the cloud is all about the customer experience, and latency like that should be an all-hands-on-deck alarm.

Next up is reading about the recent attacks on Google pointing to problems with the idea of cloud computing. I don’t understand why people can’t see that bad system administration practices and risky user behaviors (malware, phishing) on cloud based virtual machines are the real problem, exactly as they are on dedicated servers. I have yet to see ANY “cloud computing security problem” that wouldn’t have happened EXACTLY the same way on dedicated servers. Neither cloud computing nor traditional dedicated servers are a panacea for bozo system administration or inviting the attacker in.

Vern
