I was just reading an article by David Linthicum about combating cloud outages. While it isn't really wrong, I think it misses the point about what it actually takes to keep a cloud up and flying.
The “core issue” with cloud computing failures is NOT overcapacity. A quick look at major failures over this year and last shows everything from human screwups (the 2009 Google Gmail failure from botched maintenance work) to endless lists of power-related problems (four failures of Amazon’s EC2 service in one week) to, yes, overcapacity issues (Google AppEngine, repeated Twitter failures).
The human-caused cloud failures have been especially confounding. From failures of untested software to power equipment installed without anyone bothering to check its configuration (one of the recent Amazon EC2 failures), the list of incompetent engineering and operations incidents is astonishing.
So what is the real core issue with cloud computing failures? Aside from the obvious screw-ups and foul-ups, the real issue is the magnifying effect of the cloud. The increased “population density” on the same hardware magnifies the effect of any failure.
Power-fail one dedicated server back in the days P.C. (pre-cloud) and you took out one customer. Power-fail a single cloud server and now you’ve knocked out 10 (or far more) customers. The failure modes aren’t significantly different in a cloud; the magnitude of the effect is.
So what is the solution?
1. Meticulous attention to detail in constructing, engineering, and operating the cloud. Take the human goofs out of the equation.
2. Never, ever load any software into the cloud itself that hasn’t been thoroughly tested. This should be obvious, but for some reason it isn’t (this is why we operate a “stunt cloud” in addition to the production cloud).
3. Segment the cloud infrastructure (power especially). No attention to detail is ever going to be perfect, so minimize how much of the cloud a single failure can take out.
4. Automate, automate, automate. Rebalance workloads to compensate for down infrastructure, detach failed hosts, and restart their workloads on running hosts, all automatically.
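To make point 4 concrete, here's a minimal sketch of the kind of automated rebalancing logic involved: when a host goes down, its workloads are restarted on the least-loaded healthy hosts. The host names, workload names, and health map are hypothetical stand-ins; a real cloud would drive this from live monitoring and an orchestration layer.

```python
# Sketch of automated failure response: detach down hosts and restart
# their workloads on healthy hosts. All names here are illustrative.

def rebalance(workloads, healthy):
    """Reassign workloads from down hosts to the least-loaded healthy hosts."""
    load = {host: 0 for host, up in healthy.items() if up}
    placement = {}
    for wl, host in workloads.items():
        if healthy.get(host):
            placement[wl] = host                # host is up: leave workload alone
            load[host] += 1
        else:
            target = min(load, key=load.get)    # pick least-loaded healthy host
            placement[wl] = target              # restart the workload there
            load[target] += 1
    return placement

# node2 just lost power; its two workloads must move automatically.
healthy = {"node1": True, "node2": False, "node3": True}
workloads = {"vm-a": "node1", "vm-b": "node2", "vm-c": "node2", "vm-d": "node3"}

new_placement = rebalance(workloads, healthy)
```

The key design point is that no human is in the loop: detection of the down host feeds straight into placement, which is what makes the fast restore times in the next paragraph possible.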
On our cloud computing service, anything short of a smoking-hole disaster automatically starts restoring workloads within 5 minutes, with an absolute maximum of 15 minutes until everything is restored. Compare this to the 7+ hour restore times for Amazon EC2 outages.
Notice I didn’t say anything about capacity here. Adding capacity to a cloud is one of the easiest and fastest things to do (we go from bare server to loaded and operating in the cloud in 10 minutes or less).
The real key to keeping the cloud flying is to minimize the goofs, limit the effect of the goofs, and automate a lightning-fast response to the goofs that make it through despite the best intentions.
Cloud failures happen; it’s the response to them that makes the difference.
Email or call me or visit the SwiftWater Telecom web site for cloud computing services.