I’ve been reading lately about the multiple recent failures that have plagued Amazon’s EC2 cloud computing service (4 in one week!). I’m not going to address the ongoing power reliability problems (I was amuse to see a recent YouTube video touting how reliable Amazon is) but I’m going to talk about cloud failure recovery.
The failure modes in a cloud computing system can include inadvertant virtual machine shutdown, partial failure of connectivity or operation of the cloud (likely due to data center failures such as Amazon’s power failures), or total failure of the entire cloud (the smoking hole disaster). Failures will always happen but taking 7 hours to restore customer virtual machines (as Amazon has done) is pathetic.
In our own cloud, based on the excellent Xen Cloud Platform, we’ve created a fully automated front end package called Xen Cloud Control System. As part of XCCS, we’ve included 3 pieces that automate the response to these type of cloud failures.
The virtual machine watchdog finds and restarts virtual machines that get unintentionally shut down, automatically, without any need for a report from the customer. Remember, detecting and resolving a failure before the customer squawks is the best policy.
The host watchdog detects and clears failed host servers so that the virtual machine watchdog can restart the virtual machines. This gets the virtual machines restarted on another running host, instead of waiting for the failed host to be repaired as Amazon seems to require.
The final piece of the puzzle is the load balancer. The load balancer automatically readjusts the cloud, compensating for failures.
These 3 pieces not only keep the cloud running as smoothly as possible, they insure the maximum time to restore customer virtual machines is 15 minutes as long as any of the cloud is still running. This is also all done automatically, no operator intervention required.
I have no idea what the heck Amazon is doing, but if the worst happens, 15 minutes to restore service on a cloud beats the stuffing out of 7 hours.