#datacenter failures and restoring a cloud #cloudcomputing

I’ve been reading lately about the multiple recent failures that have plagued Amazon’s EC2 cloud computing service (4 in one week!). I’m not going to address the ongoing power reliability problems (I was amused to see a recent YouTube video touting how reliable Amazon is), but I am going to talk about cloud failure recovery.

The failure modes in a cloud computing system can include inadvertent virtual machine shutdown, partial failure of connectivity or operation of the cloud (likely due to data center failures such as Amazon’s power failures), or total failure of the entire cloud (the smoking hole disaster). Failures will always happen, but taking 7 hours to restore customer virtual machines (as Amazon has done) is pathetic.

In our own cloud, based on the excellent Xen Cloud Platform, we’ve created a fully automated front end package called Xen Cloud Control System. As part of XCCS, we’ve included 3 pieces that automate the response to these types of cloud failures.

The virtual machine watchdog finds and restarts virtual machines that get unintentionally shut down, automatically, without any need for a report from the customer. Remember, detecting and resolving a failure before the customer squawks is the best policy.
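To make the idea concrete, here is a minimal sketch of one watchdog pass. It is not the XCCS code: the `VM` record, the `should_run` flag, the power state strings, and the `start_vm` callback are all hypothetical names made up for illustration. The point is only that the watchdog compares what each VM should be doing against what it is actually doing, and restarts the mismatches.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class VM:
    name: str
    power_state: str   # hypothetical values: "running" or "halted"
    should_run: bool   # False when the customer shut the VM down on purpose


def find_unintended_shutdowns(vms: Iterable[VM]) -> List[VM]:
    """Return VMs that are halted even though they are supposed to be running."""
    return [vm for vm in vms if vm.should_run and vm.power_state == "halted"]


def watchdog_pass(vms: Iterable[VM], start_vm: Callable[[VM], bool]) -> List[str]:
    """Restart every unintentionally halted VM; return the names restarted.

    start_vm is a stand-in for whatever the platform uses to boot a VM.
    It should return True on success.
    """
    restarted = []
    for vm in find_unintended_shutdowns(vms):
        if start_vm(vm):
            vm.power_state = "running"
            restarted.append(vm.name)
    return restarted
```

Run on a timer, a pass like this catches the shutdown without the customer ever having to report it. VMs the customer stopped deliberately are left alone because `should_run` is False.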

The host watchdog detects and clears failed host servers so that the virtual machine watchdog can restart the virtual machines. This gets the virtual machines restarted on another running host, instead of waiting for the failed host to be repaired as Amazon seems to require.
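A rough sketch of that clearing step might look like the following. Everything here is an assumption for illustration, not how XCCS actually does it: the heartbeat table, the 60 second timeout, and the VM-to-host placement map are hypothetical. The idea is that a host that has gone silent is dropped from the pool, and its VMs are marked as no longer resident anywhere, so they become candidates for a restart on a surviving host.

```python
from typing import Dict, List, Optional


def clear_failed_hosts(
    last_heartbeat: Dict[str, float],
    vm_host: Dict[str, Optional[str]],
    now: float,
    timeout: float = 60.0,  # hypothetical threshold, in seconds
) -> List[str]:
    """Drop hosts that have missed their heartbeat and orphan their VMs.

    Returns the sorted names of the VMs that were on failed hosts.
    Both dictionaries are updated in place.
    """
    failed = {h for h, seen in last_heartbeat.items() if now - seen > timeout}

    orphaned = sorted(vm for vm, host in vm_host.items() if host in failed)
    for vm in orphaned:
        vm_host[vm] = None  # no longer tied to the dead host

    for host in failed:
        del last_heartbeat[host]  # remove the host from the live pool

    return orphaned
```

Once the VMs are untied from the dead host, nothing has to wait for that host to be repaired.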

The final piece of the puzzle is the load balancer. The load balancer automatically readjusts the cloud, compensating for failures.
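One simple way to picture the readjustment is a greedy placement: take the VMs that need a home, biggest first, and put each on the surviving host with the most free memory. This is only a sketch under that assumption. The post does not say what policy the XCCS load balancer uses, and the function name and the memory-only model are made up for illustration.

```python
from typing import Dict


def place_vms(vm_memory: Dict[str, int], host_free: Dict[str, int]) -> Dict[str, str]:
    """Greedily assign VMs to the host with the most free memory.

    vm_memory maps VM name to memory needed; host_free maps host name to
    memory available. Units just have to match. VMs that fit nowhere are
    left out of the result.
    """
    free = dict(host_free)  # work on a copy so the caller's data is untouched
    placement: Dict[str, str] = {}
    if not free:
        return placement

    # Largest VMs first, names as a tie-break so the result is repeatable.
    for vm in sorted(vm_memory, key=lambda v: (-vm_memory[v], v)):
        # Host with the most free memory; ties go to the first name alphabetically.
        host = max(sorted(free), key=lambda h: free[h])
        if free[host] < vm_memory[vm]:
            continue  # no surviving host has room for this VM
        placement[vm] = host
        free[host] -= vm_memory[vm]
    return placement
```

For example, with `db` needing 8, `web` needing 4, and `mail` needing 2, and two surviving hosts with 10 and 6 free, `db` lands on the bigger host and the other two fill in around it.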

These 3 pieces not only keep the cloud running as smoothly as possible, they also ensure that the maximum time to restore customer virtual machines is 15 minutes, as long as any of the cloud is still running. This is all done automatically, with no operator intervention required.

I have no idea what the heck Amazon is doing, but if the worst happens, 15 minutes to restore service on a cloud beats the stuffing out of 7 hours.



2 responses to “#datacenter failures and restoring a cloud #cloudcomputing”

  1. I’m not sure that ‘automatically’ recovering a machine is all that useful, is it? All that I’m going to get is the state of that machine the last time that its state was copied to disk. That’s OK if I’ve got a simple application model (e.g. static web pages or some db that’s continuously backed up), but not if I’ve got much state in my application, especially if the state is distributed.

    Isn’t it better to get the application to spot disappeared machines and re-instantiate new ones from a known state? The recovery would know what to recover and how, rather than guessing that restoring a specific machine state is what’s needed and not telling the customer that the machine’s been down?

    • Tim,
      When I said automated recovery, I meant restart. I’d never try to restart a virtual machine from a state snapshot unless it was totally standalone and didn’t depend on or affect any other machine. You’re quite right that mixing states would be a nightmare.

      If the VM has been down, rebooting from scratch is what the customer is going to be doing anyway. Automating the detection and the reaction to it just puts the monitoring and restart in the cloud where it belongs, instead of making the customer mess with it manually (especially if a cloud failure is the cause of it in the first place).

