Tonight I was reading this piece about bad storage choices creating a cascading failure at an email service provider. Mistakes were compounded on mistakes here, but I think the smartest way to keep this from happening is to avoid the whole class of problem altogether.
The premise is pretty simple. Customer buys storage with redundant controllers. Customer installs the equipment and forgets about it. Load grows until it exceeds the capacity of a single controller (with no alert from the equipment). One controller fails. Service slows to a crawl, and the customer isn't monitoring the equipment, so nobody notices until users complain.
Obviously there were a number of dumb things done here. Not monitoring load and allowing capacity to drop below N+1, resulting in a cascading failure when the first controller failed, is a big one (read up on Google’s gmail debacle from last year for another perfect example of this). Not monitoring for the hardware failure is another. Not engineering for growth is a major one.
It goes without saying that a decent job of engineering and monitoring the equipment would have nipped this problem in the bud. Choosing equipment that was more proactive about alarming on traffic overload would have helped too, as would better, albeit more expensive, redundancy.
I think there's a better alternative: build the service on an IaaS cloud. Consider how disaster recovery (or even better, disaster avoidance) works with our cloud computing service.
Have a virtual server shut down unexpectedly? Our VM watchdog service automatically restarts it. Have a load balancer virtual machine fail? The redundant load balancer automatically takes over. Have a physical host server or the cloud pool master fail? Our host watchdog automatically restarts the virtual machines on a running host server.
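The VM watchdog idea above is simple enough to sketch. This is a hypothetical illustration, not our actual implementation: a polling loop that checks each virtual machine's state and restarts any that have stopped. The `VM` class and `watchdog_pass` function are stand-ins I've made up for the example.

```python
# Hypothetical sketch of a VM watchdog: poll the fleet, restart anything
# that isn't running. A real service would call the hypervisor's API here.

class VM:
    """Stand-in for a virtual machine record (name + current state)."""
    def __init__(self, name):
        self.name = name
        self.state = "running"

def watchdog_pass(vms):
    """One polling pass: restart every VM that is not running.
    Returns the names of the VMs it restarted."""
    restarted = []
    for vm in vms:
        if vm.state != "running":
            vm.state = "running"  # real version: hypervisor restart call
            restarted.append(vm.name)
    return restarted

# Simulate an unexpected shutdown and one watchdog pass.
fleet = [VM("web-1"), VM("web-2")]
fleet[1].state = "stopped"
print(watchdog_pass(fleet))  # prints ['web-2']
```

The same pattern covers the host-level case: when a physical host or pool master fails, the host watchdog reassigns that host's VMs to a surviving host and runs the same restart pass there.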
Need more CPU power or memory? 30 seconds adds either one. Run out of capacity altogether? 5 minutes clones an existing virtual web server, sets a new IP address, and adds it to the balancer.
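The five-minute clone step breaks down into three actions: copy an existing web server image, assign it a fresh IP, and register it with the load balancer. Here's a rough sketch of that flow; the `clone_web_server` function, the pool structure, and the naming scheme are all hypothetical, chosen just to show the shape of the operation.

```python
# Hypothetical sketch of cloning a web server into a load-balanced pool:
# copy the template, give the clone a new name and IP, add it to the pool.
import copy
import ipaddress

def clone_web_server(template, pool, new_ip):
    """Clone an existing server record, assign new_ip, register in pool."""
    clone = copy.deepcopy(template)       # real version: clone the VM image
    clone["name"] = f"web-{len(pool) + 1}"
    clone["ip"] = str(new_ip)
    pool.append(clone)                    # real version: update the balancer
    return clone

# One server in the pool; capacity runs out, so we add a second.
pool = [{"name": "web-1", "ip": "10.0.0.10"}]
new = clone_web_server(pool[0], pool, ipaddress.ip_address("10.0.0.11"))
print(new["name"], new["ip"])  # prints web-2 10.0.0.11
```

The point is that scaling out becomes a small scripted operation rather than a purchase order and a data-center visit.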
Not only can our cloud computing service handle almost anything short of a smoking hole disaster without having to be watched, flubbing capacity engineering is no longer an expensive and painful thing to correct. It’s now simple to avoid overload failures as well as adjust for unanticipated load increases. And best of all, you no longer have to deal with a fallible and overpriced piece of prima donna hardware in the data center.
The best solution to a problem usually is to avoid it completely in the first place.