This morning I was reading about gracefully degrading Web 2.0 applications in response to server overload issues. Graceful degradation matters for preserving your customers’ user experience, and it also ties in to the ability of dedicated or cloud computing virtual servers to adapt to unforeseen conditions.
There are any number of unpredictable things that can negatively impact the experience of customers using your web site, web app, or other Internet-based service. Servers can fail, network links can fail, demand can spike drastically due to totally unpredictable influences, and, if you’re not careful, small failures can cascade into large ones. The important thing is to make the service either resilient enough to keep operating through the failure or demand spike, or flexible enough to be rearranged to adapt to the situation.
In general, I believe that having things operating in a degraded state is a far better response than having them unavailable altogether. Customers hate uncertainty, and an unavailable service is the height of uncertainty. Is the service down? Is it coming back? Is it a problem on the customer end?
One method of dealing with this is the way Google handles the routers that feed their Gmail service. When a router becomes slower than a set benchmark, it sheds its load to the other routers and marks the path “do not use”. This is a dangerous response: shedding the load to other facilities can result in a runaway failure cascade, taking down path after path at increasing speed until every path is offline from overload, as happened to Google in recent failures of its Gmail service. Now you not only have a broken service, you also have the nightmare of restoring it without triggering the same cascade all over again.
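To see why this kind of load shedding can run away, here is a toy simulation. It is only a sketch under simplifying assumptions, not a model of Google's actual routing: every path has the same capacity, an overloaded path is dropped at once, and its traffic is split evenly among the survivors. The function name and the numbers are made up for illustration.

```python
def simulate_cascade(loads, capacity):
    """Return how many paths are still online once load shedding settles.

    loads: current load on each path (hypothetical units).
    capacity: load above which a path is marked "do not use" and hands
    its traffic to the remaining paths.
    """
    online = list(loads)
    while online:
        overloaded = [x for x in online if x > capacity]
        if not overloaded:
            break  # every remaining path is within capacity
        survivors = [x for x in online if x <= capacity]
        if not survivors:
            return 0  # nothing left to absorb the traffic: total outage
        # Split the shed traffic evenly across the surviving paths.
        share = sum(overloaded) / len(survivors)
        online = [x + share for x in survivors]
    return len(online)


# Plenty of headroom: one path drops out and the other three absorb it.
print(simulate_cascade([50, 50, 50, 110], capacity=100))
# Little headroom: the shed traffic overloads every surviving path.
print(simulate_cascade([90, 90, 90, 110], capacity=100))
```

With spare capacity, the failure stays contained. When every path is already running close to its limit, a single overloaded path is enough to take all of them down.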
The first response to overload should be to shed non-critical loads. It’s better to have individual features unavailable (with an appropriate message to the customer so they know what’s going on) while protecting the operation of the core services. This can even be done automatically on the server, based on error logs and load measurements.
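As a rough illustration of shedding non-critical loads automatically, the sketch below switches optional features off as measured load rises. The feature names, the thresholds, and the idea of expressing load as a single number between 0.0 and 1.0 are all assumptions made for the example. A real site would pick its own features and derive the load figure from its own error logs and measurements.

```python
# Hypothetical optional features, each with the load level (0.0 to 1.0)
# at which it gets switched off. Core services are not listed here
# because they are never shed.
OPTIONAL_FEATURES = [
    ("recommendations", 0.70),
    ("search_suggestions", 0.80),
    ("avatars", 0.90),
]


def enabled_features(load):
    """Return the optional features that stay on at this load level."""
    return [name for name, limit in OPTIONAL_FEATURES if load < limit]


def shed_message(load):
    """Return a notice for customers, or an empty string if nothing is off."""
    disabled = [name for name, limit in OPTIONAL_FEATURES if load >= limit]
    if not disabled:
        return ""
    return "Temporarily unavailable due to high demand: " + ", ".join(disabled)


print(enabled_features(0.85))
print(shed_message(0.85))
```

The least important features carry the lowest thresholds, so they go first, and the message tells the customer what is happening instead of leaving them guessing.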
The second response is where the flexibility of cloud computing shines. Starting new virtual servers on a cloud is easy and fast, and overloads can be diverted to them in minutes. The capability of cloud computing to absorb and quickly recover from failures also dramatically reduces the potential impact on your site, especially if that site is built to take advantage of the flexibility of cloud computing with our high reliability web site package.
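Deciding how many extra virtual servers to start comes down to simple arithmetic. The sketch below is one way to do it, with assumptions of its own: load and capacity are in the same units (say, requests per second), every server is identical, and you want each server to run at no more than 75% of its capacity. The function name and parameters are hypothetical, and actually starting the servers depends on your cloud provider's tools.

```python
import math


def extra_servers_needed(total_load, per_server_capacity, running,
                         target_utilization=0.75):
    """Return how many additional servers to start (never negative).

    target_utilization leaves headroom so the new servers are not
    themselves running at their limit the moment they come online.
    """
    usable_capacity = per_server_capacity * target_utilization
    required = math.ceil(total_load / usable_capacity)
    return max(0, required - running)


# 1200 requests/s, servers rated for 200 requests/s each, 4 already running.
print(extra_servers_needed(1200, 200, running=4))
```

Building in headroom matters here for the same reason as with the routers above: servers that start out at their limit have nothing left to absorb the next spike.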
Remember, keeping some of the lights on is better than leaving your customers in the dark!