Tag Archives: Google gmail failure

The virtual data center, dealing with a cloud overload.

This morning I was reading about gracefully degrading Web 2.0 applications in response to server overload issues. This is an important part of maintaining your customers’ user experience, but it also ties in to the ability of dedicated or cloud computing virtual servers to adapt to unforeseen conditions.

There are any number of unpredictable things that can negatively impact the experience of customers using your web site, web app, or other Internet based service. Servers can fail, network links can fail, demand can spike drastically due to totally unpredictable influences, and, if you’re not careful, small failures can cascade into large ones. The important thing is to make the service resilient enough to either continue to operate through the failure or demand spike or to be flexible enough to be rearranged to adapt to the situation.

In general, I believe that having things operating in a degraded state is a far better response than having things unavailable at all. Customers hate uncertainty and an unavailable service is the height of uncertainty. Is the service down? Is it coming back? Is it a problem on the customer end?

One method of dealing with this is the way Google handles the routers that feed their Gmail service. When a router becomes slower than a set benchmark, it sheds its load to the other routers and marks the path “do not use”. This is a dangerous response, since shedding the load to other facilities can result in a runaway failure cascade, taking down path after path at increasing speed until every path is offline from overload, as happened to Google in recent failures of its Gmail service. Now you not only have a broken service, you also have the nightmare of restoring it without triggering the same cascade all over again.

The first response to overload should be to shed non-critical loads. It’s better to have individual features unavailable (with an appropriate message to the customer so they know what’s going on) while protecting the operation of the core services. This can even be done automatically on the server, based on error logs and load measurements.
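As a rough illustration of shedding non-critical loads automatically, here is a minimal Python sketch. The feature names, the load thresholds, and the idea of a single load figure between 0.0 and 1.0 are all assumptions made up for the example; a real service would pick its own measurements and cutoffs.

```python
# Hypothetical feature table: (name, critical, shed_above_load).
# Names and thresholds are illustrative assumptions, not from any real service.
FEATURES = [
    ("send_mail", True, None),
    ("read_mail", True, None),
    ("contact_suggestions", False, 0.70),
    ("chat_widget", False, 0.80),
    ("search_indexing", False, 0.90),
]


def enabled_features(load):
    """Return the features to keep running at the given load (0.0 to 1.0+)."""
    keep = []
    for name, critical, shed_above in FEATURES:
        # Core features are never shed; extras are shed once load passes
        # their threshold.
        if critical or load <= shed_above:
            keep.append(name)
    return keep


def status_message(load):
    """Build the message that tells the customer what is switched off."""
    on = enabled_features(load)
    off = [name for name, _, _ in FEATURES if name not in on]
    if not off:
        return "All features available."
    return "Temporarily unavailable due to high load: " + ", ".join(off)
```

The cheapest extras go first and the core services are never touched, and the customer always gets a message saying what is off and why.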

The second response is where the flexibility of cloud computing shines. Starting new virtual servers on a cloud is easy and fast, and overloads can be diverted to them in minutes. The capability of cloud computing to absorb and quickly recover from failures also dramatically reduces the potential impact to your site, especially if that site is built to take advantage of the flexibility of cloud computing with our high reliability web site package.

Remember, keeping some of the lights on is better than leaving your customers in the dark!

Cloud computing services




The data center in review, top 10 bozos of the year, 2009!

My coined term from my data center analysis and commentary this year: bozosity. Bozosity is a condition brought on by an imbalance of invisible particles known as bozons. This condition causes otherwise competent and sensible people to do incomprehensibly boneheaded things.

The Winners of the 2009 Data Center Bozo Awards are:

1. Microsoft and Danger for the T-Mobile Sidekick data loss debacle. MicroDanger did not win for operating critical storage systems without backups, but for the handling of the aftermath. MicroDanger immediately announced all data was lost, and by the time they did recover most of the data, significant damage had been done to T-Mobile and the Sidekick, leaving everyone involved with a reputation for incompetence.

2. Fisher Plaza for knocking out major Internet services by blowing up an antiquated, obsolete, and improperly maintained electrical system in their data center building. Aluminum bus bars are evil, k?

3. IBM for blowing Air New Zealand out of the water by performing power work during the peak load period of one of Air New Zealand’s busiest travel weekends, unnecessarily running ANZ’s mainframe from a fallible generator alone, and taking an inordinate amount of time to restore service.

4. IBM for allowing a state of Texas elections commission storage system in their care to fail because it wasn’t in the contract to back it up.

5. Google for their brilliant example of cascading failure by sequentially overloading every router feeding their Gmail service.

6. Research in Motion for seeing how many BlackBerry back end failures they could squeeze in before the end of the year.

7. Amazon, Rackspace, Google, and a number of other providers who managed to blacken the term cloud computing with multiple reliability problems, most of which were self-inflicted. Thanks a heap, guys.

8. DreamHost for giving us a shining example of how NOT to do a major data center migration.

9. The people who operate Sweden’s top level DNS domain for turning loose an untested script and taking the entire thing down. Who knew a few missing dots could be so much trouble?

10. The Perth iX data center in Western Australia for allowing a smoldering mulch bed outside the building to shut down the entire data center because they couldn’t locate a minuscule amount of smoke that was infiltrating the building and setting off an overly sensitive detection system.

Finally, I’d like to add a “dishonorable mention” award to FedEx for turning overnight delivery of a critical part into 3 days and nearly sticking me with working in the data center overnight Christmas Eve.

Looks like we survived the year but it sure wasn’t pretty.

Vern, SwiftWater Telecom

Adventures in the data center: The cascade effect …

This weekend’s adventure in the data center came courtesy of an expensive (and now defunct) APC rackmount power strip. What was interesting about this was not the failure of the unit (although I’ve never seen a catastrophic power strip failure in my 20+ years of this) but the cascading effects.

What are cascading effects? Cascading effects are widespread problems that begin from a small starting point and snowball into a much larger problem. In this case, the APC failure caused a fair amount of equipment other than what was directly connected to it to fail (leading to a near meltdown of the BlackBerry).

Cascades can make it very difficult to diagnose the fault. Large amounts of “collateral damage” can lead to information overload (yes, there IS such a thing as too much info!). This can easily lead to things taking far too long to diagnose or an inaccurate result (which can be awful if a tech drives 3 hours with the wrong part).

The first key to preventing the dreaded cascade is segmentation. This provides natural “breaks” to block the spread of a cascade, limiting the size and effect of the failure. Remember, the smaller the scope, the more likely a fast and accurate diagnosis is. It may seem easier and faster to just tie everything into a flat, unified architecture, but that’s just opening the door.

The second key is to program equipment intelligently, if not to stop the cascade, at least not to contribute to it. A great example of how to do this wrong was one of the recent Google outages. As an overloaded router responded slower and slower, it recognized that it wasn’t performing to par and simply dumped its load on the rest of the routers in the pool. This caused more routers to overload, dumping their loads to the ever shrinking pool, until, finally, there weren’t any routers left at all to serve any traffic. Now you have the interesting conundrum of how to restore service, since any router you managed to clear and restore to service would almost immediately overload again!
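To make the runaway concrete, here is a small Python simulation of the router behavior described above. It is only a sketch under simple assumptions of my own: load is split evenly across the routers still in service, and the capacity and load figures are invented. It is not a model of Google's actual network.

```python
def simulate(total_load, capacities, shed_when_overloaded):
    """Return (routers_left_in_service, traffic_served) once things settle.

    capacities: capacity of each router; load is split evenly among live ones.
    shed_when_overloaded: True means an overloaded router takes itself out of
    service and dumps its share on the rest. False means it stays in service,
    serves what it can, and drops the excess.
    """
    live = list(capacities)
    while live and shed_when_overloaded:
        share = total_load / len(live)
        overloaded = [c for c in live if share > c]
        if not overloaded:
            break
        # One overloaded router bows out; the survivors inherit its load.
        live.remove(min(overloaded))
    if not live:
        return 0, 0
    share = total_load / len(live)
    served = sum(min(c, share) for c in live)
    return len(live), served


# Five routers of capacity 100 carry a load of 450 comfortably.
print(simulate(450, [100] * 5, shed_when_overloaded=True))
# Take one out for maintenance and let the rest shed when overloaded.
print(simulate(450, [100] * 4, shed_when_overloaded=True))
# Same four routers, but each keeps serving up to its own capacity.
print(simulate(450, [100] * 4, shed_when_overloaded=False))
```

With these numbers, the shedding policy ends with no routers in service and no traffic served, while the "stay up and drop the excess" policy keeps all four routers running and serves 400 of the 450 units. Degraded, but still on the air.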

The final key is not to overload with information about the failure. Identify dependencies and set your monitoring system up accordingly (device A is dependent on device B to work so don’t alarm on A if B is down). Remember, concise and accurate information gets the diagnosis and the fix the fastest!
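The dependency rule (don't alarm on A if B is down) can be sketched in a few lines of Python. The device names and the dependency map below are hypothetical, and real monitoring packages have their own ways of declaring parent/child relationships, so treat this as an illustration of the idea only.

```python
# Hypothetical dependency map: each device lists what it needs in order to work.
DEPENDS_ON = {
    "web1": ["switch1"],
    "web2": ["switch1"],
    "switch1": ["pdu1"],
    "pdu1": [],
}


def upstream(device):
    """Return every device that the given device depends on, directly or not."""
    seen = set()
    stack = list(DEPENDS_ON.get(device, []))
    while stack:
        d = stack.pop()
        if d not in seen:
            seen.add(d)
            stack.extend(DEPENDS_ON.get(d, []))
    return seen


def root_causes(down):
    """Return only the down devices that have nothing down upstream of them."""
    down = set(down)
    return sorted(d for d in down if not (upstream(d) & down))


# A failed power strip takes the switch and both servers with it,
# but only one alarm needs to reach the tech.
print(root_causes(["web1", "web2", "switch1", "pdu1"]))
```

Four devices are down, but the only alarm raised is for pdu1, which is the part the tech needs to bring.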

That’s a brief rundown on keeping the evil cascades out of your data center. Now I’m going to take this APC unit out in the parking lot and run over it with a truck.

Vern, SwiftWater Telecom
data center, web hosting, Internet engineering

Take careful aim at foot, FIRE!

Today I’ve been reading about Google’s Gmail outage. This outage was triggered by routine router maintenance that reduced capacity below demand and produced a cascading overload.

This type of self-inflicted catastrophe seems to be becoming far more common. In this post, I’ll be examining some of the failures that lead to this kind of result.

The first failure is the failure to anticipate the cascade failure and create cutouts and other controls to contain it. There’s no shame in a single failure; the shame is in allowing it to spread out of control. There’s no substitute for knowing your facility and having the proper tools to know exactly what’s happening.

The second failure is the failure to plan a proper maintenance. Of course it’s important to plan the maintenance so that things go right, but it’s even more important to plan for things going wrong. Every step should have a way for the maintainer to know if it had a negative effect and a way to immediately back the actions out, returning things to the previous state.
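Here is one way the "check after every step, back out on trouble" idea might look in Python. The step names, the router counts, and the health check are all invented for the example; the point is only the shape of the procedure, not a real maintenance tool.

```python
def run_maintenance(steps, healthy):
    """Apply steps in order; on the first bad health check, undo everything.

    steps: list of (name, apply, undo) where apply and undo are callables.
    healthy: callable returning True while the service still looks OK.
    Returns (names_of_steps_applied, rolled_back).
    """
    done = []  # (name, undo) for every step applied so far
    for name, apply, undo in steps:
        apply()
        done.append((name, undo))
        if not healthy():
            # Back out in reverse order to return to the previous state.
            for _, undo_step in reversed(done):
                undo_step()
            return [n for n, _ in done], True
    return [n for n, _ in done], False


# Hypothetical scenario: 6 routers installed, demand needs 5 of them.
state = {"routers_in_service": 6}
DEMAND_IN_ROUTERS = 5


def take_one_out():
    state["routers_in_service"] -= 1


def put_one_back():
    state["routers_in_service"] += 1


def healthy():
    return state["routers_in_service"] >= DEMAND_IN_ROUTERS


steps = [
    ("drain router A", take_one_out, put_one_back),
    ("drain router B", take_one_out, put_one_back),
]
applied, rolled_back = run_maintenance(steps, healthy)
```

Draining the first router is fine, draining the second drops capacity below demand, and the whole job is backed out so all 6 routers are in service again.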

Even more important is a plan of last resort, an “all hands on deck” emergency response to a situation that can’t simply be reversed out.

The basic failures in the Google outage really leave me shaking my head. If you have n+1 (enough facility to support requirements plus 1 extra) redundancy, reducing that to n is a risky proposition. Reducing to n-1 or lower is a recipe for absolute disaster.
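The n+1 arithmetic is simple enough to check before anyone touches a router. A minimal Python sketch, with made-up demand and capacity figures:

```python
import math


def units_required(demand, unit_capacity):
    """n: the number of units needed to carry the demand."""
    return math.ceil(demand / unit_capacity)


def redundancy_margin(installed, out_of_service, demand, unit_capacity):
    """Spare units beyond n: +1 is n+1, 0 is n, negative is under capacity."""
    n = units_required(demand, unit_capacity)
    return installed - out_of_service - n


# Hypothetical figures: demand of 450, routers of capacity 100, 6 installed.
for out in range(3):
    print(out, "out of service -> margin", redundancy_margin(6, out, 450, 100))
```

With these figures n is 5, so 6 installed is n+1. Pulling one router for maintenance leaves a margin of 0 (no room for a single failure), and pulling two leaves a margin of -1, which is the overload that starts the cascade.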

So what is the takeaway from these failures? Plan for the possibility of failure and how to mitigate it BEFORE it impacts your customers.

There are plenty of competitors waiting to take your business down. Don’t help them by shooting your own feet!

Vern, SwiftWater Telecom
Data Center, Telecom, Engineering Services