Today's post comes from reading an article about cloud computing and data center efficiency. I'm going to talk about one of the issues it touched on: reliability.

The cloud implementations of some of the early major players have had large and public failures, in some cases repeatedly. The distributed nature of the data center cloud lends itself to a number of fairly simple steps to ensure reliability.

So, how do we help prevent data center cloud burps? The first thing is to segment the infrastructure the cloud depends on. Split power feeds between fairly small segments of the cloud to reduce the impact of power problems.

Segment the cloud network and provide redundant network paths, especially in the storage network. The storage network is the heart of the cloud, and it has to be as rock-solid and fault-tolerant as possible.

Cooling is another area to segment. Is there a pattern showing here? The secret to data center cloud reliability is to segment so that any infrastructure failure impacts as small a part of the cloud as possible. Keeping the impact small makes it quicker to restore the cloud to full operation after a failure. Operating degraded is also far better than not operating at all!
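To put a rough number on the segmenting idea, here is a minimal Python sketch. It assumes a made-up cloud of 40 servers spread evenly across power feeds; the server and feed names and the round-robin assignment are purely illustrative, not anyone's real layout. It just shows how the worst-case share of the cloud lost to a single feed failure shrinks as the segments get smaller.

```python
from collections import defaultdict


def assign_round_robin(num_servers, num_segments):
    """Hypothetical layout: spread servers evenly across power feeds."""
    return {f"server-{i}": f"feed-{i % num_segments}" for i in range(num_servers)}


def blast_radius(server_to_segment):
    """For each segment, the fraction of all servers lost if that segment fails."""
    counts = defaultdict(int)
    for segment in server_to_segment.values():
        counts[segment] += 1
    total = len(server_to_segment)
    return {segment: n / total for segment, n in counts.items()}


# Same 40 servers, first on 2 big feeds, then on 10 small ones.
worst_two = max(blast_radius(assign_round_robin(40, 2)).values())
worst_ten = max(blast_radius(assign_round_robin(40, 10)).values())

print(f"2 feeds:  one feed failure takes out {worst_two:.0%} of the cloud")
print(f"10 feeds: one feed failure takes out {worst_ten:.0%} of the cloud")
```

The same arithmetic applies to cooling zones or network segments: swap the feed names for whatever infrastructure you are splitting up.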

The other advantage of segmenting is to reduce the possibility of cascading failures. Since the cloud is so interconnected, any failure that ripples through the cloud would be catastrophic. Now you've gone from an almost unnoticeable impact to a major impact and an extensive restore time.
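The ripple effect can be sketched the same way. This is only a toy model: the component names (pdu-1, storage-a, rack-1 and so on) and the dependency maps are invented for illustration. A breadth-first walk finds everything that transitively depends on a failed component, once for a layout where a single PDU feeds both storage networks and once for a layout where each segment has its own PDU.

```python
from collections import deque


def affected_by(failed, depends_on_me):
    """Return the failed component plus everything that transitively depends on it."""
    seen = {failed}
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for dependent in depends_on_me.get(node, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen


# Hypothetical layouts: each key maps a component to the things it feeds.
unsegmented = {
    "pdu-1": ["storage-a", "storage-b"],  # one PDU behind both storage networks
    "storage-a": ["rack-1", "rack-2"],
    "storage-b": ["rack-3", "rack-4"],
}
segmented = {
    "pdu-1": ["storage-a"],  # each segment has its own PDU
    "pdu-2": ["storage-b"],
    "storage-a": ["rack-1", "rack-2"],
    "storage-b": ["rack-3", "rack-4"],
}

hit_unsegmented = affected_by("pdu-1", unsegmented)
hit_segmented = affected_by("pdu-1", segmented)

print(len(hit_unsegmented), "components down without segmentation")
print(len(hit_segmented), "components down with segmentation")
```

In the segmented layout the failure stops at the edge of its own segment, so racks 3 and 4 never notice.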

The nature of the data center cloud not only increases energy efficiency but also "should" benefit reliability. Individual failures will still happen, but the overall effect on the cloud will be limited.

So, why the high-profile cloud failures? A lack of storage redundancy, designs that invite cascading failures (multiple Google outages), and power without segmentation, which invites human error (short a PDU and half the data center chokes).

Sometimes, the best even the big boys can do is serve as a bad example.

