Tag Archives: failure cascade

Thursday data center tidbits: Chaos at Facebook (again)


For more than 5 hours now, developers have been unable to make changes to the Facebook Integration settings for their apps. Attempts to change these settings return:

Sorry, something went wrong.

We’re working on getting this fixed as soon as we can.

This failure doesn’t seem to be affecting apps that are currently running, but it has brought a fair amount of app development to a complete stop.

This failure comes close behind the recent major Facebook outage caused by a runaway software integrity checker.

Vern

SwiftWater Telecom

Balance the data center with green DC power.


This morning I was reading about preventing cascading failures in the data center via power load balancing. I’ve written previously about not allowing data center failures to get out of control and cascade into a much larger event; in this post I’ll talk a little about the importance of power balancing.

At its simplest, power load balancing is just what it says. Power loads should be balanced at distribution points such as breakers and balanced across redundant feeds to equipment. This ensures that a configuration that works fine under normal conditions doesn’t overload and trip under failover conditions. Beyond the obvious goal of making sure the power configuration will work reliably under all load conditions, there are power quality issues as well.

Most industrial/commercial locations are supplied by 3 phase AC power in either a delta or wye configuration: 3 hot phases, plus 1 neutral (wye only). Single phase equipment is supplied from either 1 hot phase and the neutral or 2 hot phases. The issue with this type of power is that a load imbalance of more than 10% between the hot phases can cause significant misbehavior of the transformer, not to mention efficiency problems. We just had a landlord correct a 3 phase load balancing issue at one of our facilities, where 95% of the distribution load was on 1 phase of the breaker. This imbalance caused what should have been 120VAC to drop as low as 100VAC.
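
To put a number on “imbalance”, here’s a minimal sketch of the arithmetic; the percent-deviation-from-average formula and the load figures are illustrative assumptions, not measurements from the incident above.

```python
# Minimal sketch: check 3 phase load balance against a rough 10% threshold.
# Load figures are illustrative, not measurements from the incident above.

def phase_imbalance(loads_amps):
    """Percent deviation of the worst phase from the average phase load."""
    average = sum(loads_amps) / len(loads_amps)
    worst = max(abs(load - average) for load in loads_amps)
    return 100.0 * worst / average

# A badly balanced panel: nearly all of the load on one phase.
panel = [95.0, 3.0, 2.0]  # amps on phases A, B, C

imbalance = phase_imbalance(panel)
print(f"Imbalance: {imbalance:.1f}%")
if imbalance > 10.0:
    print("Over 10% -- expect transformer misbehavior and voltage sag.")
```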

So where does data center DC power enter into this? A modern modular DC power plant collapses the 3 phase power into one DC power feed. Since the rectifier modules operate in parallel, the load on each AC circuit is divided evenly between them. Compare this to powering equipment directly from AC, where the load of each piece of equipment is constantly changing, meaning you could never balance this type of load with any precision. With a DC power plant, as long as you run modules in multiples of 3, you’re guaranteed to maintain perfect load balance.
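
A quick way to see why the multiples-of-3 rule works is to model the paralleled rectifier modules as equal shares of the DC load, assigned round-robin to the phases. The 30kW load and 95% rectifier efficiency below are assumptions for illustration, not specs from any particular plant.

```python
# Sketch of why rectifier modules in multiples of 3 keep the AC phases balanced.
# The DC load and the 95% efficiency figure are illustrative assumptions.

def per_phase_load(dc_load_kw, module_count, efficiency=0.95):
    """Split a DC load across paralleled rectifier modules assigned
    round-robin to phases A, B, C; return the AC kW drawn per phase."""
    per_module_ac = (dc_load_kw / module_count) / efficiency
    phases = {"A": 0.0, "B": 0.0, "C": 0.0}
    for i in range(module_count):
        phases["ABC"[i % 3]] += per_module_ac
    return phases

print(per_phase_load(30.0, 6))  # multiple of 3: identical load on A, B and C
print(per_phase_load(30.0, 4))  # not a multiple of 3: phase A carries extra
```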

Efficient in operation, efficient in its use of your AC power capacity, and making the power balancing act look easy, DC power is the choice for the green data center.

Vern

Wednesday data center tidbits.


First up this morning is the story about Bell Labs and Green Touch committing to reduce the power consumption of data networks by 1000x. I’m not sure that kind of gain is achievable, but I’m fairly certain this is low on the bang for the buck scale.

Second up is the revelation that Google is seriously considering walking away from China over rampant cyber attacks on its data centers. This surprises me not in the least. 99% of the attacks on my servers here are from China, with the remainder coming from Korea and Eastern Europe.

Finally we have an article on preventing cascading failures in the data center. Balancing power loads is absolutely critical to data center reliability, especially when 3 phase AC power is involved. Imbalance on 3 phase can cause some severe power quality issues. After having multiple breaker trips at one of our locations (no downtime thanks to an excellent DC power plant in the data center!), the landlord discovered that 95% of the load was being carried by only 1 phase of the 3 phase feed on that breaker.

Vern

The virtual data center, securing the cloud.


Tonight I’ve been reading an analysis of what business functionality shouldn’t be moved into the data center cloud. Hugging servers in your own data center can be comforting, but it’s far from a guarantee that you’re not going to have a security breach or a storage failure wipe out critical data.

In recent months, high profile rookie mistakes by Microsoft and its subsidiaries and a general lack of operational competence by Google have managed to blacken the reputation of everyone involved in cloud computing. The Microsoft and Danger fiasco of losing T-Mobile Sidekick users’ data was the result of poor facilities design, not anything inherently wrong with cloud computing itself (who would have thought that you should back up critical storage?). The same result would follow from implementing a classic data center design without any storage backups.

Google’s high profile flops generally seem to involve human errors and improperly planned maintenance operations creating cascading failures (who would have thought reducing the number of routers below the minimum required capacity would blow the whole thing up?). Once again, this can happen in any data center environment, given the same bozo behavior.

So, how do you ensure the security and reliability of the cloud? Regarding security, I’m not aware of any security incidents that were the result of or exacerbated by the presence of cloud facilities. OS bugs in cloud hosted virtual machines and user system administration mistakes are a much greater threat to the security of user data, exactly as they are for traditional dedicated data center servers.

For reliability, the answer isn’t hard. Good power quality, multiple power sources for multiple portions of the cloud (a single power failure should NEVER take the entire cloud down!), monitoring of all cloud hosts, failover so that the impact of cloud host failures on the user is minimal to zip, and a solid back end storage system with appropriate redundancy and off site backups. Prevent human error failures by careful planning of maintenance operations (avoid the failure, recognize the failure quickly, back out of the action that caused the failure quickly). These are all things we do with our Aurora Resilient Cloud Service (just released today!).
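
As a sketch of the monitoring and failover piece of that recipe, here’s a toy liveness check that picks a healthy host from a pool. The host names, the bare TCP-connect check, and the “place guests” step are stand-ins for illustration; this is not a description of the Aurora implementation.

```python
# Toy health check / failover sketch. Host names and the TCP-connect test are
# placeholders for illustration, not any particular product's implementation.
import socket

HOSTS = ["cloud-a.example.net", "cloud-b.example.net", "cloud-c.example.net"]

def host_is_up(host, port=22, timeout=3):
    """Crude liveness check: can we open a TCP connection to the host?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def pick_healthy_host(hosts):
    """Return the first host that answers, or None if the whole pool is down."""
    for host in hosts:
        if host_is_up(host):
            return host
    return None

target = pick_healthy_host(HOSTS)
if target is None:
    print("Entire pool unreachable -- page a human.")
else:
    print(f"Placing (or migrating) guests onto {target}")
```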

Making the cloud solid and reliable isn’t some kind of new and mysterious black magic, it’s mostly the same tried and true recipe for reliability that’s been used for years. Just because the cloud is new doesn’t mean we have to do it badly while we reinvent the wheel.

Vern, SwiftWater Telecom

data center cloud services

Wednesday morning data center bits …


First up today is the debate about the smart power grid using IPv6. Is this really an issue? I don’t see any issue either way, except for the saving of tens of millions of scarce IPv4 addresses. Talk about overanalyzing things to death.

Rackspace says that they have “work to do” on their DFW data center that recently experienced another in a string of power failures. I think they need an exorcism.

Open question and answer session for data center, green data center, networking, or what have you. Drop by and visit!

Vern, SwiftWater Telecom

Data Center Follies, Microsoft and T-mobile keep on floundering


This morning I was reading about how some T-Mobile Sidekick customers could recover their data. I think Microsoft has the wrong definition of the term “backup”.

Microsoft announced that, contrary to rumors, they DID have backups of the T-Mobile Sidekick data; however, the failure disrupted the backup database as well as the primary database. Now, I’m making the assumption here that the backup database was not stored on the same physical hardware as the primary (that would be phenomenally stupid). So, what aspects of the term “backup” did Microsoft miss here?

Any “backup”, whether it’s server functionality, data, or data center infrastructure (power, redundant Internet connections, etc.), shares the same characteristic: a “backup” MUST be isolated from failure of the primary. The function of the backup is to provide disaster recovery and/or business continuity. If you allow the same failure to take out both primary and backup facilities, the backup isn’t just worthless, it’s actually damaging.

A poorly thought out backup facility is damaging because it provides a false sense of security. Anything else built on top of that flawed foundation is now at risk. Suddenly, what should have been a localized hit becomes a cascading catastrophe. Not the kind of scenario that should let any data center operator or system admin sleep well at night.

Fortunately, the answer is simple. Ensure that your primary and backup facilities are not only logically separated but physically separated as well. Make sure that there’s no way a failure of one can damage the other.
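
One way to make that check routine rather than a one-time design review is to record failure domains in your inventory and compare them. The inventory fields and names below are invented for illustration.

```python
# Sketch: flag a primary and backup that share a failure domain.
# The inventory entries below are invented for illustration.

INVENTORY = {
    "db-primary": {"site": "portland", "rack": "r12", "power_feed": "feed-a"},
    "db-backup":  {"site": "portland", "rack": "r12", "power_feed": "feed-a"},  # bad!
}

def shared_failure_domains(a, b):
    """Return the attributes (site, rack, power feed, ...) the two systems share."""
    return [key for key in a if a[key] == b.get(key)]

shared = shared_failure_domains(INVENTORY["db-primary"], INVENTORY["db-backup"])
if shared:
    print(f"WARNING: primary and backup share: {', '.join(shared)}")
else:
    print("Primary and backup are in separate failure domains.")
```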

Then get some good sleep.

Vern, SwiftWater Telecom

server co-location and data center services
data center facilities engineering

Adventures in the data center: The cascade effect …


This weekend’s adventure in the data center came courtesy of an expensive (and now defunct) APC rackmount power strip. What was interesting about this was not the failure of the unit (although I’ve never seen a catastrophic power strip failure in my 20+ years of this) but the cascading effects.

What are cascading effects? Cascading effects are widespread problems that begin from a small starting point and snowball into a much larger problem. In this case, the APC failure caused a fair amount of equipment other than what was directly connected to it to fail (leading to a near meltdown of the BlackBerry).

Cascades can make it very difficult to diagnose the fault. Large amounts of “collateral damage” can lead to information overload (yes, there IS such a thing as too much info!). This can easily lead to things taking far too long to diagnose or an inaccurate result (which can be awful if a tech drives 3 hours with the wrong part).

The first key to preventing the dreaded cascade is segmentation. This provides natural “breaks” that block the spread of a cascade, limiting the size and effect of the failure. Remember, the smaller the scope, the more likely a fast and accurate diagnosis is. It may seem easier and faster to just tie everything into a flat, unified architecture, but that’s just opening the door to a cascade.

The second key is to program equipment intelligently, if not to stop the cascade, at least not to contribute to it. A great example of how to do this wrong was one of the recent Google outages. As an overloaded router responded slower and slower, it recognized that it wasn’t performing to par and simply dumped its load on the rest of the routers in the pool. This caused more routers to overload, dumping their loads to the ever shrinking pool, until, finally, there weren’t any routers left at all to serve any traffic. Now you have the interesting conundrum of how to restore service, since any router you managed to clear and restore to service would almost immediately overload again!
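
To see how quickly that kind of load dumping empties a pool, here’s a toy simulation. The capacities and traffic figures are invented, and this is obviously not Google’s actual router logic.

```python
# Toy simulation of the "dump your load on the survivors" cascade described above.
# Capacities and traffic figures are invented; this is not Google's router logic.

def simulate_cascade(router_count, capacity_per_router, total_traffic):
    """Routers that exceed capacity drop out and push their share onto the rest."""
    active = router_count
    while active > 0:
        load_each = total_traffic / active
        if load_each <= capacity_per_router:
            print(f"{active} routers remain at {load_each:.0f} units each -- stable.")
            return
        print(f"{active} routers at {load_each:.0f} units each -- overloaded, one drops out.")
        active -= 1
    print("No routers left: total outage, and any router restored will overload again.")

# A pool sized for 5 routers of 100 units each carrying 450 units of traffic;
# pulling one router for maintenance leaves the survivors below minimum capacity.
simulate_cascade(4, 100, 450)
```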

The final key is not to overload yourself with information about the failure. Identify dependencies and set your monitoring system up accordingly (device A depends on device B to work, so don’t alarm on A if B is down). Remember, concise and accurate information gets the diagnosis and the fix the fastest!
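
Here’s a minimal sketch of that kind of dependency-aware suppression; the device names and the flat parent map are made up for illustration.

```python
# Sketch of dependency-aware alarming: don't alarm on device A when the device
# it depends on (B) is already down. Device names are made up for illustration.

DEPENDS_ON = {
    "web-1": "switch-1",   # web-1 is only reachable through switch-1
    "web-2": "switch-1",
    "switch-1": None,      # no upstream dependency we monitor
}

def alarms_to_raise(down_devices):
    """Alarm only on down devices whose upstream dependency is still up."""
    alarms = []
    for device in down_devices:
        parent = DEPENDS_ON.get(device)
        if parent in down_devices:
            continue  # suppressed: the root cause is upstream
        alarms.append(device)
    return alarms

# switch-1 failed and took web-1 and web-2 with it: only switch-1 should alarm.
print(alarms_to_raise({"switch-1", "web-1", "web-2"}))
```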

That’s a brief rundown on keeping the evil cascades out of your data center. Now I’m going to take this APC unit out in the parking lot and run over it with a truck.

Vern, SwiftWater Telecom
data center, web hosting, Internet engineering