Tag Archives: Rackspace

Monday data center tidbits.


First up today is the story about cloud provider data center revenues. Rackspace is 100% correct to measure this as revenue per server. In terms of the business bottom line, resource revenue vs resource cost is the meaningful number. ARPU is useful for telling you whether you're diversified enough across your customer base (too much revenue coming from too few customers) or for comparing the cost of acquiring a customer to the revenue from that customer, but it doesn't tell you the actual economics of running the service.
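To make the distinction concrete, here's a toy calculation (every figure is a made-up assumption, not anyone's real numbers): ARPU can look perfectly healthy while the per-server margin, which is what actually decides whether the service pays for itself, stays thin.

```python
# Toy comparison of the two metrics; all figures are made-up assumptions.
monthly_revenue = 500_000.0   # assumed total monthly revenue
servers         = 2_000       # assumed servers in service
customers       = 400         # assumed customer count
cost_per_server = 180.0       # assumed all-in monthly cost per server

revenue_per_server = monthly_revenue / servers               # 250.00
arpu               = monthly_revenue / customers             # 1,250.00
margin_per_server  = revenue_per_server - cost_per_server    # 70.00

print(f"ARPU: ${arpu:,.2f}  revenue/server: ${revenue_per_server:,.2f}  "
      f"margin/server: ${margin_per_server:,.2f}")
```

An ARPU of $1,250 sounds great; the $70 per-server margin is the number that actually tells you whether the lights stay on.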

Second is the story about recycling former data center dedicated servers into a cloud computing system. As long as the hardware is adequate to serve the purpose, this is a great idea. Reliability is always a concern, but a well maintained server with a reasonable amount of runtime that isn’t totally ancient is going to be much more reliable than a brand new server (most hardware failures occur soon after a server is put into service).

Last up is the story that the Google hack has been tied to China, though not conclusively. I'll just note here that 90% of the brute force SSH attacks I see on my servers come from Chinese IP addresses, with most of the rest coming from South Korea.

Call or email me or visit the SwiftWater Telecom web site for green data center and cloud computing services minus the hype.

Vern


Monday data center tidbits.


First up, Rackspace blows up a UPS in London and puts hundreds of servers off the air for hours. Consequences were apparently severe enough to require server power supply replacement and network equipment reconfiguration. Sounds like whatever bug they have that keeps blowing up the power in DFW is migrating. Maybe this is one UPS they SHOULD have been doing maintenance on?

Watch for my post this week on effectively monitoring and controlling the cloud!

High reliability cloud powered virtual services available!

Vern


The green data center, no trade-off required for reliability.


Today I was reading about the “trade off between reliability and efficiency” in the data center. I think it’s far from the truth that you have to give up one for the other.

Part of the problem behind this kind of misconception is obsolete classification systems, such as the Uptime Institute's tiers (I've written before about my problems with that particular classification system). In the example given in the article, the data center operator in question had to maintain a 1 to 1 hot standby for every operating server to achieve that particular tier rating, as if reliability couldn't be achieved by anything less than an exact duplicate of every piece of gear in the data center. Of course, the 2N approach ignores one possibility: what if you waste all that money and energy running 1 to 1 hot standbys, the primary fails, and then the secondary immediately fails anyway?
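Here's a back-of-the-envelope way to see why exact duplication isn't the only road to reliability. All the failure probabilities below are assumptions for illustration, and the model (independent failures within one repair window) is deliberately simple:

```python
from math import comb

def p_capacity_shortfall(n_active, n_spare, p_fail):
    """Probability that more servers fail at once than the shared spares
    can absorb (independent failures assumed -- an illustration, not a
    claim about any real facility)."""
    total = n_active + n_spare
    return sum(comb(total, k) * p_fail**k * (1 - p_fail)**(total - k)
               for k in range(n_spare + 1, total + 1))

p = 0.01   # assumed per-server failure probability per repair window
n = 10     # active servers needed to carry the load

p_2n = 1 - (1 - p**2)**n              # 2N: dedicated standby per server (20 servers)
p_n1 = p_capacity_shortfall(n, 1, p)  # N+1: one shared spare (11 servers)
p_n2 = p_capacity_shortfall(n, 2, p)  # N+2: two shared spares (12 servers)

print(f"2N  (20 servers): {p_2n:.1e}")
print(f"N+1 (11 servers): {p_n1:.1e}")
print(f"N+2 (12 servers): {p_n2:.1e}")
```

With these assumed numbers, N+2 at 12 servers actually edges out 2N at 20: what matters is how much spare capacity you can call on, not whether every box has its own idle twin.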

Of course, the Uptime Institute’s response to this is to announce yet ANOTHER data center efficiency metric.

This also spotlights the weakness of using PUE as anything but an internal engineering aid. It sounds great that you have a 1.8 PUE, but since PUE makes no reference to the amount of work being accomplished, you can still be wasting half the energy consumed on equipment that produces no useful work. The cost of operating this way, plus the likely upcoming carbon penalties, will melt your credit card.
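A quick illustration of why (assumed numbers, not measurements from any facility): a site can post a respectable 1.8 PUE while half of its IT load is idle standby gear, and the energy spent per unit of useful work tells a very different story.

```python
# PUE = total facility energy / IT equipment energy.  It says nothing about
# how much of that IT energy produces useful work.  All figures below are
# illustrative assumptions.
it_energy_kwh = 1000.0                 # assumed IT load over some period
pue           = 1.8
total_energy  = it_energy_kwh * pue    # 1800 kWh drawn by the whole facility

utilization   = 0.5                    # assume half the IT load is idle standbys
useful_it_kwh = it_energy_kwh * utilization

# Facility energy burned per kWh of IT energy doing useful work:
print(total_energy / useful_it_kwh)    # 3.6 -- the overhead that PUE hides
```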

So, how do you combine green data center techniques with high reliability? Here's my recipe for it:

Add a green modular DC power plant with enough modules to provide n+1 (or 2 or 3) capacity. Split AC power feeds for the modules between 2 or more AC sources.

Add 2 parallel battery strings.

Add in 1 cloud computing cluster, such as the excellent Xen Cloud Platform we use. Provision enough cloud hosts for n+1.

Feed the cloud hosts from multiple DC power feeds.

Add in redundant storage servers.

Add in a load balancing system capable of automatically restarting virtual machines if a cloud host fails.

Add in good site design practices.

(note, this is exactly the way we run)

The fact is that nothing short of a smoking hole disaster is likely to interrupt the service provided by this configuration for longer than the time required to automatically restart a virtual machine. If the time to restart is an issue, specify a backup virtual machine on a different host in the cloud. Protect against the possible data center wide disaster with a duplicate configuration on a cloud in a second data center.
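The restart piece of that recipe boils down to a simple control loop: watch the hosts, and when one drops, bring its virtual machines back up on a surviving host. Here's a minimal sketch of that logic; the Host data model and the start_vm callback are placeholders assumed for illustration, not the Xen Cloud Platform API.

```python
from dataclasses import dataclass, field

@dataclass
class Host:
    name: str
    up: bool = True        # host responding to health checks?
    load: float = 0.0      # some measure of current load
    vms: list = field(default_factory=list)

def restart_vms_from_failed_hosts(hosts, start_vm):
    """Restart the VMs of any down host on the least-loaded surviving host.

    start_vm(vm, host) stands in for whatever call your platform uses to
    boot a VM on a specific host."""
    live = [h for h in hosts if h.up]
    for host in hosts:
        if host.up or not live:
            continue
        for vm in list(host.vms):
            target = min(live, key=lambda h: h.load)
            start_vm(vm, target)
            host.vms.remove(vm)
            target.vms.append(vm)

# Example: host "b" fails and its VM comes back up on host "a".
pool = [Host("a", load=0.3), Host("b", up=False, vms=["vm-42"])]
restart_vms_from_failed_hosts(pool, lambda vm, h: print(f"starting {vm} on {h.name}"))
```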

It’s really not that hard to achieve high reliability with green data center techniques (even if Microsoft, Amazon, Google, and Rackspace make it look like it is). Deep six the antiquated thinking and your wallet and the planet will thank you for it.

High reliability services available!

Vern


Thursday data center additional, Rackspace slows to a crawl.


Just reading about major performance-impacting storage load issues on the Rackspace cloud in their DFW data center that have been dragging on (what is it with DFW anyways?).

This isn’t rocket science guys, find the host with the most traffic to the SAN, find the virtual on that host with the most traffic to the SAN, and throttle it or kill it. Repeat as required. This is a great example of what happens when you don’t build traffic monitoring into your cloud.
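For what it's worth, that triage is a few lines of code once the traffic data exists. Here's a toy sketch with made-up sample rates; the data layout is an assumption, and real numbers would come from whatever monitoring the cloud exposes.

```python
from collections import defaultdict

samples = [
    # (host, virtual machine, bytes/sec to the SAN) -- made-up figures
    ("host-1", "vm-a", 40_000_000),
    ("host-1", "vm-b",  5_000_000),
    ("host-2", "vm-c", 12_000_000),
]

# Find the host pushing the most traffic to the SAN...
per_host = defaultdict(int)
for host, vm, rate in samples:
    per_host[host] += rate
busiest_host = max(per_host, key=per_host.get)

# ...then the busiest virtual on that host.
busiest_vm, rate = max(
    ((vm, r) for h, vm, r in samples if h == busiest_host),
    key=lambda t: t[1],
)
print(f"throttle or kill {busiest_vm} on {busiest_host} ({rate:,} B/s to the SAN)")
```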

Vern


The data center in review, top 10 bozos of the year, 2009!


My coined term from my data center analysis and commentary this year: bozosity. Bozosity is a condition brought on by an imbalance of invisible particles known as bozons. This condition causes otherwise competent and sensible people to do incomprehensibly boneheaded things.

The Winners of the 2009 Data Center Bozo Awards are:

1. Microsoft and Danger for the T-Mobile Sidekick data loss debacle. MicroDanger did not win for operating critical storage systems without backups, but for their handling of the aftermath: they immediately announced that all data was lost, and by the time they did recover most of it, significant damage had been done to T-Mobile and the Sidekick, leaving everyone involved with a reputation for incompetence.

2. Fisher Plaza for knocking out major Internet services by blowing up an antiquated, obsolete, and improperly maintained electrical system in their data center building. Aluminum bus bars are evil, k?

3. IBM for blowing Air New Zealand out of the water by performing power work during the peak load period of one of Air New Zealand's busiest travel weekends, unnecessarily running ANZ's mainframe from a fallible generator alone, and taking an inordinate amount of time to restore service.

4. IBM for allowing a state of Texas elections commission storage system in their care to fail because it wasn’t in the contract to back it up.

5. Google for their brilliant example of cascading failure by sequentially overloading every router feeding their Gmail service.

6. Research in Motion for seeing how many BlackBerry back end failures they could squeeze in before the end of the year.

7. Amazon, Rackspace, Google, and a number of other providers who managed to blacken the term cloud computing by multiple reliability problems, most of which were self inflicted. Thanks a heap guys.

8. DreamHost for giving us a shining example of how NOT to do a major data center migration.

9. The people who operate Sweden's top level DNS domain for turning loose an untested script and taking the entire thing down. Who knew a few missing dots could be so much trouble?

10. The Perth iX data center in Western Australia for allowing a smoldering mulch bed outside the building to shut down the entire data center because they couldn't locate a minuscule amount of smoke that was infiltrating the building and setting off an overly sensitive detection system.

Finally, I’d like to add a “dishonorable mention” award to FedEx for turning overnight delivery of a critical part into 3 days and nearly sticking me with working in the data center overnight Christmas Eve.

Looks like we survived the year but it sure wasn’t pretty.

Vern, SwiftWater Telecom

The reliable data center, maintenance gone wild.


I've just been reading about how the recent Rackspace data center network failure supposedly had limited impact, along with more of the details on the incident. Just another botched maintenance operation: poorly planned, poorly executed, and poorly responded to.

It seems like at least half of the major data center and cloud outages this year have been the result of error during maintenance procedures or unnecessarily risky maintenance operations. So, what went wrong and how do you avoid becoming the next hot Twitter topic (in a bad way)?

1. They loaded an obviously inadequately tested configuration on a critical router. Test, test, and test again! The more critical the system is, the more pain from failure, the more testing needs to be done!

2. They messed with a critical router during peak load hours rather than a normal late night maintenance window. It’s nice that their facility integration was going to take place during a late night maintenance window, but how do you consider making a functional change to a critical router “preparation” that’s suitable for a high load period? If the worst happens, do NOT aggravate your customers by blowing things out of the water during the time they need it the most.

3. The length of the outage would imply that they either had no way to know the change had caused a problem in the network, or they had no recovery plan, or they chose to try to reengineer on the fly, or all of the above. Test for proper operation after every step of a maintenance plan and have an escape route to back out of changes if something unforeseen happens. It's a bad idea to be messing around trying to figure out what went wrong while you have customers out of service; anything that affects customers for more than a few minutes is bad.

Plan as well as possible, recognize when things go wrong, back out promptly, and schedule to minimize potential customer impact. These are the not-so-secrets to pain free maintenance; a bare-bones sketch of the test-and-back-out idea follows below.
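A maintenance plan is really just a list of steps, each with its own verification and its own escape route. Here's a minimal sketch of that structure; the apply/verify/rollback callables are placeholders for your own procedures, not anyone's actual process.

```python
def run_maintenance(steps):
    """steps: list of (apply_change, verify, rollback) callables."""
    completed = []                       # rollbacks for steps already applied
    for apply_change, verify, rollback in steps:
        apply_change()
        if verify():                     # test for proper operation after every step
            completed.append(rollback)
            continue
        rollback()                       # back out the step that failed...
        for undo in reversed(completed):
            undo()                       # ...and everything before it, newest first
        return False                     # bail out; don't reengineer on the fly
    return True
```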

2009 seems to be the year for bad examples.

Vern, SwiftWater Telecom


Wednesday morning data center bits …


First up today is the debate about the smart power grid using IPv6. Is this really an issue? I don't see any issue either way, except that IPv6 would save tens of millions of scarce IPv4 addresses. Talk about overanalyzing things to death.

Rackspace says that they have “work to do” on their DFW data center that recently experienced another in a string of power failures. I think they need an exorcism.

Open question and answer session for data center, green data center, networking, or what have you. Drop by and visit!

Vern, SwiftWater Telecom