Well, here we go again. Apparently, RIM has managed to barf the BlackBerry Internet Service yet again (disclaimer: I’m writing this from my Curve). I’m fortunate enough to be using the BOLT web browser and Twibble, so the only real things affecting me are the loss of email and Yahoo Messenger.
This most recent high-profile outage leaves me wondering about the state of data center operations today. In most businesses, it used to be that if you caused a major outage impacting customers, there would be severe consequences.
Data center operators seem to have acquired a much larger tolerance for pain. While some data center outages have been the result of simple infrastructure failure, most of the ones this year have been the result of human error: deployment of inadequately tested software, risky and unnecessary manual testing, poorly planned maintenance operations with no way to detect problems or back out changes, operating storage systems without backups, and a general disregard for good system admin and engineering procedures. Couple this with lousy problem response (the BlackBerry problem has gone on for hours) and it’s a really toxic combination.
I don’t know why data center operators have abandoned good admin, operations, and engineering practices. I do know that whatever these companies think they’re gaining from this, it’s not enough for me to run my data center that way. I’m not tolerant enough of pain.
I don’t think I’ll be giving up my BlackBerry anytime soon, but I will be making sure I use third-party applications that don’t involve RIM. In the meantime, if there’s anyone out there who needs data center services that don’t require developing a high pain threshold, please let me know.
Vern, SwiftWater Telecom