There have been a number of very high-profile service outages recently: Amazon AWS in April [1], United Airlines in June [2], and just this week, RIM. I doubt these will be the last. While it’s interesting to wonder how these companies could have let such critical services go down at all, there is a more fundamental problem here. It’s the network.
As RIM noted on its website, "The messaging and browsing issues many of you are still experiencing were caused by a core switch failure within RIMs Infrastructure."
In my view, RIM’s outage is just one more example of how broken and misapplied “modern-day” networking technology is for our mission-critical services. Traditional networking gear held together by overly complex protocols and schemes simply isn’t working.
I won't join the bandwagon speculating on whether this is the nail in RIM’s coffin. As a former Crackberry addict, I'm still rooting for them! I do find it interesting, though, that a company that has operated a rock-solid enterprise network for so many years (the best in the enterprise business by most measures) is now the latest victim of a “networktastrophe.” From what I've read in the press (most of it unsubstantiated), it sounds like a large switch had a module failure and the supposed redundancy scheme (likely something like VRRP) did not operate as planned. A more likely explanation, though, is that someone fat-fingered a command in a 1980s-era command-line interface, and all hell broke loose.
We can certainly blame RIM for not testing their redundancy architecture or procedures well enough, but let’s not overlook the fundamental problem: the network and its apparent fragility. Everything from the complexity (and corresponding fragility) of the protocols to the downright user-unfriendly management needs to be rethought for today’s critical data and service needs. We built these systems to solve a different problem, under a different set of constraints, than the one we’re trying to address with them today. And the band-aid layers we've applied over the years have made the network an overly complex and inflexible house of cards, at risk of coming down at the slightest mis-key.
[1] Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region - aws.amazon.com/message/65648/
[2] United blames outage on network connectivity issue, Bloomberg Businessweek, June 18, 2011 - www.businessweek.com/ap/financialnews/D9...