The Wisdom of Clouds: Why it all boils down to infrastructure management...

Wednesday, July 25, 2007

Why it all boils down to infrastructure management...

Digital Daily made my day today with the story of data center operator 365 Main's declaration of its amazing uptime record, followed hours later by the complete loss of its San Francisco data center for a good hour. That data center is populated by a veritable "who's who" of Silicon Valley tech companies, so as a result of the SF power outage, a good portion of the Web 2.0 social network world was also down.

What does this have to do with Utility Computing? Well, as Nicholas Carr points out, it has everything to do with the key problem of computing today: not software, not social networks, but infrastructure. 365 Main is (reportedly) one of the best colocation hosting facilities in the world, and despite several layers of backup power, including two sets of generators, they didn't get the job done. Worse yet, they were bitten by something they can do little about: old and tired power infrastructure.

When we talk about utility computing architectures and standards, we must remember that, despite infrastructure becoming a commodity, it is still the foundation on which all other computing services rely. If we can't provide the basic capabilities to keep servers running, or to quickly replace them if they fail, then all of our pretty virtual data centers, software frameworks and business applications are worthless.

365 Main's initial focus will probably be on why all that backup equipment didn't save their butts as it was designed (and purchased) to do. I mean, as much as I would like to make this a story about how great Service Level Automation software would have saved the day, I have to admit that even my employer's software can't make software run on a server with no electricity.

However, and perhaps more importantly, the operations teams of those Silicon Valley companies that found themselves without service for this hour or so must ask themselves some key questions:

Why weren't their systems distributed geographically, so that some instances of every service would remain live even if an entire data center were lost? If the problem was the architecture of the applications, that is a huge red flag that they just aren't investing in good architecture the way they need to. If the problem was the cost or difficulty of replicating geographically, then I believe Service Level Automation/utility computing software can help here by decoupling software images from the underlying hardware (virtual or physical).
Why couldn't they have failed over to a DR site in a matter of minutes? Using the same utility computing software and some basic data replication technologies, any one of these vendors could have kept their software and data images in sync between live and DR instances, and fired up the whole DR site with one command. Heck, they could even have used another live hosting facility/lab to act as the DR site and just repurposed as much equipment as they needed.
How are they going to increase their adaptability to "environmental" events in the future, especially given the likelihood that they won't even own their infrastructure? Again, this is where various utility computing standards become important, including software and configuration portability.
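
As a rough illustration of the first question above, here is a minimal Python sketch of the kind of check an operations team might run against its own inventory: which services would have no live instance left if one site went dark? The service names, site names and the `services_lost_if_site_fails` helper are all hypothetical, made up for this example; this is not a description of any particular product's behavior.

```python
from collections import defaultdict


def services_lost_if_site_fails(placements, failed_site):
    """Return services that would have no live instance if failed_site went dark.

    placements: iterable of (service, site) pairs, one per running instance.
    """
    sites_by_service = defaultdict(set)
    for service, site in placements:
        sites_by_service[service].add(site)
    # A service is lost when every site it runs in is the failed one.
    return sorted(
        service
        for service, sites in sites_by_service.items()
        if sites <= {failed_site}
    )


# Hypothetical inventory; names are illustrative only.
placements = [
    ("web", "sf"),
    ("web", "nyc"),
    ("db", "sf"),
    ("search", "sf"),
    ("search", "nyc"),
]

print(services_lost_if_site_fails(placements, "sf"))
```

Any service the check reports is a single point of failure at the data center level, which is exactly the exposure the questions above are meant to surface.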

I recommend reading Om Malik's piece (as referenced by Carr) to get an understanding of why this is more important than ever. According to Malik, hosting everywhere is a house of cards, if only because of the aging power infrastructure that it relies on. Some people are commenting that geographic redundancy is unnecessary when using a top-tier hosting provider, but I think yesterday's events prove otherwise.

What I believe the future will bring is commoditized computing capacity that, in turn, will make the cost of distributing your application across the globe almost the same as running it all in one data center. You may be charged extra for "high availability" service levels, but the cost won't be that significant, with the difference mostly going to the additional monitoring and service provided to guarantee those service levels.

Of course, that will require portability standards...
