Monday, June 30, 2008

Why cloud computing doesn't get us out of the woods yet...

Jesse Robbins (a modern Renaissance man if ever there was one) quoted from a post by Theo Schlossnagle, author of Scalable Internet Architectures and President and CEO of OmniTI, in which Schlossnagle notes the challenge created when highly popular sites link to average-traffic sites, and its implications for scalable Internet architectures, including cloud computing.

As he carefully documents (using his blog site and two events triggered by Digg and the New York Times respectively), the nature of "spikes" on the Internet has changed dramatically. First, he shows a graph of traffic to his blog over a two-day period in March of 2008. Then he goes on to point out:
"What isn't entirely obvious in the above graphs? These spikes happen inside 60 seconds. The idea of provisioning more servers (virtual or not) is unrealistic. Even in a cloud computing system, getting new system images up and integrated in 60 seconds is pushing the envelope and that would assume a zero second response time. This means it is about time to adjust what our systems architecture should support. The old rule of 70% utilization accommodating an unexpected 40% increase in traffic is unraveling. At least eight times in the past month, we've experienced from 100% to 1000% sudden increases in traffic across many of our clients."
Stop and pay attention to that. The onset of traffic to near peak levels can take place in less than 60 seconds!

Sure, you can get "unlimited" capacity on demand from an Amazon/Mosso/GoGrid/whatever, but can you provision that extra capacity fast enough to meet demand? Clearly automation alone is not enough to guarantee that you will never lose a user. If never losing a user is critical to you, then running at a reduced utilization is probably the only really good answer. (Another possibility is implementing "warm" systems that primarily do another task, but that can be pressed into service in a high traffic situation with little or no manual intervention--and without a reboot.)
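
To put rough numbers on "reduced utilization", here is a minimal sketch of the headroom arithmetic. It assumes the simplest possible model, which is mine and not Schlossnagle's: a sudden increase of x (0.4 for +40%) must be absorbed entirely by capacity that is already running, so steady-state utilization can be at most 1 / (1 + x). The function name is made up for illustration.

```python
def max_safe_utilization(spike_fraction: float) -> float:
    """Highest steady-state utilization that still absorbs a sudden traffic
    increase of `spike_fraction` (0.4 means +40%) with no new capacity.

    Assumption: load scales linearly with traffic and nothing can be
    provisioned before the spike peaks.
    """
    if spike_fraction < 0:
        raise ValueError("spike_fraction must be non-negative")
    return 1.0 / (1.0 + spike_fraction)


# The old rule of thumb (+40%) versus the spikes quoted above (+100%, +1000%).
for spike in (0.4, 1.0, 10.0):
    safe = max_safe_utilization(spike)
    print(f"+{spike:.0%} spike -> run at or below {safe:.1%} utilization")
```

Under that assumption, the old 70% rule just about covers a 40% jump (the ceiling is about 71%), a 100% jump means running at 50% or less, and a 1000% jump means running at roughly 9%.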

I'm not sure I have a great answer for what to do here, but I think anyone who buys into "capacity on demand" should know that this capacity takes time to allocate, and that demand may outstrip supply for seconds or minutes. Nothing about the cloud can avoid the trade-off between utilization and "reactability".
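
A toy simulation makes that trade-off concrete. Everything in it is an assumption for illustration rather than a measurement: load ramps linearly to its peak over the ramp period, new capacity is requested the instant load first exceeds what is running, and enough capacity to cover the peak comes online a fixed delay later. The function name, parameters, and numbers are hypothetical.

```python
def simulate_spike(base_load, peak_load, ramp_seconds,
                   base_capacity, provision_delay, horizon=600):
    """Return (seconds_over_capacity, unserved_requests) for one spike.

    Assumed model: load is in requests/second and climbs linearly from
    base_load to peak_load over ramp_seconds, then stays at peak.
    Provisioning is triggered the first second load exceeds capacity and
    delivers enough capacity for peak_load after provision_delay seconds.
    """
    seconds_over = 0
    unserved = 0.0
    capacity = base_capacity
    ready_at = None  # second at which the extra capacity comes online

    for t in range(horizon):
        frac = min(t / ramp_seconds, 1.0)
        load = base_load + (peak_load - base_load) * frac

        # Request more capacity as soon as we are overloaded.
        if ready_at is None and load > capacity:
            ready_at = t + provision_delay
        # Extra capacity arrives once the delay has elapsed.
        if ready_at is not None and t >= ready_at:
            capacity = max(capacity, peak_load)

        if load > capacity:
            seconds_over += 1
            unserved += load - capacity

    return seconds_over, unserved


# 70 req/s on 100 req/s of capacity (70% utilization), a +1000% spike that
# peaks in 60 seconds, and a hypothetical two-minute provisioning delay.
seconds, dropped = simulate_spike(70, 770, 60, 100, 120)
print(f"over capacity for {seconds} s, about {dropped:.0f} requests unserved")
```

In this model the site is over capacity for the full provisioning delay unless the running headroom already covers the peak. Extra headroom that falls short of the peak reduces the number of unserved requests but not the length of the outage.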

Update: The folks at Project Caroline ran some simple tests, and feel they would be ready for this scenario. Ron Mann spells it all out for you, and it is an interesting read.