Thursday, July 24, 2008

Cloud Outages, and Why *You* Have To Design For Failure

I haven't posted for a while because I have been thinking...a lot...about cloud computing, inevitable data center outages, and what they mean for application architectures. Try as I might to put the problem on the cloud providers, I keep coming back to one bare fact: the cloud is going to expose a lot of the shortcomings of today's distributed architectures, and this time it's up to us to make things right.

It all started with some highly informative posts from the Data Center Knowledge blog chronicling outages at major hosting companies, and failures that helped online companies learn important lessons about scaling, etc. As I read these posts, the thought that struck my mind was, "Well, of course. These types of things are inevitable. Who could possibly predict every possible negative influence on an application, much less a data center?" I've been in enough enterprise IT shops to know that even the very best are prepared for something unexpected to happen. In fact, what defines the best shops is that they assume failure and prepare for it.

Then came the stories of disgruntled employees locking down critical information systems or punching the emergency power kill switch on their way out the door. Whether or not you are using the cloud, human psychology being what it is, we have to live every day with immaturity or even just plain insanity.

Yet, each time one of the big name cloud vendors has an outage--Google had one, as did Amazon a few times, including this weekend--there are a bunch of IT guys crying out, "Well, there you go. The cloud is not ready for production."

Baloney, I say. (Well, I actually use different vocabulary, but you get the drift.) Truth is, the cloud is just exposing people's unreasonable expectations for what a distributed, disparate computing environment provides. The idea that some capacity vendor is going to give you 100% uptime for years on end--whether they promised it or not--is just delusional. Getting angry at your vendor for an isolated incident or pooh-poohing the market in general just demonstrates a lack of understanding of the reality of networked applications and infrastructure.

If you are building an application for the Internet--much less the cloud--you are building a distributed software system. A distributed system, by definition, relies on a network for communication. Some years ago, Peter Deutsch and others at Sun postulated a series of fallacies that tend to be the pitfalls all distributed systems developers run into at one time or another in their careers. Hell, I still have to check my work against these each and every time I design a distributed system.

Key among these is the delusion that the network is reliable. It isn't, it never has been, and it never will be. For network applications, great design is defined by the application or application system's ability to weather undesirable states. There are a variety of techniques for achieving this, such as redundancy and caching, but I will dive into those in more depth in a later post. (A great source for these concepts is http://highscalability.com.)
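To make "weathering an undesirable state" a little more concrete, here is a minimal Python sketch of the caching idea: keep the last good answer around and serve it when the network call fails. Everything in it (the StaleOnErrorCache name, the fetch callable, the ttl_seconds and clock parameters) is my own illustration and an assumption, not anybody's actual API, and a real system would need to think about eviction, locking, and how stale is too stale.

```python
import time


class StaleOnErrorCache:
    """Hypothetical helper: serve the last good value when a fetch fails."""

    def __init__(self, fetch, ttl_seconds=60.0, clock=time.monotonic):
        self._fetch = fetch        # callable(key) -> value; may raise OSError
        self._ttl = ttl_seconds    # how long a value counts as fresh
        self._clock = clock        # injectable so the behavior can be tested
        self._store = {}           # key -> (value, fetched_at)

    def get(self, key):
        entry = self._store.get(key)
        now = self._clock()
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]        # still fresh, no network call needed
        try:
            value = self._fetch(key)
        except OSError:
            # Network-style failure (ConnectionError and TimeoutError are
            # OSError subclasses). Degrade to stale data if we have any.
            if entry is not None:
                return entry[0]
            raise                  # nothing cached, so the failure surfaces
        self._store[key] = (value, now)
        return value
```

The design choice worth noticing is that the failure path is decided up front: a stale answer is treated as better than no answer, which is a business decision as much as a technical one.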

Some of the true pioneers in the cloud realized this early. Phil Wainewright notes that Alan Williamson of Mediafed made what appears to be a prescient decision to split their processing load between two cloud providers, Amazon EC2/S3 and FlexiScale. Even Amazon themselves use caching to mitigate S3 outages on their retail sites (see bottom of linked post for their statement).
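I don't know how Mediafed actually wired this up, so take the following as a rough Python sketch of the general idea only: rotate work across two (or more) providers, and if the one you picked is down, hand the job to the next. The MultiProviderDispatcher class and the provider callables are hypothetical names of my own; the callables stand in for whatever client code you would really use to talk to each vendor.

```python
class MultiProviderDispatcher:
    """Hypothetical sketch: split jobs across providers, fail over on error."""

    def __init__(self, providers):
        # providers: list of (name, run) pairs, where run(job) does the work
        # on that provider and raises OSError on a network-style failure.
        if not providers:
            raise ValueError("at least one provider is required")
        self._providers = list(providers)
        self._next = 0

    def submit(self, job):
        count = len(self._providers)
        start = self._next
        self._next = (self._next + 1) % count  # round-robin the starting point
        errors = []
        for offset in range(count):
            name, run = self._providers[(start + offset) % count]
            try:
                return name, run(job)          # first provider that answers wins
            except OSError as exc:
                errors.append(f"{name}: {exc}")
        raise RuntimeError("all providers failed: " + "; ".join(errors))
```

In normal operation the load is spread evenly; during an outage at one vendor, every job simply lands on the other. This only helps if the job can run on either provider, which is exactly the kind of thing you have to design for ahead of time.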

Michael Hickins notes in his E-Piphanies blog that this may be an amazing opportunity for some skilled entrepreneurs to broker failure resistance in the cloud. I agree, but I think good distributed system hygiene begins at home. The best statement I have seen is a comment on ReadWriteWeb:
"People rankled about 5 hours of downtime should try providing the same level of service. In my experience, it's much easier to write-off your own mistakes (and most organizations do), than it is to understand someone else's -- even when they're doing a better job than you would."
Amen, brother.

So, in a near-future post I'll go into some depth about what you can do to utilize a "cloud oriented architecture". Until then, remember: Only you can prevent distributed application failures.