Saturday, August 02, 2008

The Principles of a Cloud Oriented Architecture

The market is hot. The technologies are appearing fast and furious. The tools you need are out there, but they are young, often untested, and always deliver unpredictable reliability. You've researched the economics, and you know now that cloud computing is a) here to stay, and b) offers economic advantages that--if realized--could stretch you IT budget and quite possibly catapult your career.

Now what?

What is often overlooked in the gleeful rush to cloud computing is the difficulty in molding the early technologies in the space into a truly bulletproof (or even bullet-resistant) business infrastructure. You see it all over the Internet; the push and pull between innovation and reliability, the concerns about security, monitoring and control, even the constant confusion over what entails cloud computing, what technologies to select for a given problem, and how to create an enterprise-class business system out of those technologies.

The truth is, cloud computing doesn't launch our technical architectures into the future. It is, at its heart, an economic model that drives the parameters around how you acquire, pay for and scale the infrastructure architectures you already know. Its not a question of changing the required problems to solve when utilizing data centers, just a change to the division of responsibilities amongst yourself, your organization, your cloud providers and the Internet itself.

To this end, I offer you a series of posts (perhaps moving to a WIKI in the near future) describing in depth my research into what it takes to deliver a systems architecture with the following traits:
  1. It partially or entirely incorporates the clouds for at least one layer of the Infrastructure/Platform/Application stack.
  2. Is focused on consumers of cloud technologies, not the requirements of those delivering cloud infrastructures, either public or private (or even dark).
  3. Takes into account a variety of technical, economic and even political factors that systems running in the "cloud" must take into account.
  4. Is focused at least as much on the operational aspects of the system as the design and development aspects
The idea here is not to introduce an entirely new paradigm--that's the last thing we need given the complexity of the task ahead of us. Nor is it to replace the basic principles of SOA or any other software architecture. Rather, the focus of this series is on how to best prepare for the new set of requirements before us.

Think about it. We already deal (or try to deal) with a world in which we don't entirely have control over every aspect of the world our applications live in. If we are software developers, we rely on others to build our servers, configure our networks, provide us storage and weld them all together into a cohesive unit. System administrators are, in large enterprises anyway, specializing in OS/application stacks, networking, storage or system management. (Increasingly you can add facilities and traditional utilities to this list.)

Even when we outsource to others--shifting responsibility for management of parts or all of our IT infrastructure to a vendor--the vendor doesn't have control over significant elements of the end-to-end operations of our applications; namely, the Internet itself. But with outsourcing, we typically turn over entire, intact architecture stacks, with a few, very well bounded integration points to manage (if any) between outsourced systems and locally maintained systems.

The cloud is going to mess this up. I say this not just because the business relationship is different from outsourcing, but also because what you are "turning over" can be a *part* of a system stack. Smugmug outsources storage and job processing, but not the web experience that relies on both. Applications that run entirely on EC2/S3 outsource the entire infrastructure, but not the application development, or even the application system management. (This is why RightScale, Hyperic and others are finding some traction with AWS customers.)

To prepare for a cloud oriented architecture, one understand what responsibilities lie where. So, I'll give you a teaser of what is to come with the short-short version of where I see these responsibilities lying (subject to change as I talk to others, including yourselves if you choose to comment on this post):
  • The enterprise has responsibility for the following:
    • Defining the business solution to be solved, the use cases that define that solution, and the functional requirements to deliver those use cases
    • Evaluating the selection of technical and economic approaches for delivering those functional requirements, and selecting the best combination of the two. (In other words, the best combination may not contain either the best technical or best economic selection, but will outweigh any other combination of the two.)
    • Owning the service level agreements with the business for the delivery of those use cases. This is critically important. More on this below.
  • The cloud provider has responsibility for the following:
    • Delivering what they promised you (or the market) that they would deliver. No more, no less.
    • Providing you with transparent and honest billing and support services.
  • The Internet itself is only responsible for providing you with an open, survivability reliable infrastructure for interconnecting the networks you need to run your applications and/or services. There are no promises here about reliability or scalability or even availability. It should be considered a technical wilderness, and treated accordingly.
Now, about SLAs. Your cloud provider does not own your SLAs, you do. They may provide some SLAs that support your own, but they are not to be blamed if you fail to achieve the SLAs demanded of you. If your applications or services fail because the cloud failed, you failed. Given that, don't "outsource" your SLAs, at least not logically. Own them.

In fact, I would argue that the single most important function of a cloud-centric IT shop after getting required business functionality up and running in the first place, is monitoring and actively managing that functionality; switching vendors, if necessary, to continue service at required levels. The one big piece of IT-specific software that should always run in IT data centers, in my opinion, is the NOC infrastructure. (Although, perhaps in this context its more of a Cloud Operations Center, but I hate the resulting acronym for obvious reasons.)

I'll focus more on these responsibilities in future posts. All posts in this series will be tagged "coa principles". Please feel free to provide me feedback in the comments, contact me to review your thoughts on this topic, or simply to send me links that you think I should be aware of. I am also working to find other bloggers who wish to take ownerships of parts of this primer (cloud security, for example) so let me know if you are interested there as well.

I am excited about this. This body of knowledge (or at least the faint traces of knowledge) have been rattling inside my head for some time, and it feels good to finally be sharing them with you.