Saturday, August 30, 2008

Elements of a Cloud Oriented Architecture

In my post, The Principles of Cloud Oriented Architectures, I introduced you to the concept of a software system architecture designed with "the cloud" in mind:
"...I offer you a series of posts...describing in depth my research into what it takes to deliver a systems architecture with the following traits:
  1. It partially or entirely incorporates the clouds for at least one layer of the Infrastructure/Platform/Application stack.
  2. Is focused on consumers of cloud technologies, not the requirements of those delivering cloud infrastructures, either public or private (or even dark).
  3. Takes into account a variety of technical, economic and even political factors that systems running in the "cloud" must take into account.
  4. Is focused at least as much on the operational aspects of the system as the design and development aspects
The idea here is not to introduce an entirely new paradigm--that's the last thing we need given the complexity of the task ahead of us. Nor is it to replace the basic principles of SOA or any other software architecture. Rather, the focus of this series is on how to best prepare for the new set of requirements before us."
I followed that up with a post (well, two really) that set out to define what our expectations of "the cloud" ought to be. The idea behind the Cloud Computing Bill of Rights was not to lay out a policy platform--though I am flattered that some would like to use it as the basis of one--but rather to set out some guidelines about what cloud computing customers should anticipate in their architectures. In this continuing "COA principles" series, I intend to lay out what can be done to leverage what vendors deliver, and to design around what they fail to deliver.

With that basic framework laid out, the next step is to break down what technology elements need to be considered when engineering for the cloud. This post covers only a list of such elements as I understand them today (feel free to use the comments below to add your own insights); future posts will provide a more thorough analysis of individual elements and/or related groups of elements. The series is really very "stream of consciousness", so don't expect too much structure or continuity.

When considering what elements matter in a Cloud Oriented Architecture, we consider first that we are talking about distributed systems. Simply using a hosted service to do your Customer Relationship Management doesn't require an architecture; integrating it with your SAP billing systems does. As your SAP systems most likely don't run in that service provider's data centers, the latter is a distributed systems problem.

Most distributed systems problems have just a few basic elements. For example:
  • Distribution of responsibilities among component parts

  • Dependency management between those component parts

  • Scalability and reliability

    • Of the system as a whole
    • Of each component

  • Data Access and Management

  • Communication and Networking

  • Monitoring and Systems Management

However, because cloud computing involves leveraging services and systems entirely outside of the architect's control, several additional issues must be considered. Again, for example:
  • How are the responsibilities of a complex distributed system best managed when the services being consumed are relatively fixed in the tasks they can perform?

  • How are the cloud customer's own SLA commitments best addressed when the ability to monitor and manage components of the system may be below the standards required for the task?

  • How are the economics of the cloud best leveraged?

    • How can a company gain the most work for the least amount of money?
    • How can a company leverage the cloud marketplace for not just cost savings, but also increased availability and system performance?
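
To make the economics questions a little more concrete, here is a minimal sketch in Python of one way to compare offerings: pick the lowest cost per unit of work among those that meet an availability floor. Everything in it is an assumption made for illustration. The `Offering` type, the `cheapest_offering` function, the vendor labels and all of the numbers are hypothetical, and real pricing is rarely this simple.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offering:
    """A hypothetical cloud offering; all figures are made up for illustration."""
    vendor: str
    price_per_hour: float        # what the customer pays per hour
    work_units_per_hour: float   # how much useful work one hour buys
    availability: float          # advertised availability, 0.0 to 1.0

    @property
    def cost_per_work_unit(self) -> float:
        return self.price_per_hour / self.work_units_per_hour


def cheapest_offering(offerings, min_availability):
    """Return the lowest cost-per-work-unit offering that meets the availability floor."""
    eligible = [o for o in offerings if o.availability >= min_availability]
    if not eligible:
        raise ValueError("no offering meets the availability requirement")
    return min(eligible, key=lambda o: o.cost_per_work_unit)


# Hypothetical marketplace: vendor-c is cheapest per unit of work,
# but only qualifies when the availability requirement is relaxed.
offerings = [
    Offering("vendor-a", price_per_hour=0.10, work_units_per_hour=100, availability=0.99),
    Offering("vendor-b", price_per_hour=0.40, work_units_per_hour=500, availability=0.999),
    Offering("vendor-c", price_per_hour=0.02, work_units_per_hour=40, availability=0.95),
]

relaxed_choice = cheapest_offering(offerings, min_availability=0.90)
strict_choice = cheapest_offering(offerings, min_availability=0.99)
```

The point of the sketch is only that "least money" and "required availability" have to be evaluated together: tightening the availability floor changes which offering wins.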

In an attempt to address the more cloud-specific distributed systems architecture issues, I've come up with the following list of elements to be addressed in a typical Cloud Oriented Architecture:
  • Service Fluidity - How does the system best allow for static redeployment and/or "live motion" of component pieces within and across hardware, facility and network boundaries? Specific issues to consider here include:

    • Distributed application architecture, or how is the system designed to manage component dependencies while allowing the system to dynamically find each component as required? (Hint: this problem has been studied thoroughly by such practices as SOA, EDA, etc.)
    • Network resiliency, or how does the system respond to changes in network location, including changes in IP addressing, routing and security?

  • Monitoring - How is the behavior and effectiveness of the system measured and tracked both to meet existing SLAs, as well as to allow developers to improve the overall system in the future? Issues to be considered here include:

    • Load monitoring, or how do you measure system load when system components are managed by multiple vendors with few or no formal agreements on how to share such data with the customer or each other?
    • Cost monitoring, or how does the customer get an accurate accounting of the costs associated with running the system from their point of view?

  • Management - How does the customer configure and maintain the overall system based on current and ongoing technical and business requirements? Examples of what needs to be considered here include:

    • Cost, or what adjustments can be made to the system capacity or deployment to provide the required amount of service capacity at the lowest cost possible? This includes ways to manage the efficiency of computation, networking and storage.
    • Scalability, or how does the system itself allow changes to capacity to meet required workloads? These changes can happen:
      • vertically (e.g. get a bigger box for existing components--physically or virtually)
      • horizontally (e.g. add or remove additional instances of one or more components as required)
      • from a network latency perspective (adjust the ways in which the system accesses the network in order to increase overall system performance)
    • Availability, or how does the system react to failure of any one component, or any group of components (e.g. when an entire vendor cloud goes offline)?

  • Compliance - How does the overall system meet organizational, industry and legislative regulatory requirements--again, despite being made up of components from a variety of vendors who may themselves provide computing in a variety of legal jurisdictions?
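
As a rough illustration of how Service Fluidity and Availability interact, the following Python sketch shows a tiny in-memory registry that resolves a logical component name to wherever that component currently runs, tolerates a change of network address, and skips endpoints from a vendor that has gone offline. This is a sketch under stated assumptions, not a recommended implementation: `Endpoint`, `ServiceRegistry`, the vendor labels and the addresses are all hypothetical, and a real system would need health checks, persistence and security that are left out here.

```python
class Endpoint:
    """Where one instance of a component currently lives (hypothetical model)."""

    def __init__(self, vendor, address):
        self.vendor = vendor      # hypothetical vendor label, e.g. "cloud-a"
        self.address = address    # current network location; may change
        self.healthy = True


class ServiceRegistry:
    """Maps a logical component name to the endpoints that can serve it."""

    def __init__(self):
        self._endpoints = {}

    def register(self, service, endpoint):
        self._endpoints.setdefault(service, []).append(endpoint)

    def relocate(self, service, old_address, new_address):
        # Network resiliency: callers keep using the logical name
        # while the underlying address changes.
        for ep in self._endpoints.get(service, []):
            if ep.address == old_address:
                ep.address = new_address

    def mark_vendor_offline(self, vendor):
        # Availability: an entire vendor cloud going offline
        # takes out every endpoint hosted there.
        for endpoints in self._endpoints.values():
            for ep in endpoints:
                if ep.vendor == vendor:
                    ep.healthy = False

    def resolve(self, service):
        # Return the first healthy endpoint, in registration order.
        for ep in self._endpoints.get(service, []):
            if ep.healthy:
                return ep
        raise LookupError(f"no healthy endpoint for {service!r}")


registry = ServiceRegistry()
registry.register("billing", Endpoint("cloud-a", "10.0.0.5:8443"))
registry.register("billing", Endpoint("cloud-b", "192.0.2.17:8443"))

first = registry.resolve("billing").address           # served from cloud-a
registry.relocate("billing", "10.0.0.5:8443", "10.0.9.9:8443")
moved = registry.resolve("billing").address           # same name, new address
registry.mark_vendor_offline("cloud-a")
failover = registry.resolve("billing").vendor         # falls over to cloud-b
```

The design choice worth noticing is that consumers only ever hold the logical name "billing"; redeployment, address changes and vendor outages are handled behind the lookup, which is the same dependency-management idea that SOA-style architectures already rely on.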

Now comes the fun of breaking these down a bit, and talking about specific technologies and practices that can address them. Please, give me your feedback (or write up your criticism on your own blog, but link here so I can find you). Point me towards references to other ways to think about the problem. I look forward to the conversation.