Monday, July 30, 2007

SLAuto vs. SLA

A while back, Eric Westerkamp over at eCloudM asked a simple question that got me thinking. Eric wondered aloud (in print?) whether "using the Term Service Level Automation (SLA) causes confusion when presenting the ideas and topics into the business community. I have most often seen SLA refer to Service Level Agreements. While similar in concept, they are very different in implementation."

So, to keep things clear in my blog, I will now use the acronym SLAuto for Service Level Automation, and retain the SLA moniker for Service Level Agreements. I hope this eliminates confusion and allows the market to talk more freely about Service Level Automation.

Speaking of SLA, though, Steve Jones posted a great example of the symbiotic relationship between the customer and the service provider in the SLA equation. To put it in the context of an enterprise data center, you could offer 100% uptime, 0.1-second response time and a five-minute turnaround time, but it wouldn't be of any value if the customer's application was buggy, they were on a dial-up network and it took them six weeks to get the requirements right for a build-out.

Now, let's look at that in the context of SLAuto. To my eye, the service provider in an SLAuto environment is the infrastructure. The customer is any component or person that accesses or depends on any piece of that infrastructure. Thus, any SOA service can be a service provider in one context, and a customer in another. Even the policy engine(s) that automate the infrastructure can be thought of as a customer in the context of monitoring and management, and a service provider in terms of an interface for other customers to define service level parameters.
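
To make that concrete, here is a minimal, hypothetical sketch (in Python) of a policy engine playing both roles: it exposes an interface for customers to define service level parameters, and it consumes monitoring data about the infrastructure it automates. All of the names here are illustrative assumptions; nothing reflects any particular product.

# Hypothetical sketch of the SLAuto roles described above. Names
# (PolicyEngine, ServiceLevel, etc.) are illustrative only.

from dataclasses import dataclass


@dataclass
class ServiceLevel:
    """Service level parameters a customer asks the policy engine to enforce."""
    name: str
    max_response_ms: float   # e.g. 100 ms
    min_uptime_pct: float    # e.g. 99.9


class PolicyEngine:
    """Acts as a service provider (customers register service levels here)
    and as a customer (it consumes monitoring data about the infrastructure)."""

    def __init__(self):
        self.policies = {}

    def define_service_level(self, sl: ServiceLevel):
        # Provider role: the interface customers use to declare what they need.
        self.policies[sl.name] = sl

    def evaluate(self, name: str, observed_response_ms: float, observed_uptime_pct: float):
        # Customer role: the engine depends on monitoring data from the infrastructure.
        sl = self.policies[name]
        if observed_response_ms > sl.max_response_ms or observed_uptime_pct < sl.min_uptime_pct:
            return "adjust-capacity"   # trigger automated remediation
        return "ok"


engine = PolicyEngine()
engine.define_service_level(ServiceLevel("order-service", max_response_ms=100, min_uptime_pct=99.9))
print(engine.evaluate("order-service", observed_response_ms=250, observed_uptime_pct=99.95))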

Steve's example hints that I could buy the "bestest", fastest, coolest high tech servers, switches and storage on the planet and it wouldn't increase my service levels if I couldn't deliver that infrastructure to the applications that required it quickly, efficiently and (most importantly) reliably. Or, for that matter, if those applications couldn't take advantage of it. So, if you're going to automate, your policy engine should be (you guessed it) quick, efficient and reliable. If it isn't, then your SLAs are limited by your SLAuto capabilities.

Not what you intended, I would think...

Wednesday, July 25, 2007

Why it all boils down to infrastructure management...

Digital Daily made my day today with the story of data center operator 365 Main's declaration of its amazing uptime record, followed hours later by the complete loss of its San Francisco data center for a good hour--a data center populated by a veritable "who's who" of Silicon Valley tech companies. As a result of the SF power outage, a good portion of the Web 2.0 social network world was also down.

What does this have to do with Utility Computing? Well, as Nicholas Carr points out, it has everything to do with the key problem of computing today: not software, not social networks, but infrastructure. 365 Main is (reportedly) one of the best colocation hosting facilities in the world, and despite extensive backup power, including two sets of generators, they didn't get the job done. Worse yet, they were bitten by something they can do little about: old and tired power infrastructure.

When we talk about utility computing architectures and standards, we must remember that, despite infrastructure becoming a commodity, it is still the foundation on which all other computing services are reliant. If we can't provide the basic capabilities to keep servers running, or quickly replace them if they fail, then all of our pretty virtual data centers, software frameworks and business applications are worthless.

For 365 Main, their focus initially will probably be on why all that backup equipment didn't save their butts as it was designed (and purchased) to do. I mean, as much as I would like to make this a story about how great Service Level Automation software would have saved the day, I have to admit that even my employer's software can't make software run on a server with no electricity.

However, and perhaps more importantly, the operations teams of those Silicon Valley companies that found themselves without service for this hour or so must ask themselves some key questions:
  1. Why weren't their systems distributed geographically, so that some instances of every service would remain live even if an entire data center were lost? If the problem was the architecture of the applications, that is a huge red flag that they aren't making the investment in good architecture that they need to. If the problem is the cost and difficulty of replicating geographically, then I believe Service Level Automation/utility computing software can help here by decoupling software images from the underlying hardware (virtual or physical).
  2. Why couldn't they have failed over to a DR site in a matter of minutes? Using the same utility computing software and some basic data replication technologies, any one of these companies could have kept its software and data images in sync between live and DR instances, and fired up the whole DR site with one command (see the sketch after this list). Heck, they could even have used another live hosting facility/lab to act as the DR site and just repurposed as much equipment as they needed.
  3. How are they going to increase their adaptability to "environmental" events in the future, especially given the likelihood that they won't even own their infrastructure? Again, this is where various utility computing standards become important, including software and configuration portability.
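
As a rough illustration of point 2, here is a hypothetical sketch of what "fire up the DR site with one command" amounts to once software images are decoupled from hardware: map each replicated image onto whatever spare node is available and boot it. The names and the provisioning step are assumptions, not any real API.

# Hypothetical sketch of a one-command DR failover. Software images are
# kept decoupled from hardware, so any free node at the secondary site
# can be repurposed. All names are illustrative.

IMAGES = ["web-frontend", "app-tier", "database-replica"]   # synced to DR storage
SPARE_NODES = ["dr-node-01", "dr-node-02", "dr-node-03"]    # lab/hosting gear to repurpose


def fail_over(images, nodes):
    """Assign each replicated image to an available node and boot it."""
    if len(nodes) < len(images):
        raise RuntimeError("not enough spare capacity at the DR site")
    plan = dict(zip(images, nodes))
    for image, node in plan.items():
        # In a real SLAuto/utility computing stack this would call the
        # provisioning layer; here we just print the intended action.
        print(f"booting image '{image}' on '{node}'")
    return plan


if __name__ == "__main__":
    fail_over(IMAGES, SPARE_NODES)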

I recommend reading Om Malik's piece (as referenced by Carr) to get an understanding of why this is more important than ever. According to Malik, hosting everywhere is a house of cards, if only because of the aging power infrastructure it relies on. Some people are commenting that geographic redundancy is unnecessary when using a top-tier hosting provider, but I think yesterday's events prove otherwise.

What I believe the future will bring is commoditized computing capacity that, in turn, will make the cost of distributing your application across the globe almost the same as running it all in one data center. You may be charged extra for "high availability" service levels, but the cost won't be that significant, with the difference mostly going to the additional monitoring and service provided to guarantee those service levels.

Of course, that will require portability standards...

Friday, July 20, 2007

Where's the standard, bub?

Simon Wardley's Bits and Pieces blog has an interesting post breaking down what he thinks are the three key markets for utility computing:
  • SaaS: Software as a Service
  • FaaS: Frameworks as a Service
  • HaaS: Hardware as a Service
As far as it goes, this is fine, but I think his most interesting comments concern the need for Common Service Providers (CSPs) in each of these areas, the need for portability across those providers, and the mechanisms he thinks will drive the standards that will enable that portability. To quote:

The issues and the needs of a competitive utility computing market are also the same at each level - portability, multi-providers and agreed standards and solves the same class of problems - disaster recovery, scalability, efficiency and exit costs.

In today's world the fastest way to achieve a standard is not through committee, conversation or whitepapers but through the release and adoption of not only a standard but also an operational means of achieving a standard.

Hence such utility computing standards will only be achieved through the use of open source, without any one CSP being strategically disadvantaged to any owner of the standard.

To be sure, this is controversial, but it aligns nicely with an observation that I and others have made about Xen, and why it may struggle to supersede VMware, despite being freely distributed by just about every major OS vendor out there.

One of my colleagues put it best in an email:

As to the larger question of why Xen is failing miserably, I would like to profess this opinion -- Storage. The Xen / KVM / Linux / RedHat community botched storage. Hence, they are failing in the marketplace.

To elaborate:

Virtualization brings two important benefits:
  1. Separates the OS+App stack from the underlying hardware
  2. Enables you to package the OS+App stack into a VM that you can fling around with ease... this is the storage angle.
Xen accomplished (1) reasonably well.

Xen failed miserably with (2), i.e., Storage. VMware solved the storage issues admirably well. So long as Xen solutions do not work well in the "copy virtual disk files around and run them anywhere" model, Xen will not succeed.
In other words, Xen has portability issues because it does not provide a file representation of virtual machine storage that can be moved between disparate physical systems with ease. I know there are some virtual appliance vendors out there that support Xen, so maybe the problem is solved with their technologies. However, there is no standard proposed by Xen, and thus there is no portable standard for Xen VMs as files.

Alas, VMware has a nice portable file representation of a VM. Granted, there are portability issues there as well, but by and large VMware has a much better solution to portability--within VMware hosts. Unfortunately, there is still no solution (that I know of) that will run the same virtual machine files on both VMware and Xen. Thus, no universal portability is coming soon from the VM space.
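
To picture the portability problem, here is a purely illustrative sketch of the sort of metadata a portable VM "file" would have to carry, and the check a host would make before running it. This is not VMware's format, Xen's, or any proposed standard; every field name here is an assumption, just a way to see where the overlap breaks down today.

# Illustrative sketch of a hypothetical portable VM descriptor and the
# compatibility check a host would apply. Not a real format.

vm_descriptor = {
    "name": "order-service-vm",
    "disks": [
        {"path": "order-service-root.img", "format": "raw", "size_gb": 20},
    ],
    "cpu_count": 2,
    "memory_mb": 2048,
    "supported_hypervisors": ["vendor-a", "vendor-b"],   # the part that rarely overlaps today
}


def can_run(descriptor, hypervisor, available_formats):
    """A host can run the VM only if it understands both the descriptor
    and every disk format it references."""
    return (
        hypervisor in descriptor["supported_hypervisors"]
        and all(d["format"] in available_formats for d in descriptor["disks"])
    )


print(can_run(vm_descriptor, "vendor-a", {"raw"}))   # True
print(can_run(vm_descriptor, "vendor-c", {"raw"}))   # False: no common standard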

Recently, I have been telling anyone who will listen that this nascent utility computing market is still searching for a standard for server (VM/framework/application/whatever) portability across disparate utility computing service providers. I like the concept of a virtual appliance, but we need a (non-proprietary) standard, or we need another portability mechanism besides VMs. (As a side note to my new friends at eCloudM--this is definitely an opportunity, though it may not meet your criteria.)

Otherwise, utility computing will be "choose your vendor and build your software accordingly", not "build your software as you like and choose any vendor you want".

Please, if I am way off, correct me with a comment or a blog post. I would love to find out I am wrong about this...

A Helping Hand Comes In Handy Sometimes

You may remember my recent post on how data centers resemble complex adaptive systems. That description has one glaring difference from a true complex adaptive system, however: data centers require some form of coordinated management beyond what any single entity can provide. In a truly complex adaptive system, there would be no "policy engines" or even Network Operations Centers. Each server, each switch, each disk farm would attempt to adapt to its surroundings, and either survive or die.

Therein lies the problem, however. Unlike a biological system, the corporate economy, or even a human society, a data center cannot afford to have one of its individual entities (or "agents" in complex systems parlance) arbitrarily disappear from the computing environment. It certainly cannot rely on "trial and error" to determine what survives and what doesn't. (Of course, in terms of human management of IT, this is often what happens, but never mind...)

Adam Smith called the force that guided selfish individuals to work together for the common benefit of the community the "invisible hand". The metaphor is good for explaining how decentralized adaptive systems can organize for the greater good without a guiding force, but the invisible hand depends on the failure of those agents who don't adapt.

Data centers, however, need a "visible hand" to quickly correct some (most?) agent failures. To automate and scale this, certain omnipotent and omnipresent management systems must be mixed into the data center ecology. These systems are responsible for maintaining the "life" of dying agents, particularly if the agents lose the ability to heal themselves.
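
Here is a minimal sketch of that "visible hand" idea: a management loop that probes agents and revives the ones that cannot heal themselves. The health check and revival steps are placeholders for real probes and provisioning actions, not any actual product's behavior.

# Minimal sketch of a "visible hand" management loop. The probe and
# revival logic are hypothetical placeholders.

import time


def check_health(agent):
    # Placeholder: in practice this would be a real probe (ping, HTTP
    # check, hardware sensor, etc.).
    return agent.get("healthy", False)


def revive(agent):
    # Placeholder for restarting a service or reprovisioning a failed node.
    print(f"reviving agent '{agent['name']}'")
    agent["healthy"] = True


def visible_hand(agents, interval_sec=30, iterations=1):
    """Unlike the invisible hand, failures are corrected rather than culled."""
    for _ in range(iterations):
        for agent in agents:
            if not check_health(agent):
                revive(agent)
        time.sleep(interval_sec)


agents = [{"name": "web-01", "healthy": True}, {"name": "db-01", "healthy": False}]
visible_hand(agents, interval_sec=0, iterations=1)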

Now, a topic for another post is the following: can several individual resource pools, each with their own policy engine, be joined together in a completely decentralized model?

Sunday, July 08, 2007

Web 2.0, Utility Computing and Service Level Automation

Industry Girl has an interesting observation about what is driving utility computing. She believes that the drive toward "Web 2.0 sites and applications, like the video on YouTube or the social networking pages on MySpace" is creating huge demand on back-end server infrastructure--unpredictable demand, I might add--which, in turn, creates the need for truly dynamic capacity allocation. Add to that the trend of Web 2.0 technologies being used by more and more commercial and public organizations, and you begin to see why it's time to turn your IT into a utility.

I have to say I agree with her, but I would like to observe that this is only a (large) piece of the overall picture. In reality, most of the sites she specifically mentions are actually "Software as a Service" sites targeted at consumers rather than businesses. It's the trend toward getting your computing technology over the Internet in general that is the real driving need.

For "utility computing" plays such as SaaS companies, managed hosting vendors, booksellers :), and others, the need for utility computing isn't just the need to find capacity, it is also the need to control capacity. This, in turn, means intelligent, policy-based systems, that can deliver capacity to where it is needed, share capacity among all compatible software systems and meter capacity usage in enough detail to allow the utility to constantly optimize "profit" (which may or may not be financial gain for the capacity provider itself).

Service Level Automation, anyone?

Web 2.0 drives utility computing, which in turn drives service level automation. So, Industry Girl, I welcome your interest in utility computing, and offer that the extent to which utility computing succeeds is the extent to which such infrastructure delivers the functionality required at the service levels demanded. Welcome to our world...