Thursday, May 24, 2007

Service Level Automation Deconstructed: Respond

For the third and last post in my series breaking down the three key assumptions behind Service Level Automation, I would like to focus on how SLA environments can control data center configuration in response to service level goal violations. The goal violations themselves, and the high-level actions to be taken, are determined by the analysis capability of the environment. The details of how to accomplish those high-level actions, however, are decided and executed by the response function.

Essentially, the response function of an SLA environment is very much like the driver set that your operating system uses to translate high-level actions (e.g. "store this file") into device-specific actions ("move head 32 steps to center, find block 4D5EF, etc."). Its responsibility is to provide the interface between the SLA analysis engine and the standard or proprietary interfaces of everything from server hardware to network switches to operating systems and middleware. (I'll sketch what such a "driver" contract might look like after the list below.)

I see the following key interface points in today's environments:

  • Power Controllers/MPDUs: Job 1 of a service level automation environment is providing the resources required to meet the needs of the software environment, and only those resources. Turn servers on when they are needed, and off when they are not. This includes virtual server hosts. (Examples: DRAC, iLO, RSA II, MPDUs)
  • Operating Systems: Before you shut off that server, make sure you've "gently" shut down its software payload. Well-written server payloads for automated environments will both start up and acquire initial state (if any), and shut down while preserving any necessary state, without human intervention. From a communications perspective, however, each of these actions starts with the OS. (Examples: Red Hat, SuSE, MS Windows, Sun Solaris)
  • Middleware/Virtualization: It is interesting to note that many software payload components (e.g. an application server or a hypervisor) are both software to be managed and computing capacity themselves. For example, an application server should be managed to specific service levels in its relationship with its host server (e.g. CPU utilization, thread counts, etc.), while also being treated as a capacity resource for JavaEE applications and services. As such, these software containers should be managed for their own guest payloads much as a physical server is managed for its overall server payload. (Examples: BEA WebLogic, VMWare ESX, XenSource XenEnterprise)
  • Layer 2 Networking: In order to use a server to meet an application's needs, that server must have access to the required networks. True automation requires that switch ports be reconfigured as necessary to ensure access to exactly the VLANs required by the payloads they will host. (Examples: Cisco 3750, Extreme Summit 400)
  • Network Attached Storage (NAS): The beauty of NAS devices is that they can be dynamically attached to a software payload at startup, without requiring any hardware configuration beyond the Layer 2 configuration described above. SAN is also useful (and common), but requires hardware configuration to work, which complicates the role of automation. Part of the problem is the inconsistent remote configurability of Fibre Channel switches, which may be mitigated somewhat by iSCSI. However, NAS is quickly becoming the preferred storage mechanism in large data centers. (Examples: NetApp FAS, Adaptec Snap Server)
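
To make the driver analogy a little more concrete, here is a minimal sketch, in Java, of what a uniform response "driver" contract might look like. Everything here--ResponseDriver, ResponseAction, PowerControllerDriver, and all of the method names--is my own invention for illustration, not an API from any shipping product, and the device-specific plumbing is deliberately left out:

    // Hypothetical contract an SLA analysis engine could call, regardless of
    // whether the target is a power controller, an OS, a hypervisor, or a switch.
    interface ResponseDriver {
        // Identity of the managed resource (e.g. "rack4-blade07").
        String resourceId();

        // Carry out one high-level action using whatever device-specific
        // protocol this driver wraps.
        void execute(ResponseAction action) throws DriverException;
    }

    // The small vocabulary of high-level actions the analysis engine speaks.
    enum ResponseAction {
        POWER_ON, POWER_OFF, START_PAYLOAD, SHUT_DOWN_PAYLOAD, JOIN_VLAN, LEAVE_VLAN
    }

    // Checked exception for any failure reported by a driver.
    class DriverException extends Exception {
        DriverException(String message) { super(message); }
    }

    // One possible driver: translates power actions into calls against a
    // management processor or MPDU (iLO, DRAC, RSA II, etc.).
    class PowerControllerDriver implements ResponseDriver {
        private final String host;

        PowerControllerDriver(String host) { this.host = host; }

        public String resourceId() { return host; }

        public void execute(ResponseAction action) throws DriverException {
            switch (action) {
                case POWER_ON:  sendManagementCommand("power on");  break;
                case POWER_OFF: sendManagementCommand("power off"); break;
                default: throw new DriverException("Unsupported action: " + action);
            }
        }

        // Placeholder for the vendor-specific wire protocol (IPMI, SSH, SNMP, etc.).
        private void sendManagementCommand(String command) throws DriverException {
            // device-specific details elided
        }
    }

The analysis engine never needs to know whether it is flipping a power relay or reconfiguring a switch port; it just hands a high-level action to the right driver, exactly as an OS hands "store this file" to a disk driver.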

Over time, I see the industry adding more and more "drivers" to manage more and more data center (and perhaps desktop) resources. Imagine a world in which each software and/or hardware vendor produced standard SLA drivers for each component that makes up your data center environment. Every switch, disk and server; every service, container and OS; even every light bulb and air conditioner is connected to a single service level policy engine, in which business policy (including cost of operations) drives automated decisions about their use and disuse.
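
To illustrate where that could lead, here is an equally hypothetical sketch of such a policy engine, reusing the made-up ResponseDriver types from the sketch above. The names, the crude cost model, and the idea that the analysis function hands over a simple "servers needed" number are all assumptions for the sake of the example:

    import java.util.List;

    // Hypothetical policy engine: business policy (here, an hourly spend
    // ceiling) decides how much idle capacity is actually brought online when
    // the analysis function reports a service level goal violation.
    class ServiceLevelPolicyEngine {
        private final List<ResponseDriver> idleCapacity;   // powered-off resources
        private final double hourlyCostPerServer;          // operations cost input
        private final double maxHourlySpend;                // the business policy

        ServiceLevelPolicyEngine(List<ResponseDriver> idleCapacity,
                                 double hourlyCostPerServer,
                                 double maxHourlySpend) {
            this.idleCapacity = idleCapacity;
            this.hourlyCostPerServer = hourlyCostPerServer;
            this.maxHourlySpend = maxHourlySpend;
        }

        // Called by the analysis function when a goal is (or is about to be)
        // violated; serversNeeded is its high-level recommendation.
        void onGoalViolation(int serversNeeded) throws DriverException {
            int affordable = (int) (maxHourlySpend / hourlyCostPerServer);
            int toStart = Math.min(serversNeeded,
                                   Math.min(affordable, idleCapacity.size()));
            for (int i = 0; i < toStart; i++) {
                ResponseDriver server = idleCapacity.remove(0);
                server.execute(ResponseAction.POWER_ON);        // respond: add capacity
                server.execute(ResponseAction.START_PAYLOAD);   // then start its payload
            }
        }
    }

The point is not the (deliberately naive) cost check, but that the decision logic lives in one place and speaks only the high-level action vocabulary; the drivers hide everything vendor-specific.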

It's not here yet, but you won't have to wait long...

I will use the label "respond" to tag posts related to response interfaces.

Wednesday, May 23, 2007

If your CIO doesn't "get it" yet, he soon will

Ken Oestreich posted a cool review of the Forrester Research 2007 IT Forum in Nashville, TN last week. I was fascinated by Forrester's contention that there are "no IT projects anymore, only business projects" (aka "BT"), but perhaps the most interesting observation of the day was the following:

However, the best talk IMHO came from Robert Beauchamp, CEO of BMC software. He's a very down-to-earth, articulate guy--even in front of 1,000 people. I was most impressed by his Shoemaker's Children analogy... that the IT (alright, BT) organizations in enterprises are arguably the least automated departments around. ERP is automated. Finance is automated. Customer interaction is automated. But IT is still manually glued-together, with operations costs continuing to outpace capital investments.

(Emphasis mine.)

This is a gorgeous observation; so simple, so articulate, and--most importantly--so true! I have always been amazed at the amount of manual labor that goes into delivering technology that makes some other schmuck's life more labor-free. Programming is a great example of this. (Even with advances in code building, IDE templating and wizard-based programming, I bet the vast majority of developers out there still shudder at the term "code generation".)

Server provisioning (bare metal or virtual servers) is also a great example. In a prior life, I worked for one of the most forward-thinking technology companies out there. However, when it came to pushing code to production, it was still a server-by-server hand-install job. Provisioning four front-end portal servers took anywhere from a couple of hours to a couple of days.

Another example is trouble ticket response. How many of the system operators out there still carry pagers, and are forced to get out of bed in the middle of the night to respond to a system event? If you say "not at my company", I bet you have overseas support to back you up overnight. The response remains completely manual.

That is why I am so excited about the Service Level Automation space and its role in both utility computing and automating IT processes. It is about time this happened, and I hope your "BT" organization is considering it.

Monday, May 07, 2007

VMWare TSX and Reducing Complexity in the Data Center

I had a busy end to the week last week. On Wednesday, I attended VMWare TSX in Las Vegas, and on Friday, I had the chance to hear Bill Coleman speak about utility computing and the events that have led up to its sudden resurgence in the marketplace.

All in all, TSX was one of the most informative VMWare events I have ever attended. I only had the chance to attend three sessions--CPU scheduling, ESX networking and DRS/HA--but all three were packed with useful information. (The slides linked here are from a TSX conference in Nice, Italy, April 3-5, 2007. They are a little different from the slides I saw, but are similar enough to communicate the basic concepts.)

If you don't know much about VMWare CPU scheduling, check out that deck. Sure, it's basic scheduling stuff, but it is very helpful when it comes to understanding how VMWare settings affect processor share. The networking deck is also critical if you must deploy networked applications to virtual machines.

The DRS/HA deck has some helpful tips, but it also clearly demonstrates the limited scale of DRS/HA. A limit of 16 physical nodes per HA cluster, for example, is going to be problematic for most medium to large data centers. Furthermore, these are very server-centric technologies; the concept of Service Level Automation is clearly missing, as there is no notion at all of a service or application to be measured. They are hinting at a few new app-level monitors in a later release, but I just don't think monitoring service levels from a business perspective is very important to VMWare.

Bill's speech to the IT department of a large manufacturer was very interesting, if for no other reason than that it clearly spelled out the argument for reducing complexity in the data center. (For a quick and dirty version of the argument, see this article.) We are definitely at a crossroads now; IT can choose to attack complexity with people or with technology. Most of us are betting technology will win. Furthermore, Bill told the assembled techies, the early adopters of any platform technology get the best jobs when that platform becomes mainstream. Almost nobody is predicting that utility computing will fail in the long term, so now is the time to jump aboard and get involved.

I should have some time to complete the Service Level Automation Deconstructed series this week. Stay tuned for more.