Monday, February 26, 2007

When CPU utilization is not enough...

Where have I been, you might ask?

Much to my delight, the last few weeks have been filled with customer activity, ranging from helping with a Service Level Automation-enabled appliance for a major software company, to assisting the financial wing of one of the world's largest manufacturers in experiencing firsthand the benefits of utility computing.

The latter runs an application that is highly dependent on Windows Terminal Services (WTS) to deliver a client-server UI to thousands of retail outlets worldwide. Uptime is critical to this application, as customers will go elsewhere for financing if it doesn't confirm credit within minutes of a purchase decision.

Unfortunately, WTS is also a very inefficient consumer of server payload. It is a session-based infrastructure, which means that a user will be attached to a specific server for their desktop access until they either log out or are kicked off. If 15 user sessions share a physical server, there is no way to predict the load on the system. All 15 sessions could be idle, or all could quickly start consuming cycles simultaneously.

This gives me my first really good example of when CPU and/or memory utilization are not good Service Level metrics on their own. Imagine an environment using WTS to support hundreds of users. These users use their Windows sessions to run a variety of tasks, much like any Windows user. Some tasks use a high level of CPU and memory, others very little. Quite often, the session will sit idle for several minutes.

Now, if you create lots of sessions because the CPU is idle, you could end up with problems if they all become active at the same time (say, right after lunch). If you stop creating sessions on a server because CPU utilization is high, you may end up with a highly underutilized server when one user's game of Quake wraps up.

That's not to say that CPU or memory utilization isn't an important part of the Service Level "equation". The truth is, there are several metrics that apply to WTS capacity: sessions, CPU utilization, memory utilization, licenses, etc. Since determining Service Level compliance probably involves evaluating the relationships between several of these metrics at once, there will most likely be one or two compound metrics, based on mathematical equations that combine these "root" metrics in such a way that reasonable thresholds can be set.
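To make the idea of a compound metric concrete, here is a minimal sketch in Python. It is purely illustrative and not what the pilot actually uses: the names `ServerSample`, `compound_load` and `can_accept_session`, the per-server session cap, and the 0.8 threshold are all assumptions of mine. It simply takes the most constrained of three "root" metrics, so a server full of idle sessions and a server with one CPU-hungry user both read as "busy".

```python
from dataclasses import dataclass


@dataclass
class ServerSample:
    """One observation of a WTS server (all fields are hypothetical)."""
    sessions: int       # sessions currently attached to the server
    max_sessions: int   # assumed per-server session cap
    cpu_util: float     # CPU utilization, 0.0 to 1.0
    mem_util: float     # memory utilization, 0.0 to 1.0


def compound_load(sample: ServerSample) -> float:
    """Combine the root metrics by taking the most constrained one."""
    session_fill = sample.sessions / sample.max_sessions
    return max(session_fill, sample.cpu_util, sample.mem_util)


def can_accept_session(sample: ServerSample, threshold: float = 0.8) -> bool:
    """Allow a new session only while the compound load is under the threshold."""
    return compound_load(sample) < threshold


# Idle CPU, but nearly full of sessions that could all wake up after lunch.
after_lunch_risk = ServerSample(sessions=14, max_sessions=15, cpu_util=0.05, mem_util=0.30)
# Few sessions, but one of them is pegging the CPU.
quake_player = ServerSample(sessions=3, max_sessions=15, cpu_util=0.95, mem_util=0.40)
# Few sessions and plenty of headroom.
quiet_server = ServerSample(sessions=3, max_sessions=15, cpu_util=0.20, mem_util=0.40)
```

Taking the maximum is only one possible equation; a weighted sum, or a per-session reserve of CPU and memory, would be just as easy to plug in once real data shows which relationships matter.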

Another interesting observation is that this is a lot like the Java EE Service Level Automation problem. (Thanks to Luis Cuyun for pointing this out.) While most horizontally scalable application tiers can be scaled up and down as a unit (i.e. "add a node/remove any node"), app servers, hypervisors and (now) WTS all must be monitored as a unit, but managed on a per-server basis (in this case, "add a node/remove this specific node"). This is because the "instances" that each of these software resources hosts are "sticky" to a server (VMotion notwithstanding). When capacity is no longer needed, you don't want to shut down just any server; you want to remove only the specific servers with no live sessions remaining on them.
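The "remove this specific node" rule can be sketched in a few lines of Python. Again, this is an illustration under my own assumptions: the function name `removable_servers`, the server names, and the idea that the tier-wide monitoring has already produced a `servers_needed` number are all hypothetical.

```python
from typing import Dict, List


def removable_servers(session_counts: Dict[str, int], servers_needed: int) -> List[str]:
    """Pick the specific servers that can be shut down right now.

    session_counts maps a server name to its number of live sessions.
    servers_needed is how many servers the tier as a whole still requires.
    Only servers with zero live sessions are ever returned, and never more
    of them than the current surplus.
    """
    surplus = len(session_counts) - servers_needed
    if surplus <= 0:
        return []
    idle = sorted(name for name, count in session_counts.items() if count == 0)
    return idle[:surplus]


# Hypothetical farm: two servers have no live sessions left.
farm = {"wts1": 0, "wts2": 4, "wts3": 0, "wts4": 7}
```

Note that the decision to shrink is made for the tier as a whole, while the choice of which server goes is made per server. If the surplus is larger than the number of empty servers, the rest simply have to wait until their sessions end.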

(Speaking of VMotion, one of the things that both Java EE and WTS will require to be really optimizable is the ability to move live services/sessions from one server to another in real-time. Anyone know of a technology addressing either of these?)

The good news for a good Service Level Automation environment (*ahem*) is that if one of these problems (Java EE, virtual servers or WTS) is solved correctly, the same basic technology can be applied to all of them. That's not to say that anyone is doing this for WTS today (to my knowledge, no one is), but I like the idea that the use cases that apply to Java EE SLA also apply to WTS SLA.

I'm hoping to have more to write about this as this pilot continues. In the meantime, anyone with Windows experience is welcome and encouraged to contribute their two cents to this discussion. In particular, are there any tried-and-true service level metrics for WTS that are being used out there? In general, there are so many moving parts here that I am sure there are many critical factors to Service Level Automation of WTS that I have not covered, or even considered.

1 comment:

govind said...

Real service level automation is only possible when one can reliably say x app uses y resources at z load and in general takes w. This is very unreliable to prove right now because vendors/software writers make decisions based on generic vs high end software and put artificial or "engineered" workarounds for resource limitations throughout the code.