Showing posts with label measure. Show all posts

Thursday, May 22, 2008

Cassatt Announces Active Response 5.1 with Demand Based Policies

Ken Oestreich blogged recently about the very cool, probably landmark release of Cassatt that just became available, Cassatt Active Response 5.1. He very eloquently runs down the biggest feature--demand based policies--so I won't repeat all of that here. What I thought I would do instead is relate my personal thoughts on monitoring based policies and how they are the key disruptive technology for data centers today.

To be sure, everyone is talking about server virtualization in the data center market today, and that's fine. Its core short-term benefit, physical system consolidation and increased utilization, is key for cost-constrained IT departments, and features such as live migration and automatic backup are creating new opportunities that should be carefully considered. However, virtualization alone is limited in its applications, and does little to actually optimize a data center over time. (This is why VMware is emphasizing management over just virtualizing servers these days.)

The technology that will make the long-term difference is resource optimization: applying automation technologies to tune how and when physical and virtual infrastructure is used to solve specific business needs. It is the automation software that will really change the "deploy and babysit" culture of most data centers and labs today. The new description will be more like "deploy and ignore".

To really optimize resource usage in real time, the automation software must use a combination of monitoring (aka "measure"), a policy engine or other logic system (aka "analyze") and interfaces to the control systems of the equipment and software it is managing (aka "respond"). It turns out that the "respond" part of the equation is actually pretty straightforward--lots of work, but straightforward. Just write "driver"-like components that know how to talk to various data center equipment (e.g. Windows, DRAC, Cisco NX-OS, NetApp Data ONTAP, etc.), and that handle error conditions by directly responding or by forwarding the information to the policy engine.
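To make the measure/analyze/respond split concrete, here is a minimal sketch of one pass through such a loop. Everything in it is an assumption made for illustration: the `Policy` and `LoggingDriver` names, the `cpu_pct` metric and the thresholds are hypothetical, and none of it reflects Cassatt's actual interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class Policy:
    """Hypothetical rule: when `condition` holds for the metrics, request `action`."""
    name: str
    condition: Callable[[Metrics], bool]
    action: str


class LoggingDriver:
    """Stand-in for a "driver"-like component; a real one would talk to the equipment."""

    def __init__(self) -> None:
        self.performed: List[str] = []

    def respond(self, action: str) -> None:
        # Here we only record the request instead of touching real hardware.
        self.performed.append(action)


def analyze(metrics: Metrics, policies: List[Policy]) -> List[str]:
    """Return the actions requested by every policy whose condition is met."""
    return [p.action for p in policies if p.condition(metrics)]


def control_cycle(measure: Callable[[], Metrics],
                  policies: List[Policy],
                  driver: LoggingDriver) -> List[str]:
    """One pass of measure -> analyze -> respond."""
    metrics = measure()                    # measure
    actions = analyze(metrics, policies)   # analyze
    for action in actions:                 # respond
        driver.respond(action)
    return actions


# Assumed thresholds, chosen only for the example.
policies = [
    Policy("scale-out", lambda m: m["cpu_pct"] > 80.0, "add_server"),
    Policy("scale-in", lambda m: m["cpu_pct"] < 10.0, "remove_server"),
]
driver = LoggingDriver()
result = control_cycle(lambda: {"cpu_pct": 92.5}, policies, driver)
print(result)
```

Note that the driver knows nothing about why an action was requested; that separation is what keeps the "respond" side a matter of writing more drivers rather than more logic.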

The other two, however, require more immediate configuration by the end user. Measure and analyze, in fact, are where the entire set of Service Level Automation (SLAuto) parameters is defined and acted on. So, this is where the key user interface between the SLAuto system and end user has to happen.

What Cassatt has announced is a new user interface to define demand based policies as the end user sees fit. For example, what defines an idle server? Some systems use very little CPU while they wait for something to happen (at which point they get much busier), so simply measuring CPU isn't good enough in those cases. Ditto for memory in systems that are compute intensive but handle very little state.

What Cassatt did that is so brilliant (and so unique) is to allow the end user to leverage the full range of SNMP attributes for their OS, as well as JMX and even scripts running on the monitored system, to create expressions that define an idle metric that is right for that system. For example, on a test system you may in fact say that a system is idle when the master test controller software indicates that no test is being run on that box. On another system, you may say it's idle when no user accounts are currently active. It's up to you to define when to attempt to shut down a box, or reduce capacity for a scale-out application.

Even when such an "idle" system is identified, Cassatt gives you the ability to go further and write some "spot checks" to make sure the system is actually OK to shut down. For example, in the aforementioned test system, Cassatt may determine that it's worth trying to power down a system, but a spot check could be run to determine whether a given process is still running, or whether an administrator account is actively logged in to the box, either of which would indicate to Cassatt that it should ignore that system for now.
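The two-stage idea, a user-defined idle expression followed by spot checks that can veto the shutdown, can be sketched roughly as follows. This is an illustration under assumptions, not Cassatt's expression language: the sample keys (`tests_running`, `active_users`, `build_processes`, `admin_sessions`) are hypothetical stand-ins for values that would come from SNMP, JMX or a script.

```python
from typing import Callable, Dict, List

Sample = Dict[str, float]


def is_idle(sample: Sample) -> bool:
    """Hypothetical idle expression for a test box: no test running, no active users."""
    return sample["tests_running"] == 0 and sample["active_users"] == 0


def build_still_running(sample: Sample) -> bool:
    """Spot check: veto the shutdown while a build process is alive."""
    return sample.get("build_processes", 0) > 0


def admin_logged_in(sample: Sample) -> bool:
    """Spot check: veto the shutdown while an administrator session is open."""
    return sample.get("admin_sessions", 0) > 0


def ok_to_power_down(sample: Sample,
                     idle_expr: Callable[[Sample], bool],
                     spot_checks: List[Callable[[Sample], bool]]) -> bool:
    """Power down only if the idle expression holds and no spot check objects."""
    if not idle_expr(sample):
        return False
    return not any(check(sample) for check in spot_checks)


checks = [build_still_running, admin_logged_in]
quiet_box = {"tests_running": 0, "active_users": 0}
admin_box = {"tests_running": 0, "active_users": 0, "admin_sessions": 1}
print(ok_to_power_down(quiet_box, is_idle, checks))
print(ok_to_power_down(admin_box, is_idle, checks))
```

The first box passes both stages; the second looks idle by the metric but is left alone because of the open administrator session.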

I know of no one else who has this level of GUI-configurable monitor/analyze/respond sophistication today. If anyone wants to challenge that, feel free. Now that I no longer work at Cassatt, I'd be happy to learn about (and write about) alternatives in the marketplace. Just remember that it has to be easy to configure and execute these policies, and scripting the policies themselves is not good enough.

It is clear from the rush to release resource optimization products for the cloud, such as RightScale, Scalr, and others, that this will be a key feature for distributed systems moving forward. In my opinion, Cassatt has launched itself into the lead spot for on-premises enterprise utility computing. I can't wait to see who responds with the next great advancement.

Disclaimer: I am a Cassatt shareholder (or soon will be).

Tuesday, September 04, 2007

An easy way to get started with SLAuto

It's been an interesting week, leading up to the Labor Day weekend, but as of this morning I get to talk more openly about one project that has been taking a great deal of my time. As I have blogged about Service Level Automation ("SLAuto"), it may have dawned on some of you that achieving nirvana here means changing a lot about your current architecture and practices.

For example, decoupling software from hardware is easy to say, but requires significant planning and execution to implement (though this can be simplified somewhat with the right platform). Building the correct monitors, policies and interfaces is also time intensive work that requires the correct platform for success. However, as noted before, the biggest barriers to implementing SLAuto and utility computing are cultural.

There is an opportunity out there right now to introduce SLAuto without all of the trappings of utility computing, especially the difficult decoupling of software from hardware. It is an opportunity that Silicon Valley is going ga-ga over, and it addresses a real problem with real dollar costs for every data center on the planet.

The opportunity is energy consumption management, aka the "green data center".

Rather than pitch Cassatt's solution directly, I prefer to talk about the technical opportunity as a whole. So let's evaluate what is going on in the "GDC" space these days. As I see it, there are three basic technical approaches to "green" right now:

  1. More efficient equipment, e.g. more power efficient chips, server architectures, power distribution systems, etc.
  2. More efficient cooling, e.g. hot/cold aisles, liquid cooling, outside air systems, etc.
  3. Consolidation, e.g. virtualization, mainframes, etc.

Still, there is something obvious missing here: no matter which of these technologies you consider, not one of them is actually going to turn off unused capacity. In other words, while everyone is working to build a better light bulb or to design your lighting so you need fewer bulbs, no one is turning off the lights when no one is in the room.

That's where SLAuto comes in. I contend that there are huge tracts of computing in any large enterprise where compute capacity runs idle for extended periods. Desktop systems are certainly one of the biggest offenders, as are grid computing environments that are not pushed to maximum capacity at all times. However, possibly the biggest offender in any organization that does in-house development, extensive packaged system customization or business system integration is the dev/test environment.

Imagine such a lab where capacity that would otherwise sit unused each evening and weekend, or for all but two weeks of a typical development cycle, or at all times except when testing a patch to a three-year-old rev of the product, was shut down until needed. Turned off. Non-operational. Idle, but not idling.

Of course, most lab administrators probably feel extremely uncomfortable with this proposition. How are you going to do this without affecting developer/QA productivity? How do you know it's OK to turn off a system? Why would my engineers even consider allowing their systems to be managed this way?

SLAuto addresses these concerns by simply applying intelligence to power management. A policy-based approach means a server can be scheduled for shutdown each evening (say, at 7PM), but be evaluated before shutdown against a set of policies that determine whether it is actually OK to complete the shut down.

Some example policies might be:

  • Are certain processes running that indicate a development/build/test task is still underway?
  • Is a specific user account logged in to the system right now?
  • Has disk activity been extremely low for the last four hours?
  • Did the owner of the server or one of his/her designated colleagues "opt-out" of the scheduled shutdown for that evening?

Once these policies are evaluated, we can see if the server meets the criteria to be shut down as requested. If not, keep it running. Such a system needs to also provide interfaces for both the data center administrators and the individual server owners/users to control the power state of their systems at all times, set policies and monitor power activities for managed servers.
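As a rough sketch of how the example policies above might be evaluated at the scheduled shutdown time, consider the following. All of the names and values are assumptions made for illustration: the process names, the protected account, and the disk activity threshold are hypothetical, and a real system would gather this state from its own monitors.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ServerState:
    """Snapshot of one server taken at the scheduled shutdown time (hypothetical fields)."""
    name: str
    running_processes: Set[str] = field(default_factory=set)
    logged_in_users: Set[str] = field(default_factory=set)
    disk_iops_last_4h: List[float] = field(default_factory=list)
    opted_out: bool = False


# Assumed values, chosen only for the example.
BUSY_PROCESSES = {"make", "ant", "junit"}
PROTECTED_USERS = {"buildmaster"}
LOW_DISK_IOPS = 5.0


def reasons_to_stay_up(state: ServerState) -> List[str]:
    """Return every policy that objects to the shutdown; an empty list means go ahead."""
    reasons = []
    if state.running_processes & BUSY_PROCESSES:
        reasons.append("build/test process running")
    if state.logged_in_users & PROTECTED_USERS:
        reasons.append("protected user logged in")
    # Be conservative: with no samples at all, assume the disk was not quiet.
    if not state.disk_iops_last_4h or max(state.disk_iops_last_4h) > LOW_DISK_IOPS:
        reasons.append("disk not quiet for four hours")
    if state.opted_out:
        reasons.append("owner opted out")
    return reasons


def should_shut_down(state: ServerState) -> bool:
    return not reasons_to_stay_up(state)


quiet = ServerState("lab-01", disk_iops_last_4h=[0.5, 1.0])
busy = ServerState("lab-02", running_processes={"make"}, disk_iops_last_4h=[0.5])
print(should_shut_down(quiet), reasons_to_stay_up(busy))
```

Returning the list of objections rather than a bare yes/no is a deliberate choice here: it gives server owners and administrators something to read when a box they expected to power down stays up.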

I'll talk more about this in the coming week, but I welcome your input. Would you shut down servers in your lab? Your grid environment? Your production environment? What are your concerns with this approach? What policies come to mind that would be simple and/or difficult to implement?

Monday, April 30, 2007

What do SOA and EDA have to do with SLA?

I've been launching off of Todd Biske's blog roll into the world of SOA and EDA blogging. I'm actually kind of saddened that my voyage into the world of infrastructure automation has pulled me so far from a world in which I was an early practitioner. (My career at Forte Software introduced me to service-oriented architectures and event-based systems long before even Java took off.) I love what the blogging world is doing for software architecture (and has been doing for some time now), and I feel like a kid in a candy store with all the cool ideas I've been running across.

One blog that has been capturing my interest is Jack van Hoof's "SOA and EDA". I love a blog with real patterns, term definition, and a passion for its subject matter. All put together by someone who can get an article published.

The article is actually very interesting to me from a Service Level Automation perspective. Jack captures his thoughts on the importance of building agile software architectures in the following paragraph:

Everything is moving toward on-demand business where service providers react to impulses - events - from the environment. To excel in a competitive market, a high level of autonomy is required, including the freedom to select the appropriate supporting systems. This magnified degree of separation creates a need for agility; a loose coupling between services so as to support continuous, unimpeded augmentation of business processes in response to the changing composition of the organizational structure.

(Emphasis mine.)

The only thing I would change about Jack's statement above is replacing the words "a loose coupling between services" with "a loose coupling between services and between services and infrastructure" and changing "composition of the organizational structure" to "composition of the organizational structure and infrastructure environment". (Some may have issues with the latter, but I don't mean that services should be written with specific technology in mind--just the opposite; they should be written with an eye towards technology independence.)

This is why I have been emphasizing lately the need to view the measure activity through the lens of both business and technical measures. Some of the business events thrown by an EDA may very well indicate the need to change the infrastructure configuration (e.g. if the stock market sees a 20% rise in volume in a matter of three minutes, someone may want to add capacity to those trading systems). However, the technical events from a software system (e.g. thread counts or I/O latency) may also indicate the need to change infrastructure configuration on the fly.
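One way to picture business and technical events feeding the same infrastructure decision is a single rule table keyed by event kind and name. This is only a sketch under assumptions: the `Event` type, the event names and every threshold except the 20% volume rise are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass
class Event:
    kind: str     # "business" or "technical"
    name: str
    value: float


# Thresholds at or above which more capacity is requested.
# Only the 20% volume figure comes from the text; the rest are assumed.
RULES: Dict[Tuple[str, str], float] = {
    ("business", "volume_rise_pct_3min"): 20.0,
    ("technical", "thread_count"): 500.0,
    ("technical", "io_latency_ms"): 50.0,
}


def capacity_request(event: Event) -> Optional[str]:
    """Return a capacity request if the event crosses its threshold, else None."""
    threshold = RULES.get((event.kind, event.name))
    if threshold is not None and event.value >= threshold:
        return f"add capacity: {event.name} reached {event.value:g} (threshold {threshold:g})"
    return None


print(capacity_request(Event("business", "volume_rise_pct_3min", 25.0)))
print(capacity_request(Event("technical", "io_latency_ms", 12.0)))
```

The point of the shared table is that the infrastructure side does not care whether the trigger came from the trading floor or from a thread pool; both arrive as measurements with thresholds.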

I wish I could spend more time collaborating with SOA architects and "tacticians". In fact, I have been speaking with Ken Oestreich about exactly this. If you are in the SOA space, and interested in talking about how SOA, EDA and SLA interconnect, let me know by commenting below. (Be sure to let me know how to contact you.) At the very least, think about how infrastructure will measure the performance of your software systems as you start your next development iteration.

Thursday, March 29, 2007

Service Level Automation Deconstructed: Measuring Service Levels

This is the first of three in my series analyzing the key assumptions behind Service Level Automation. Specifically, today I want to focus on the measurement of business systems, and the concepts behind translating those measurements into service level metrics.

Rather than trying to do an exhaustive coverage of this topic (and the other topics in this series) in a single post, what I am going to do is provide a "first look" post now, then use labels when follow-up posts have relevant information. The label for this topic will be "measure".

In my next installment, I'll introduce analysis of those metrics against the service level objectives (SLOs) the business requires. That post, and future related posts, will be labeled with "analyze".

In the final installment of the series, I'll describe the techniques and technologies available to digitally manipulate these systems so that they run within SLO parameters. Posts related to that topic will be labeled "respond".

As noted earlier, my objective is to survey the technologies, academics, etc., of each of these topics in an attempt to enlighten you about the science and technology that enables service level automation.

How do we measure quality of service?

Measuring quality of service is a complex problem, though not because it is hard to measure information systems and business functionality. I (and I bet you) could list dozens of technical measurements that can be made on an application or service that would reflect some aspect of its current health. For example:

  • System statistics such as CPU utilization or free disk space, as reported by SNMP
  • Response to a ping or HTTP request
  • Checksum processing on network data transfers
  • Any of dozens of Web Services standards

The real problem is that human perception of quality of service isn't (typically) based on any one of these measurements, but on a combination of measurements, where the specific combination may change based on when and how a given business function is being used.

For example, how do you measure the utilization of a Citrix environment? Measuring sessions/instances is a good start, but--as noted before with WTS--what happens when all sessions consume a large amount of CPU at once? CPU utilization, in turn, could fluctuate wildly as sessions are more or less active. Then again, what about memory utilization or I/O throughput? These could become critical completely independently from the others already mentioned.

No, what is needed is more mathematical--one index (or a couple of indexes) of sorts, generated from a combination of the base metrics retrieved from the managed system.
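As a minimal sketch of what such an index could look like, the function below normalizes each base metric against an assumed capacity and reports the largest ratio, so that whichever resource is closest to its limit drives the index. The metric names and capacity figures are hypothetical, and taking the maximum is just one choice; a weighted average would be another.

```python
from typing import Dict


def utilization_index(metrics: Dict[str, float], capacity: Dict[str, float]) -> float:
    """Return the highest metric/capacity ratio, i.e. the tightest bottleneck."""
    return max(metrics[name] / capacity[name] for name in capacity)


# Assumed capacities for a single hypothetical Citrix host.
capacity = {"sessions": 100.0, "cpu_pct": 100.0, "memory_gb": 32.0, "io_mbps": 200.0}

# Few sessions, but they are all busy: session count alone would look healthy.
sample = {"sessions": 40.0, "cpu_pct": 90.0, "memory_gb": 8.0, "io_mbps": 50.0}

index = utilization_index(sample, capacity)
print(f"utilization index: {index:.2f}")
```

With these numbers the session count sits at 40% of capacity while the index reports the CPU-bound reality, which is exactly the case the session metric alone would miss.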

There are tools that do this. They range from the basic capabilities available in a good automation tool, to the sophisticated evaluation and computation available in a more specialized monitoring tool.

What I am still searching for are standard metrics being collected by these tools, especially industry standard metrics and/or indexes that demonstrate the health of a datacenter or its individual components. I'll talk more about what I find in the future, but welcome you to contribute here with links, comments, etc. to point me in the right direction.

Thursday, March 22, 2007

Service Level Automation Deconstructed: Introduction

Service Level Automation starts with three simple premises:

* The factors contributing to software service quality can be measured electronically.

* Runtime targets indicating high quality of service can be defined for those measurements.

* Systems involved in delivering software functionality can be manipulated to keep those measurements within the runtime targets.

I think the support for each of these premises should be explored more deeply, so I plan to begin a little survey of the technologies and academics over the next few weeks. The idea is to get a good sense of what standards/technologies/concepts/etc. can be used to meet the requirements of each premise. I also hope to discuss how a system smart enough to take advantage of them can save a large datacenter money, both in direct costs and in losses due to service level failures.

Why Service Level Automation? I wrote about this earlier. However, as a quick reminder, think of service level automation as meeting this objective:

Delivering the quantity and quality of service flow required by the business using the minimum resources required to do so.
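A small worked example may help show what "minimum resources" means in practice. The sketch below assumes a deliberately simple model that is not from the original discussion: each server handles a fixed number of requests per second, and quality of service is protected by never planning above a target utilization. All of the numbers are hypothetical.

```python
import math


def servers_needed(demand_rps: float,
                   per_server_rps: float,
                   target_utilization: float) -> int:
    """Smallest server count that carries the demand without exceeding the target utilization."""
    usable_rps = per_server_rps * target_utilization
    return math.ceil(demand_rps / usable_rps)


# Assumed figures: 100 requests/s per server, planned to run at no more than 75% busy.
peak = servers_needed(500.0, 100.0, 0.75)
overnight = servers_needed(60.0, 100.0, 0.75)
print(peak, overnight)
```

Under these assumptions the difference between the peak and overnight counts is the capacity that could be powered down or reassigned when demand drops.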

I've been quite busy both at work and at home, so I'm hoping to use this exercise as a way to increase my posting frequency. Stay tuned for more.