Showing posts with label analyze. Show all posts

Thursday, May 22, 2008

Cassatt Announces Active Response 5.1 with Demand Based Policies

Ken Oestreich blogged recently about the very cool, probably landmark release of Cassatt that just became available, Cassatt Active Response 5.1. He very eloquently runs down the biggest feature--demand based policies--so I won't repeat all of that here. What I thought I would do instead is relate my personal thoughts on monitoring based policies and how they are the key disruptive technology for data centers today.

To be sure, everyone is talking about server virtualization in the data center market today, and that's fine. Its core short-term benefit, physical system consolidation and increased utilization, is key for cost-constrained IT departments, and features such as live motion and automatic backup are creating new opportunities that should be carefully considered. However, virtualization alone is limited in its applications, and does little to actually optimize a data center over time. (This is why VMWare is emphasizing management over just virtualizing servers these days.)

The technology that will make the long term difference is resource optimization: applying automation technologies to tuning how and when physical and virtual infrastructure is used to solve specific business needs. It is the automation software that will really change the "deploy and babysit" culture of most data centers and labs today. The new description will be more like "deploy and ignore".

To really optimize resource usage in real time, the automation software must use a combination of monitoring (aka "measure"), a policy engine or other logic system (aka "analyze") and interfaces to the control systems of the equipment and software it is managing (aka "respond"). It turns out that the "respond" part of the equation is actually pretty straightforward--lots of work, but straightforward. Just write "driver"-like components that know how to talk to various data center equipment (e.g. Windows, DRAC, Cisco NX-OS, NetApp Data ONTAP, etc.), as well as handle error conditions by directly responding or forwarding the information to the policy engine.
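To make the measure/analyze/respond split concrete, here is a minimal Python sketch of one pass through such a loop. Everything in it (the Policy class, run_cycle, the metric names and the action names) is hypothetical and chosen for illustration only; it does not describe how Cassatt or any other product is built.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical names throughout: Policy, run_cycle, and the metric and
# action strings are illustrative assumptions, not any product's API.


@dataclass
class Policy:
    name: str
    violated: Callable[[Dict[str, float]], bool]  # "analyze": inspects the metrics
    action: str                                   # response to request on violation


def run_cycle(measure: Callable[[], Dict[str, float]],
              policies: List[Policy],
              respond: Callable[[str], None]) -> List[str]:
    """Run one measure -> analyze -> respond pass; return the actions requested."""
    metrics = measure()                 # measure: collect current readings
    requested = []
    for policy in policies:             # analyze: test each policy against them
        if policy.violated(metrics):
            respond(policy.action)      # respond: hand off to a driver-like component
            requested.append(policy.action)
    return requested


# Usage: a stubbed monitor and a "driver" that just records what it was asked to do.
sent: List[str] = []
policies = [
    Policy("too slow", lambda m: m["response_ms"] > 500, "add_capacity"),
    Policy("idle", lambda m: m["requests_per_min"] == 0, "power_down"),
]
actions = run_cycle(
    lambda: {"response_ms": 750.0, "requests_per_min": 12.0},
    policies,
    sent.append,
)
print(actions)
```

The point of the shape is that the respond side is only a callable: swapping in a different piece of equipment means swapping the driver, not the policies.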

The other two, however, require more immediate configuration by the end user. Measure and analyze, in fact, are where the entire set of Service Level Automation (SLAuto) parameters are defined and executed on. So, this is where the key user interface between the SLAuto system and end user has to happen.

What Cassatt has announced is a new user interface to define demand based policies as the end user sees fit. For example, what defines an idle server? Some systems use very little CPU while they wait for something to happen (at which point they get much busier), so simply measuring CPU isn't good enough in those cases. Ditto for memory in systems that are compute intensive but handle very little state.

What Cassatt did that is so brilliant (and so unique) is to allow the end user to leverage the full range of SNMP attributes for their OS, as well as JMX and even scripts running on the monitored system to create expressions that define an idle metric that is right for that system. For example, on a test system you may in fact say that a system is idle when the master test controller software indicates that no test is being run on that box. On another system, you may say it's idle when no user accounts are currently active. It's up to you to define when to attempt to shut down a box, or reduce capacity for a scale-out application.

Even when such an "idle" system is identified, Cassatt gives you the ability to go further and write some "spot checks" to make sure the system is actually OK to shut down. For example, in the aforementioned test system, Cassatt may determine that it's worth trying to power down a system, but a spot check could be run to determine if a given process is still running, or an administrator account is currently actively logged in to the box that would indicate to Cassatt that it should ignore that system for now.

I know of no one else that has this level of GUI configurable monitor/analyze/respond sophistication today. If anyone wants to challenge that, feel free. Now that I no longer work at Cassatt, I'd be happy to learn about (and write about) alternatives in the marketplace. Just remember that it has to be easy to configure and execute these policies, and scripting the policies themselves is not good enough.

It is clear from the rush to release resource optimization products for the cloud, such as RightScale, Scalr, and others, that this will be a key feature for distributed systems moving forward. In my opinion, Cassatt has launched itself into the lead spot for on-premises enterprise utility computing. I can't wait to see who responds with the next great advancement.

Disclaimer: I am a Cassatt shareholder (or soon will be).

Tuesday, September 04, 2007

An easy way to get started with SLAuto

It's been an interesting week, leading up to the Labor Day weekend, but as of this morning I get to talk more openly about one project that has been taking a great deal of my time. As I have blogged about Service Level Automation ("SLAuto"), it may have dawned on some of you that achieving nirvana here means changing a lot about your current architecture and practices.

For example, decoupling software from hardware is easy to say, but requires significant planning and execution to implement (though this can be simplified somewhat with the right platform). Building the correct monitors, policies and interfaces is also time intensive work that requires the correct platform for success. However, as noted before, the biggest barriers to implementing SLAuto and utility computing are cultural.

There is an opportunity out there right now to introduce SLAuto without all of the trappings of utility computing, especially the difficult decoupling of software from hardware. It is an opportunity that Silicon Valley is going ga-ga over, and it is a real problem with real dollar costs for every data center on the planet.

The opportunity is energy consumption management, aka the "green data center".

Rather than pitch Cassatt's solution directly, I prefer to talk about the technical opportunity as a whole. So let's evaluate what is going on in the "GDC" space these days. As I see it, there are three basic technical approaches to "green" right now:

  1. More efficient equipment, e.g. more power efficient chips, server architectures, power distribution systems, etc.
  2. More efficient cooling, e.g. hot/cold aisles, liquid cooling, outside air systems, etc.
  3. Consolidation, e.g. virtualization, mainframes, etc.

Still, there is something obvious missing here: no matter which of these technologies you consider, not one of them is actually going to turn off unused capacity. In other words, while everyone is working to build a better light bulb or to design your lighting so you need fewer bulbs, no one is turning off the lights when no one is in the room.

That's where SLAuto comes in. I contend that there are huge tracts of computing in any large enterprise where compute capacity runs idle for extended periods. Desktop systems are certainly one of the biggest offenders, as are grid computing environments that are not pushed to maximum capacity at all times. However, possibly the biggest offender in any organization that does in-house development, extensive packaged system customization or business system integration is the dev/test environment.

Imagine such a lab where capacity that will be unused each evening/weekend, or for all but two weeks of a typical development cycle, or at all times except when testing a patch to a three year old rev of product, was shut down until needed. Turned off. Non-operational. Idle, but not idling.

Of course, most lab administrators probably feel extremely uncomfortable with this proposition. How are you going to do this without affecting developer/QA productivity? How do you know it's OK to turn off a system? Why would my engineers even consider allowing their systems to be managed this way?

SLAuto addresses these concerns by simply applying intelligence to power management. A policy-based approach means a server can be scheduled for shutdown each evening (say, at 7PM), but be evaluated before shutdown against a set of policies that determine whether it is actually OK to complete the shut down.

Some example policies might be:

  • Are certain processes running that indicate a development/build/test task is still underway?
  • Is a specific user account logged in to the system right now?
  • Has disk activity been extremely low for the last four hours?
  • Did the owner of the server or one of his/her designated colleagues "opt-out" of the scheduled shutdown for that evening?

Once these policies are evaluated, we can see if the server meets the criteria to be shut down as requested. If not, keep it running. Such a system needs to also provide interfaces for both the data center administrators and the individual server owners/users to control the power state of their systems at all times, set policies and monitor power activities for managed servers.
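Here is one way the scheduled shutdown check could look in Python, as a sketch under stated assumptions. The ServerState fields, the 7PM schedule constant, the process and user names, and the disk activity threshold are all made up for illustration; a real system would take them from its own monitors and configuration.

```python
from dataclasses import dataclass, field
from datetime import datetime, time
from typing import List

# All names and thresholds below are illustrative assumptions.
SHUTDOWN_TIME = time(19, 0)                     # scheduled shutdown, 7PM
BUSY_PROCESSES = {"make", "ant", "test-runner"}  # signs of build/test work
PROTECTED_USERS = {"buildmaster"}               # accounts that block shutdown
LOW_DISK_OPS = 1000                             # "extremely low" over four hours


@dataclass
class ServerState:
    name: str
    running_processes: List[str] = field(default_factory=list)
    logged_in_users: List[str] = field(default_factory=list)
    disk_ops_last_4h: int = 0
    opted_out_tonight: bool = False


def reasons_to_stay_up(server: ServerState, now: datetime) -> List[str]:
    """Return every reason to keep the server running; empty means OK to shut down."""
    reasons = []
    if now.time() < SHUTDOWN_TIME:
        reasons.append("before scheduled shutdown time")
    if BUSY_PROCESSES & set(server.running_processes):
        reasons.append("build or test process still running")
    if PROTECTED_USERS & set(server.logged_in_users):
        reasons.append("protected user logged in")
    if server.disk_ops_last_4h >= LOW_DISK_OPS:
        reasons.append("disk activity not low over the last four hours")
    if server.opted_out_tonight:
        reasons.append("owner opted out of tonight's shutdown")
    return reasons


# Usage: a quiet server whose owner opted out for the evening.
server = ServerState("qa-07", running_processes=["sshd"], opted_out_tonight=True)
reasons = reasons_to_stay_up(server, datetime(2007, 9, 4, 19, 30))
print(reasons or "OK to shut down")
```

Returning the list of reasons, rather than a bare yes/no, gives both administrators and server owners something to look at when a box they expected to be off is still running.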

I'll talk more about this in the coming week, but I welcome your input. Would you shut down servers in your lab? Your grid environment? Your production environment? What are your concerns with this approach? What policies come to mind that would be simple and/or difficult to implement?

Monday, July 30, 2007

SLAuto vs. SLA

A while back, Eric Westerkamp over at eCloudM asked a simple question that got me thinking. Eric wondered aloud (in print?) whether "using the Term Service Level Automation (SLA) causes confusion when presenting the ideas and topics into the business community. I have most often seen SLA refer to Service Level Agreements. While similar in concept, they are very different in implementation."

So, to keep things clear in my blog, I will now use the acronym SLAuto for Service Level Automation, and retain the SLA moniker for Service Level Agreements. I hope this eliminates confusion and allows the market to talk more freely about Service Level Automation.

Speaking of SLA, though, Steve Jones posted a great example of the symbiotic relationship between the customer and the service provider in the SLA equation. To put it in the context of an enterprise data center, you could offer 100% up time, .1 sec response time and a 5 minute turnaround time, but it wouldn't be of any value if the customer's application was buggy, they were on a dial-up network and it took them six weeks to get the requirements right for a build-out.

Now, let's look at that in the context of SLAuto. To my eye, the service provider in an SLAuto environment is the infrastructure. The customer is any component or person that accesses or depends on any piece of that infrastructure. Thus, any SOA service can be a service provider in one context, and a customer in another. Even the policy engine(s) that automate the infrastructure can be thought of as a customer in the context of monitoring and management, and a service provider in terms of an interface for other customers to define service level parameters.

Steve's example hints that I could buy the "bestest", fastest, coolest high tech servers, switches and storage on the planet and it wouldn't increase my service levels if I couldn't deliver that infrastructure to the applications that required it quickly, efficiently and (most importantly) reliably. Or, for that matter, if those applications couldn't take advantage of it. So, if you're going to automate, your policy engine should be (you guessed it) quick, efficient and reliable. If it isn't, then your SLAs are limited by your SLAuto capabilities.

Not what you intended, I would think...

Friday, April 20, 2007

Service Level Automation Deconstructed: Analyzing Service Levels

This is the second post in my series providing a brief overview of the three critical assumptions of a Service Level Automation environment. Today I want to focus on the ways in which the metrics gathered from the "measure" capabilities of an SLA environment are evaluated to determine if and what action should be taken by the "response" capabilities.

Let me first acknowledge that my discussion of the measure capabilities included some analysis of simple metrics to create complex metrics. This is one piece of the analysis puzzle, and is a critical one to acknowledge. Ideally, all software and hardware systems would be designed to intelligently communicate the metrics that matter most to determine service levels. Where this consolidation occurs depends on the requirements of the environment:

  • Centralized approach: Gather fundamental data from target systems to central metrics processor and consolidate metrics there. The advantage here is having one place to maintain consolidation rules. The disadvantage is increased network traffic.
  • Decentralized approach: Gather fundamental data on the target system and do any analysis necessary to consolidate it into a simplified composite metric there, on the target system itself. Send only the composite metric to the core rules engine (which I will discuss next).
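The difference between the two approaches comes down to what goes over the wire. A small Python sketch, assuming three raw utilization readings between 0 and 1 and a made-up consolidation rule (take the worst of them):

```python
from typing import Dict

# Hypothetical consolidation rule and reading names, for illustration only.


def composite_load(raw: Dict[str, float]) -> float:
    """Collapse several raw 0..1 utilization readings into one: the worst of them."""
    return max(raw["cpu"], raw["memory"], raw["disk_io"])


def payload_centralized(raw: Dict[str, float]) -> Dict[str, float]:
    # Centralized: ship every raw reading; the central processor consolidates.
    return dict(raw)


def payload_decentralized(raw: Dict[str, float]) -> Dict[str, float]:
    # Decentralized: consolidate on the target, ship only the composite metric.
    return {"load": composite_load(raw)}


raw = {"cpu": 0.2, "memory": 0.7, "disk_io": 0.1}
central = payload_centralized(raw)
local = payload_decentralized(raw)
print(len(central), "values sent vs", len(local))
```

The trade is visible even at this size: the decentralized payload is smaller, but the consolidation rule now lives on every target instead of in one place.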

Metrics consolidation is not really the core analytics function of a Service Level Automation architecture, however. The key functions are actually the following:

  • Are metrics being received as expected? (A negative response would likely indicate a failure in the target component or the communication chain with that component)
  • Are the metrics within the business and IT service level goals set for that metric?
  • If metrics are outside of established service level goals, what response should be taken by the SLA environment?

Given my recent reading into complex event processing (CEP), this seems like at least a specialized form of event processing to me. The analysis capabilities of an SLA environment must constantly monitor the incoming metrics data stream, look for patterns of interest (namely goal violations, but who knows...) and trigger a response when conditions dictate.
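The three analysis questions above can be sketched as a single function applied to each incoming reading. This is a deliberately simple Python illustration, not a CEP engine; the Goal fields, the response names and the thresholds are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical names: Goal, analyze, and the response strings are
# illustrative assumptions, not part of any real rules engine.


@dataclass
class Goal:
    metric: str
    low: float             # below this, the goal is violated on the low side
    high: float            # above this, the goal is violated on the high side
    max_silence_s: float   # longest acceptable gap between readings
    response_if_low: str
    response_if_high: str


def analyze(goal: Goal, value: Optional[float],
            seconds_since_last: float) -> Optional[str]:
    """Return the response to trigger, or None if the reading is within goals."""
    # 1. Are metrics being received as expected?
    if value is None or seconds_since_last > goal.max_silence_s:
        return "investigate_target"
    # 2. Is the metric within its service level goals?
    # 3. If not, which response applies?
    if value < goal.low:
        return goal.response_if_low
    if value > goal.high:
        return goal.response_if_high
    return None


# Usage
goal = Goal("cpu_utilization", 0.10, 0.80, 60.0,
            "remove_capacity", "add_capacity")
print(analyze(goal, 0.95, 5.0))
```

A real event processor would look for patterns over a window of readings rather than judging each one in isolation, but the questions it asks are the same.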

The great thing about this particular EP problem is that well designed solutions can be replicated to all data centers using similar metrics and response mechanisms (e.g. power controllers, OSes, switch interfaces, etc.). Since there are actually relatively few components in the data center stack to be managed (servers [physical, virtual, application, etc.], network and storage), the rule set required to provide basic SLA capabilities is replicable across a wide variety of customer environments.

(That's not to say the rule set is simple...it's actually quite complex, and can be affected by new types of measurement and new technologies to be managed. Buy is definitely preferred over build in this space, but some customizability is always necessary.)

Finally, I'd also like to point out that there is a similar analysis function at the response end as at the measure end. Namely, it is often desirable for the response mechanism to take a composite action request and break it into discrete execution steps. The best example I can think of for this is a "power down" action sent from the SLA analysis environment to a server. Typical power controllers will take such a request, signal to the OS that a shutdown is imminent, whereupon the OS will execute any scripts and actions required before signalling that OS shutdown is complete. At that time, the power controller turns off power to the server.
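The "power down" example can be sketched as a composite action that expands into discrete steps. In this Python illustration the three callables stand in for whatever a real power controller and OS would expose; they, and the polling limit, are assumptions for the sake of the example.

```python
from typing import Callable, List

# Hypothetical sketch: the callables are stand-ins for a real power
# controller and OS interface, which this example does not model.


def power_down(signal_os_shutdown: Callable[[], None],
               os_shutdown_complete: Callable[[], bool],
               cut_power: Callable[[], None],
               max_polls: int = 10) -> List[str]:
    """Expand one 'power down' request into its execution steps; return a step log."""
    steps = []
    signal_os_shutdown()                 # tell the OS a shutdown is imminent
    steps.append("signalled OS")
    for _ in range(max_polls):           # a real controller would wait between polls
        if os_shutdown_complete():       # OS has run its scripts and finished
            steps.append("OS shutdown complete")
            cut_power()                  # only now remove power
            steps.append("power off")
            return steps
    steps.append("timed out; power left on")
    return steps


# Usage with stubs: the OS reports completion on the third poll.
answers = iter([False, False, True])
log: List[str] = []
steps = power_down(lambda: log.append("signal"),
                   lambda: next(answers),
                   lambda: log.append("cut"))
print(steps)
```

Leaving the power on when the OS never reports completion is a choice made for this sketch; a real controller might instead force the power off after a timeout.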

As with measure, I will use the label "analyze" to reflect future posts expanding on the analysis concept. As always, I welcome your feedback and hope you will join the SLA conversation.

Monday, April 16, 2007

Two articles mentioning Service Level Automation

I recently set up Google Alerts to inform me about references to Service Level Automation on the web. There were many articles returned this week (many of which involved Cassatt), but I found two additional articles of note. Each makes mention of Service Level Automation, and represents the growing interest in this approach.

The first is from the March 2006 issue of ACM Queue, entitled "Under New Management". The article was written by Duncan Johnston-Watt, the founder of Enigmatec. Johnston-Watt does an excellent job of outlining basic issues around one possible architecture for an autonomic data center. As expected for Enigmatec, it's a policy automation focused approach, and is, in fact, one of the few articles from a policy engine vendor that I have seen where the term Service Level Automation is used correctly.

Unfortunately, I don't necessarily agree that Johnston-Watt's architecture is optimal for enterprise data centers. (It requires development of process automation flows to "optimize operational processes"--a significant amount of work that is prone to introducing new inefficiencies. It is also agent based, which alters the footprint of the software stacks being run in the data center, and can negatively affect the execution and architecture of the applications being managed.) All in all, though, there is some excellent information here for those thinking about Service Level Automation holistically, across the entire data center.

The other article, entitled "Virtually Speaking: Xen Achieves Higher Enterprise Consciousness" was published April 6, 2007 on ServerWatch. In the last few paragraphs of the article, uXcomm's acquisition of Virtugo is covered. In it, uXcomm claims the combination of their Xen management tools and Virtugo's VMWare tools "fills a gap not just in uXcomm's portfolio but in the virtual landscape as well. Until now [...] there was a gap between service-level automation offerings and performance management products."

Hmmm. Not sure how providing SLA for only virtual servers counts as filling gaps...but, even so, I hope uXcomm is aware that everyone in this space realizes the need for resource optimization includes VM performance management. I guess my question would be, what is uXcomm doing about marrying Service Level Automation to the rest of the data center?

As a side note, I know that I owe two more articles on my Service Level Automation Deconstructed series. I am working on the "Analyze" overview now, but have discovered some interesting technology to discuss here that I am reading up on now.

Wednesday, April 11, 2007

Complexity and the Data Center

I just finished rereading a science book that has been tremendously influential on how I now think of software development, data center management and how people interact in general. Complexity: The Emerging Science at the Edge of Order and Chaos, by M. Mitchell Waldrop, was originally published in 1992, but remains today the quintessential popular tome on the science of complex systems. (Hint: READ THIS BOOK!)

John Holland (as told in Waldrop's history) defined complex systems as having the following traits:

  • Each complex system is a network of many "agents" acting in parallel
  • Each complex system has many levels of organization, with agents at any one level serving as the building blocks for agents at a higher level
  • Complex systems are constantly revising and rearranging their building blocks as they gain experience
  • All complex adaptive systems anticipate the future (though this anticipation is usually mechanical and not conscious)
  • Complex adaptive systems have many niches, each of which can be exploited by an agent adapted to fill that niche

Now, I don't know about you, but this sounds like enterprise computing to me. It could be servers, network components, software service networks, supply chain systems, the entire data center, the entire IT operations and organization, etc. What we are all building here is self organizing...we may think we have control, but we are all acting as agents in response to the actions and conditions imposed by all those other agents out there.

A good point about viewing IT as a complex system can be found in Johna Till Johnson's Networld article, "Complexity, crisis and corporate nets". Johna's article articulates a basic concept that I am still struggling to verbalize regarding the current and future evolution of data centers. We are all working hard to adapt to our environments by building architectures, organizations and processes that are resistant to failure. Unfortunately, the entire "ecosystem" is bound to fail from time to time. And there is no way to predict how or when. The best you can do is prepare for the worst.

One of the key reasons that I find Service Level Automation so interesting is that it provides a key "gene" to the increasingly complex IT landscape: the ability to "evolve" and "heal" at the physical infrastructure level. Combine this with good, resilient software architectures (e.g. SOA and BPM) and solid feedback loops (e.g. BAM, SNMP, JMX, etc.) and your job as the human "DNA" gets easier. And, as the dynamic and automated nature of these systems gets more sophisticated, our IT environments get more and more self organizing, learning new ways to optimize themselves (often with human help) even as the environment they are adapting to constantly changes.

In the end, I like to think that no matter how many boneheaded decisions corporate IT makes, no matter how many lousy standards or products are introduced to the "ecosystem", the entire system will adjust and continually attempt to correct for our weaknesses. In the end, despite the rise and fall of individual agents (companies, technologies, people, etc.), the system will continually work to serve us better...at least until that unpredictable catastrophic failure tears it all down and we start fresh.

Thursday, March 22, 2007

Service Level Automation Deconstructed: Introduction

Service Level Automation starts with three simple premises:

* The factors contributing to software service quality can be measured electronically.

* Runtime targets indicating high quality of service can be defined for those measurements.

* Systems involved in delivering software functionality can be manipulated to keep those measurements within the runtime targets.

I think the support for each of these premises should be explored more deeply, so I plan to begin a little survey of the technologies and academics over the next few weeks. The idea is to get a good sense of what standards/technologies/concepts/etc. can be used to meet the requirements of each premise. I also hope to discuss how a system smart enough to take advantage of them(*) can save a large datacenter money, both in direct costs and in losses due to service level failures.

Why Service Level Automation? I wrote about this earlier. However, as a quick reminder, think of service level automation as meeting this objective:

Delivering the quantity and quality of service flow required by the business using the minimum resources required to do so.

I've been quite busy both at work and at home, so I'm hoping to use this exercise as a way to increase my posting frequency. Stay tuned for more.