Ken Oestreich blogged recently about the very cool, probably landmark release of Cassatt that just became available, Cassatt Active Response 5.1. He very eloquently runs down the biggest feature--demand-based policies--so I won't repeat all of that here. What I thought I would do instead is relate my personal thoughts on monitoring-based policies and how they are the key disruptive technology for data centers today.
To be sure, everyone is talking about server virtualization in the data center market today, and that's fine. Its core short-term benefit, physical system consolidation and increased utilization, is key for cost-constrained IT departments, and features such as live motion and automatic backup are creating new opportunities that should be carefully considered. However, virtualization alone is limited in its applications, and does little to actually optimize a data center over time. (This is why VMware is emphasizing management over just virtualizing servers these days.)
The technology that will make the long term difference is resource optimization: applying automation technologies to tuning how and when physical and virtual infrastructure is used to solve specific business needs. It is the automation software that will really change the "deploy and babysit" culture of most data centers and labs today. The new description will be more like "deploy and ignore".
To really optimize resource usage in real time, the automation software must use a combination of monitoring (aka "measure"), a policy engine or other logic system (aka "analyze") and interfaces to the control systems of the equipment and software it is managing (aka "respond"). It turns out that the "respond" part of the equation is actually pretty straightforward--lots of work, but straightforward. Just write "driver"-like components that know how to talk to various data center equipment (e.g. Windows, DRAC, Cisco NX-OS, NetApp Data ONTAP, etc.), as well as handle error conditions by directly responding or forwarding the information to the policy engine.
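To make the measure/analyze/respond split concrete, here is a minimal sketch in Python. Everything in it (the `Policy` class, the probe and driver dictionaries, the `add_server` action) is a hypothetical illustration of the pattern, not Cassatt's actual design or API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Metrics = Dict[str, float]


@dataclass
class Policy:
    """Hypothetical policy: a condition over metrics plus a named action."""
    name: str
    condition: Callable[[Metrics], bool]
    action: str


def measure(probes: Dict[str, Callable[[], float]]) -> Metrics:
    """Measure: call every probe and collect its current reading."""
    return {name: probe() for name, probe in probes.items()}


def analyze(metrics: Metrics, policies: List[Policy]) -> List[str]:
    """Analyze: return the actions of all policies whose condition holds."""
    return [p.action for p in policies if p.condition(metrics)]


def respond(actions: List[str], drivers: Dict[str, Callable[[], None]]) -> List[str]:
    """Respond: hand each action to its driver; return actions with no driver."""
    unhandled = []
    for action in actions:
        driver = drivers.get(action)
        if driver is None:
            unhandled.append(action)  # forwarded back for the policy engine to handle
        else:
            driver()
    return unhandled


def run_once(probes, policies, drivers) -> List[str]:
    """One pass of the measure -> analyze -> respond loop."""
    return respond(analyze(measure(probes), policies), drivers)


if __name__ == "__main__":
    log: List[str] = []
    probes = {"cpu_pct": lambda: 92.0}  # stand-in for a real monitor
    policies = [Policy("scale-out", lambda m: m["cpu_pct"] > 80, "add_server")]
    drivers = {"add_server": lambda: log.append("powered on a spare server")}
    print(run_once(probes, policies, drivers), log)
```

In a sketch like this the drivers are the easy part. The user-facing work is in choosing the probes and writing the policy conditions.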
The other two, however, require more immediate configuration by the end user. Measure and analyze, in fact, are where the entire set of Service Level Automation (SLAuto) parameters are defined and executed on. So, this is where the key user interface between the SLAuto system and end user has to happen.
What Cassatt has announced is a new user interface to define demand-based policies as the end user sees fit. For example, what defines an idle server? Some systems use very little CPU while they wait for something to happen (at which point they get much busier), so simply measuring CPU isn't good enough in those cases. Ditto for memory in systems that are compute intensive but handle very little state.
What Cassatt did that is so brilliant (and so unique) is to allow the end user to leverage the full range of SNMP attributes for their OS, as well as JMX and even scripts running on the monitored system to create expressions that define an idle metric that is right for that system. For example, on a test system you may in fact say that a system is idle when the master test controller software indicates that no test is being run on that box. On another system, you may say it's idle when no user accounts are currently active. It's up to you to define when to attempt to shut down a box, or reduce capacity for a scale-out application.
Even when such an "idle" system is identified, Cassatt gives you the ability to go further and write some "spot checks" to make sure the system is actually OK to shut down. For example, in the aforementioned test system, Cassatt may determine that it's worth trying to power down a system, but a spot check could then determine whether a given process is still running, or whether an administrator account is actively logged in to the box, either of which would tell Cassatt to ignore that system for now.
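As a rough illustration of a per-system idle rule plus spot checks, consider the Python sketch below. The reading names (`tests_running`, `active_users`, `build_processes`, `admin_sessions`) are invented for the example. In practice they would come from SNMP, JMX, or a script on the monitored box, and this is not how Cassatt expresses its policies.

```python
from typing import Callable, Dict, List

Readings = Dict[str, float]
Check = Callable[[Readings], bool]


def idle_when_no_tests(r: Readings) -> bool:
    """Idle rule for a test box: the test controller reports no running tests."""
    return r.get("tests_running") == 0


def idle_when_no_users(r: Readings) -> bool:
    """Idle rule for another kind of box: no user accounts are active."""
    return r.get("active_users") == 0


def no_build_process(r: Readings) -> bool:
    """Spot check: no build process is running. A missing reading blocks shutdown."""
    return r.get("build_processes") == 0


def no_admin_login(r: Readings) -> bool:
    """Spot check: no administrator is logged in. A missing reading blocks shutdown."""
    return r.get("admin_sessions") == 0


def may_power_down(readings: Readings, idle_rule: Check, spot_checks: List[Check]) -> bool:
    """A box may be powered down only if it looks idle and every spot check passes."""
    if not idle_rule(readings):
        return False
    return all(check(readings) for check in spot_checks)


if __name__ == "__main__":
    readings = {"tests_running": 0, "build_processes": 0, "admin_sessions": 1}
    # Looks idle, but an administrator is logged in, so the box is left alone.
    print(may_power_down(readings, idle_when_no_tests, [no_build_process, no_admin_login]))
```

Treating a missing reading as a reason to keep the box running is a design choice made here for safety. A real system might instead report the monitoring gap to an administrator.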
I know of no one else that has this level of GUI configurable monitor/analyze/respond sophistication today. If anyone wants to challenge that, feel free. Now that I no longer work at Cassatt, I'd be happy to learn about (and write about) alternatives in the marketplace. Just remember that it has to be easy to configure and execute these policies, and scripting the policies themselves is not good enough.
It is clear from the rush to release resource optimization products for the cloud, such as RightScale, Scalr, and others, that this will be a key feature for distributed systems moving forward. In my opinion, Cassatt has launched itself into the lead spot for on premises enterprise utility computing. I can't wait to see who responds with the next great advancement.
Disclaimer: I am a Cassatt shareholder (or soon will be).
Thursday, May 22, 2008
Cassatt Announces Active Response 5.1 with Demand-Based Policies
Tuesday, September 04, 2007
An easy way to get started with SLAuto
It's been an interesting week, leading up to the Labor Day weekend, but as of this morning I get to talk more openly about one project that has been taking a great deal of my time. As I have blogged about Service Level Automation ("SLAuto"), it may have dawned on some of you that achieving nirvana here means changing a lot about your current architecture and practices.
For example, decoupling software from hardware is easy to say, but requires significant planning and execution to implement (though this can be simplified somewhat with the right platform). Building the correct monitors, policies and interfaces is also time intensive work that requires the correct platform for success. However, as noted before, the biggest barriers to implementing SLAuto and utility computing are cultural.
There is an opportunity out there right now to introduce SLAuto without all of the trappings of utility computing, especially the difficult decoupling of software from hardware. It is an opportunity that Silicon Valley is going ga-ga over, and it addresses a real problem with real dollar costs for every data center on the planet.
The opportunity is energy consumption management, aka the "green data center".
Rather than pitch Cassatt's solution directly, I prefer to talk about the technical opportunity as a whole. So let's evaluate what is going on in the "GDC" space these days. As I see it, there are three basic technical approaches to "green" right now:
- More efficient equipment, e.g. more power efficient chips, server architectures, power distribution systems, etc.
- More efficient cooling, e.g. hot/cold aisles, liquid cooling, outside air systems, etc.
- Consolidation, e.g. virtualization, mainframes, etc.
Still, there is something obvious missing here: no matter which of these technologies you consider, not one of them is actually going to turn off unused capacity. In other words, while everyone is working to build a better light bulb or to design your lighting so you need fewer bulbs, no one is turning off the lights when no one is in the room.
That's where SLAuto comes in. I contend that there are huge tracts of computing in any large enterprise where compute capacity runs idle for extended periods. Desktop systems are certainly one of the biggest offenders, as are grid computing environments that are not pushed to maximum capacity at all times. However, possibly the biggest offender in any organization that does in-house development, extensive packaged system customization or business system integration is the dev/test environment.
Imagine such a lab where capacity that will be unused each evening/weekend, or for all but two weeks of a typical development cycle, or at all times except when testing a patch to a three year old rev of product, was shut down until needed. Turned off. Non-operational. Idle, but not idling.
Of course, most lab administrators probably feel extremely uncomfortable with this proposition. How are you going to do this without affecting developer/QA productivity? How do you know it's OK to turn off a system? Why would my engineers even consider allowing their systems to be managed this way?
SLAuto addresses these concerns by simply applying intelligence to power management. A policy-based approach means a server can be scheduled for shutdown each evening (say, at 7PM), but be evaluated before shutdown against a set of policies that determine whether it is actually OK to complete the shut down.
Some example policies might be:
- Are certain processes running that indicate a development/build/test task is still underway?
- Is a specific user account logged in to the system right now?
- Has disk activity been extremely low for the last four hours?
- Did the owner of the server or one of his/her designated colleagues "opt-out" of the scheduled shutdown for that evening?
Once these policies are evaluated, we can see if the server meets the criteria to be shut down as requested. If not, keep it running. Such a system needs to also provide interfaces for both the data center administrators and the individual server owners/users to control the power state of their systems at all times, set policies and monitor power activities for managed servers.
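The scheduled-shutdown idea can be sketched in a few lines of Python. The process names, the protected account, and the `ServerState` fields below are all assumptions made up for illustration. Only the 7PM schedule and the four example policies come from the text above.

```python
from dataclasses import dataclass, field
from datetime import datetime, time
from typing import List, Set

SHUTDOWN_TIME = time(19, 0)                      # the 7PM schedule from the example
BUSY_PROCESSES = {"make", "ant", "test_runner"}  # hypothetical build/test processes
PROTECTED_USERS = {"buildmaster"}                # hypothetical account to watch for
MIN_QUIET_HOURS = 4.0                            # "extremely low" disk activity window


@dataclass
class ServerState:
    """Snapshot of what the monitors report for one server (illustrative fields)."""
    name: str
    running_processes: Set[str] = field(default_factory=set)
    logged_in_users: Set[str] = field(default_factory=set)
    hours_of_low_disk_activity: float = 0.0
    opted_out: bool = False


def reasons_to_stay_up(s: ServerState) -> List[str]:
    """Evaluate each policy; any reason returned here vetoes the shutdown."""
    reasons = []
    if s.running_processes & BUSY_PROCESSES:
        reasons.append("build/test process running")
    if s.logged_in_users & PROTECTED_USERS:
        reasons.append("protected user logged in")
    if s.hours_of_low_disk_activity < MIN_QUIET_HOURS:
        reasons.append("recent disk activity")
    if s.opted_out:
        reasons.append("owner opted out")
    return reasons


def should_shut_down(s: ServerState, now: datetime) -> bool:
    """Shut down only after the scheduled time and only if no policy objects."""
    return now.time() >= SHUTDOWN_TIME and not reasons_to_stay_up(s)


if __name__ == "__main__":
    lab = ServerState("lab-02", running_processes={"make"}, hours_of_low_disk_activity=6.0)
    print(should_shut_down(lab, datetime(2007, 9, 4, 19, 30)), reasons_to_stay_up(lab))
```

Returning the list of reasons, and not just a yes/no answer, gives administrators and server owners something to look at when a box unexpectedly stays on.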
I'll talk more about this in the coming week, but I welcome your input. Would you shut down servers in your lab? Your grid environment? Your production environment? What are your concerns with this approach? What policies come to mind that would be simple and/or difficult to implement?
Friday, July 20, 2007
A Helping Hand Comes In Handy Sometimes
You may remember my recent post on how data centers resemble complex adaptive systems. This description of a data center has a glaring difference from a true definition of complex adaptive systems, however; data centers require some form of coordinated management beyond what any single entity can provide. In a truly complex adaptive system, there would be no "policy engines" or even Network Operations Centers. Each server, each switch, each disk farm would attempt to adapt to its surroundings, and either survive or die.
Therein lies the problem, however. Unlike a biological system, or the corporate economy, or even a human society, data centers cannot afford to have one of their individual entities (or "agents" in complex systems parlance) arbitrarily disappear from the computing environment. They certainly cannot rely on "trial and error" to determine what survives and what doesn't. (Of course, in terms of human management of IT, this is often what happens, but never mind...)
Adam Smith called the force that guided selfish individuals to work together for the common benefit of the community the "invisible hand". The metaphor is good for explaining how decentralized adaptive systems can organize for the greater good without a guiding force, but the invisible hand depends on the failure of those agents who don't adapt.
Data centers, however, need a "visible hand" to quickly correct some (most?) agent failures. To automate and scale this, certain omnipotent and omnipresent management systems must be mixed into the data center ecology. These systems are responsible for maintaining the "life" of dying agents, particularly if the agents lose the ability to heal themselves.
Now, a topic for another post is the following: can several individual resource pools, each with their own policy engine, be joined together in a completely decentralized model?
Thursday, May 24, 2007
Service Level Automation Deconstructed: Respond
For the third and last in my series breaking down the three key assumptions behind Service Level Automation, I would like to focus on how SLA environments can control data center configuration in response to service level goal violations. These goal violations and the high level actions to be taken are determined by the analysis capability of the environment. Details of how to accomplish those high level actions, however, are decided and executed by the response function.
Essentially, the response function of an SLA environment is very much like the driver set that your operating system uses to translate high level actions (e.g. "store this file") to device specific actions ("Move head 32 steps to center, find block 4D5EF, etc."). The responsibility here is to provide the interface between the SLA analysis engine and specific standard or proprietary interfaces to everything from server hardware to network switches to operating systems and middleware.
I see the following key interface points in today's environments:
- Power Controllers/MPDUs: Job 1 of a service level automation environment is providing the resources required to meet the needs of the software environment, and only those resources. Turn those servers on when they are needed, and off when they are not. This includes virtual server hosts. (Examples: DRAC, iLO, RSA II, MPDUs)
- Operating Systems: Before you shut off that server, make sure you've "gently" shut down its software payload. Well written server payloads for automated environments will both start up and acquire initial state (if any), and shut down while preserving any necessary state without human intervention. However, from a communications perspective, each action starts with the OS. (Examples: Red Hat, SuSE, MS Windows, Sun Solaris)
- Middleware/Virtualization: It is interesting to note that many software payload components (e.g. an application server or a hypervisor) are both software to be managed, and computing capacity themselves. For example, an application server should be managed to specific service levels relating to its relationship with its host server (e.g. CPU utilization, thread counts, etc.), while also treated as a capacity resource for JavaEE applications and services. As such, these software containers should be managed for their own guest payloads much like a physical server would for the overall server payload. (Examples: BEA WebLogic, VMware ESX, XenSource XenEnterprise)
- Layer 2 Networking: In order to use a server to meet an application's needs, that server must have access to required networks. True automation requires that switch ports be reconfigured as necessary to ensure access to specifically the VLANs required by the payloads they will represent. (Examples: Cisco 3750, Extreme Summit400)
- Network Attached Storage (NAS): The beauty of NAS devices is that they can be dynamically attached to a software payload at startup, without requiring any hardware configuration beyond the Layer 2 configuration described above. SAN is also useful (and common), but requires hardware configuration to make work. That complicates the role of automation. Part of the problem is the inconsistent remote configurability of fiber switches, which may be mitigated somewhat with iSCSI. However, NAS is quickly becoming the preferred storage mechanism in large data centers. (Examples: NetApp FAS, Adaptec Snap Server)
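The driver analogy above might look something like the following Python sketch. The `ResponseDriver` interface, the two drivers, and the `Responder` class are hypothetical names invented for illustration. Real drivers would talk to DRAC, iLO, an OS agent and so on, whereas these only return a description of what they would do.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class ResponseDriver(ABC):
    """Hypothetical common interface that every device-specific driver implements."""

    @abstractmethod
    def apply(self, target: str, action: str) -> str:
        """Translate a high level action into a device-specific operation."""


class PowerDriver(ResponseDriver):
    """Stand-in for a power controller or MPDU driver."""

    def apply(self, target: str, action: str) -> str:
        if action not in {"power_on", "power_off"}:
            raise ValueError(f"unsupported power action: {action}")
        return f"power: {action} {target}"


class OsDriver(ResponseDriver):
    """Stand-in for an operating system driver."""

    def apply(self, target: str, action: str) -> str:
        if action != "graceful_shutdown":
            raise ValueError(f"unsupported OS action: {action}")
        return f"os: {action} {target}"


class Responder:
    """Turns one high level decision into ordered, driver-specific steps."""

    def __init__(self, drivers: Dict[str, ResponseDriver]) -> None:
        self.drivers = drivers

    def retire_server(self, target: str) -> List[str]:
        # Shut the software payload down gently before cutting power.
        return [
            self.drivers["os"].apply(target, "graceful_shutdown"),
            self.drivers["power"].apply(target, "power_off"),
        ]


if __name__ == "__main__":
    responder = Responder({"os": OsDriver(), "power": PowerDriver()})
    for step in responder.retire_server("lab-01"):
        print(step)
```

The analysis engine only ever asks for "retire this server". Which controller or OS sits behind the request stays hidden in the drivers, much as a device driver hides disk geometry from the operating system.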
Over time, I see the industry adding more and more "drivers" to manage more and more data center (and perhaps desktop) resources. Imagine a world in which each software and/or hardware vendor produced standard SLA drivers for each individual component that makes up your data center environment. Every switch, disk and server; every service, container and OS; even every light bulb and air conditioner are connected to a single service level policy engine in which business policy (including cost of operations) drives automated decisions about their use and disuse.
It's not here yet, but you won't have to wait long...
I will use the label "respond" to tag posts related to response interfaces.
Wednesday, April 11, 2007
Complexity and the Data Center
I just finished rereading a science book that has been tremendously influential on how I now think of software development, data center management and how people interact in general. Complexity: The Emerging Science at the Edge of Order and Chaos, by M. Mitchell Waldrop, was originally published in 1992, but remains today the quintessential popular tome on the science of complex systems. (Hint: READ THIS BOOK!)
John Holland (as told in Waldrop's history) defined complex systems as having the following traits:
- Each complex system is a network of many "agents" acting in parallel
- Each complex system has many levels of organization, with agents at any one level serving as the building blocks for agents at a higher level
- Complex systems are constantly revising and rearranging their building blocks as they gain experience
- All complex adaptive systems anticipate the future (though this anticipation is usually mechanical and not conscious)
- Complex adaptive systems have many niches, each of which can be exploited by an agent adapted to fill that niche
Now, I don't know about you, but this sounds like enterprise computing to me. It could be servers, network components, software service networks, supply chain systems, the entire data center, the entire IT operations and organization, etc. What we are all building here is self organizing...we may think we have control, but we are all acting as agents in response to the actions and conditions imposed by all those other agents out there.
A good point about viewing IT as a complex system can be found in Johna Till Johnson's Networld article, "Complexity, crisis and corporate nets". Johna's article articulates a basic concept that I am still struggling to verbalize regarding the current and future evolution of data centers. We are all working hard to adapt to our environments by building architectures, organizations and processes that are resistant to failure. Unfortunately, the entire "ecosystem" is bound to fail from time to time. And there is no way to predict how or when. The best you can do is prepare for the worst.
One of the key reasons that I find Service Level Automation so interesting is that it provides a key "gene" to the increasingly complex IT landscape: the ability to "evolve" and "heal" at the physical infrastructure level. Combine this with good, resilient software architectures (e.g. SOA and BPM) and solid feedback loops (e.g. BAM, SNMP, JMX, etc.) and your job as the human "DNA" gets easier. And, as the dynamic and automated nature of these systems gets more sophisticated, our IT environments get more and more self organizing, learning new ways to optimize themselves (often with human help) even as the environment they are adapting to constantly changes.
In the end, I like to think that no matter how many boneheaded decisions corporate IT makes, no matter how many lousy standards or products are introduced to the "ecosystem", the entire system will adjust and continually attempt to correct for our weaknesses. In the end, despite the rise and fall of individual agents (companies, technologies, people, etc.), the system will continually work to serve us better...at least until that unpredictable catastrophic failure tears it all down and we start fresh.
Thursday, March 22, 2007
Service Level Automation Deconstructed: Introduction
Service Level Automation starts with three simple premises:
* The factors contributing to software service quality can be measured electronically.
* Runtime targets indicating high quality of service can be defined for those measurements.
* Systems involved in delivering software functionality can be manipulated to keep those measurements within the runtime targets.
I think the support for each of these premises should be explored more deeply, so I plan to begin a little survey of the technologies and academics over the next few weeks. The idea is to get a good sense of what standards/technologies/concepts/etc. can be used to meet the requirements of each premise. I also hope to discuss how a system smart enough to take advantage of them(*) can save a large datacenter both in terms of direct costs, as well as in losses due to service level failures.
Why Service Level Automation? I wrote about this earlier. However, as a quick reminder, think of service level automation as meeting this objective:
Delivering the quantity and quality of service flow required by the business using the minimum resources required to do so.
I've been quite busy both at work and at home, so I'm hoping to use this exercise as a way to increase my posting frequency. Stay tuned for more.


