Showing posts with label service levels. Show all posts

Wednesday, February 27, 2008

Enterprise Architecture, Business Continuity and Integrating the Cloud

(Update: Throughout the original version of this post, I had misspelled Mr. Vambenepe's name. This is now corrected.)

William Vambenepe, a product architect at Oracle focusing on enterprise management of applications and middleware, pointed me to a blog by David Linthicum on IntelligentEnterprise that makes the case for why enterprise architects must plan for SaaS. In a very high level, but well reasoned post, Linthicum highlights why SaaS systems should be considered a part of enterprise architectures, not tangential to them.

As Vambenepe points out, perhaps the most interesting observation from Linthicum is the following:

Third, get in the mindset of SaaS-delivered systems being enterprise applications, knowing they have to be managed as such. In many instances, enterprise architects are in a state of denial when it comes to SaaS, despite the fact that these SaaS-delivered systems are becoming mission-critical. If you don't believe that, just see what happens if Salesforce.com has an outage.

I don't want to simply repeat Vambenepe's excellent analysis, and I absolutely agree with him. So let me just add something about SLAuto.

Take a look at Vambenepe's immediate response:
I very much agree with this view and the resulting requirements for us vendors of IT management tools.
Now add the comments from Microsoft's Gabriel Morgan that I discussed a couple of weeks ago.
Take for example Microsoft Word. Product Features such as Import/Export, Mail Merge, Rich Editing, HTML support, Charts and Graphs and Templates are the types of features that Customer 1.0 values most in a product. SaaS Products are much different because Customer 2.0 demands it. Not only must a product include traditional product features, it must also include operational features such as Configure Service, Manage Service SLA, Manage Add-On Features, Monitor Service Usage Statistics, Self-Service Incident Resolution as well.
Gabriel's point boiled down to the following equation:
Service Offering = (Product Features) + (Operational Features)
which I find to be entirely in agreement with Linthicum and Vambenepe.

As I am wont to do, let me push "Operational Features" as far as I think they can go.

In the end, what customers want from any service--software, infrastructure or otherwise--is control over the balance of quality, cost and time-to-market. Quality is measured through specific metrics, typically called service level metrics. Service level agreements (SLAs) are commitments to maintain service level metrics within commonly agreed boundaries and rules. In the end, all of these "operational features" are about allowing the end user to either

  1. define the service level metrics and/or their boundaries (e.g. define the SLA), or
  2. define how the system should respond if a metric fails to meet the SLA.

Item "2" is SLAuto.
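To make the two items concrete, here is a minimal Python sketch. The class names (`ServiceLevelBound`, `SLAutoRule`), the `response_time_ms` metric and the `add_capacity` response are all hypothetical, invented for illustration; nothing here comes from a real SLAuto product.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class ServiceLevelBound:
    """Item 1: an agreed boundary for one service level metric."""
    metric: str
    minimum: Optional[float] = None
    maximum: Optional[float] = None

    def is_met(self, value: float) -> bool:
        if self.minimum is not None and value < self.minimum:
            return False
        if self.maximum is not None and value > self.maximum:
            return False
        return True


@dataclass(frozen=True)
class SLAutoRule:
    """Item 2: what the system should do when the bound is not met."""
    bound: ServiceLevelBound
    respond: Callable[[str, float], str]

    def evaluate(self, value: float) -> Optional[str]:
        if self.bound.is_met(value):
            return None  # within the SLA, nothing to do
        return self.respond(self.bound.metric, value)


def add_capacity(metric: str, value: float) -> str:
    # Hypothetical response; a real system would call a provisioning API here.
    return f"add capacity: {metric} measured {value}"


rule = SLAutoRule(ServiceLevelBound("response_time_ms", maximum=500.0), add_capacity)
print(rule.evaluate(320.0))  # inside the bound
print(rule.evaluate(910.0))  # outside the bound, so the response runs
```

The point of keeping the bound and the response as separate objects is that the first can be owned by whoever negotiates the SLA, while the second is the part that gets automated.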

I would argue that what you don't want is a closed loop SLAuto offering from any of your vendors. In fact, I propose right here, right now, a standard (and, I am sure Simon Wardley would argue, open source) protocol or set of protocols for the following:
  1. Defining service level metrics (probably already exists?)
  2. Defining SLA bounds and rules (may also exist?)
  3. Defining alerts or complex events that indicate that an SLA was violated

Vendors could then use these protocols to build Operational Features that support a distributed SLAuto fabric, where the ultimate control over what to do in severe SLA violations can be controlled and managed outside of any individual provider's infrastructure, preferably at a site of the customer's choosing. This "customer advocate" SLAuto system would then coordinate with all of the customer's other business systems' individual SLAuto to become the automated enforcer of business continuity. In the end, that is the most fundamental role of IT, whether it is distributed or centralized, in any modern, information driven business.
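Since no such protocol exists, the following Python sketch only illustrates the shape the idea could take: a violation event that any provider can serialize, and a customer-side coordinator that keeps the decision on severe violations for itself. The field names, the JSON encoding, the two severity values and the `CustomerAdvocate` class are all assumptions made for this example.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class SLAViolationEvent:
    provider: str    # vendor or internal system that raised the event
    metric: str      # which service level metric was violated
    observed: float  # the measured value
    bound: float     # the agreed boundary that was crossed
    severity: str    # "minor" or "severe" in this sketch


def encode(event: SLAViolationEvent) -> str:
    """Serialize an event so it can travel between providers and the customer."""
    return json.dumps(asdict(event), sort_keys=True)


def decode(payload: str) -> SLAViolationEvent:
    return SLAViolationEvent(**json.loads(payload))


class CustomerAdvocate:
    """Customer-side coordinator: providers keep minor events, it decides severe ones."""

    def __init__(self) -> None:
        self.escalated: List[SLAViolationEvent] = []

    def receive(self, payload: str) -> str:
        event = decode(payload)
        if event.severity == "severe":
            self.escalated.append(event)
            return "customer-policy"
        return "provider-policy"


advocate = CustomerAdvocate()
outage = SLAViolationEvent("saas-crm", "availability_pct", 97.2, 99.9, "severe")
print(advocate.receive(encode(outage)))
```

Because the event is plain data, the coordinator can run at a site of the customer's choosing and accept events from any provider that speaks the same format.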

"Nice, James," you say. "Very pretty 'pie-in-the-sky' stuff, but none of it exists today. So what are we supposed to do now?"

Implement SLAuto internally in your own data centers with your existing systems, that's what. Then integrate SLAuto for SaaS as you come to understand the Operational Feature APIs from your vendors, and as those vendors, your SLAuto vendor and/or your systems talent develop interfaces into your own SLAuto infrastructure.

Evolve towards nirvana, don't try to reach it by taking tabs of vendor acid.

If you want more advice on how to do all of this, drop me a line (james dot urquhart at cassatt dot com) or comment below.

Tuesday, January 29, 2008

One Step To Prepare For Cloud Computing

Some of you may be wondering why I am making such a big stink about software architecture on a blog about service level automation (SLAuto). Well, as Todd Biske points out, "the relationships (and potentially collisions) between the worlds of enterprise system management, business process management, web service management, business activity monitoring, and business intelligence" are easier to resolve if the appropriate access to metrics is provided for a software service. For SLAuto, this means the more feedback you can provide from the service, process, data and infrastructure levels of your software architecture, the easier it is to automate service level compliance.

Let's look at a few examples for each level:

  • Service/Application: From the end user's perspective, this is what service levels are all about. Key metrics such as transaction rates (how many orders/hour, etc.), response times, error rates, and availability are what the end users of a service (e.g. consumers, business stakeholders, etc.) really care about.
  • Business Process: Business process metrics can warn the SLAuto environment about cross-service issues, business rule violations or other extraordinary conditions in the process cycle that would warrant capacity changes at the BPM or service levels.
  • Data Storage/Management: Primarily, this layer can inform the SLAuto system about storage needs and storage provisioning, which in turn is critical to automated deployment of applications into a dynamic environment.
  • Infrastructure: This is the most common form of metric used to make SLAuto decisions today. Such metrics as CPU utilization, memory utilization and I/O rates are commonly used in both virtualized and non-virtualized automated environments.

As noted, digital measurement of these data points can feed an SLAuto policy engine to trigger capacity adjustment, failure recovery or other applicable actions as necessary to remain within defined service thresholds. While most of the technology required to support SLAuto is available, the truth is that the monitoring/metrics side of things is the most uncharted territory. As an action item, I ask all of you to take Todd's words of wisdom into account, and design not only for functionality, but also manageability. This will aid you greatly in the quest to build fluid systems that can best take advantage of utility infrastructure today.
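As a rough illustration of metrics from all four levels feeding one policy engine, consider the Python sketch below. The metric names, limits and action labels are made up for the example, and a real policy engine would do far more than compare a reading to a limit.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    level: str    # "service", "process", "data" or "infrastructure"
    metric: str   # name of the reading this threshold applies to
    limit: float  # the reading must stay at or below this value
    action: str   # what to trigger when the limit is exceeded


def decide(readings: Dict[str, float], thresholds: List[Threshold]) -> List[str]:
    """Return one action per threshold whose reading exceeds its limit."""
    actions = []
    for t in thresholds:
        value = readings.get(t.metric)
        if value is not None and value > t.limit:
            actions.append(f"{t.level}:{t.action}")
    return actions


# Hypothetical thresholds, one per level described above.
THRESHOLDS = [
    Threshold("service", "response_time_ms", 500, "add-app-instance"),
    Threshold("process", "stalled_orders", 10, "alert-process-owner"),
    Threshold("data", "storage_used_pct", 85, "provision-storage"),
    Threshold("infrastructure", "cpu_utilization_pct", 80, "add-server"),
]

readings = {"response_time_ms": 850, "cpu_utilization_pct": 55, "storage_used_pct": 92}
print(decide(readings, THRESHOLDS))
```

Note that a metric the software never reports (here `stalled_orders`) simply cannot trigger anything, which is the manageability argument in miniature.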

Tuesday, November 06, 2007

Beating the Utility Computing Lockdown, Part 2

Well, not long after I posted part 1 of this series, Bert noted that he agreed with my assessment of lock-in, then proceeded to note how his (competitive to my employer's) grid platform was the answer.

Now, Bert is just having fun cross promoting on a blog with ties to a competitor, but I think it's only fair to note that no one has a platform that avoids vendor lock-in in utility computing today. The best that someone like 3TERA (or even Cassatt) can do is give you some leverage between the organizations that are utilizing their platform; however, to get the portability he speaks of, you have to lock your servers (and possibly load balancers, storage, etc.) into that platform. (Besides, as I understand it, 3TERA is really only portable at the "data center" level, not the individual server level. I suppose you could define a bunch of really small "data centers" for each application component, but in a SOA world, that just seems cumbersome to me.)

Again, what is needed is a truly open, portable, ubiquitous standard for defining virtual "components" and their operation level configurations that can be ported and run between a wide variety of virtualization, hardware and automation platforms. (Bert, I've been working on Cassatt--are you willing to push 3TERA to submit, cooperate on and/or agree to such a standard in the near future?) As I said once before, I believe the file system is the perfect place to start, as you can always PXE boot a properly defined image on any compatible physical or virtual machine, regardless of the vendor. (This is true for every platform except for Windows--c'mon Redmond, get with the program!) However, I think the community will have the final say here, and the Open Virtual Format is a hell of a start. (It still lacks any tracking of operation level configurations, such as "safe" CPU and memory utilization thresholds, SNMP traps to monitor for heartbeats, etc.)

Unfortunately, those standards aren't baked yet. So, here's what you can do today to avoid vendor lock-in with a capacity provider tomorrow. Begin with a utility computing platform that you can use in your existing environment today. Ideally, that platform:

  1. Does not require you to modify the execution stack of your application and server images, e.g.
    • no agentry of any kind that isn't already baked into the OS,
    • no requirement to run on virtualization if that isn't appropriate or cost effective,
  2. Uses a server/application/whatever imaging format that is open enough to "uncapture" or translate to a different format by hand if necessary. (Again, I like our approach of just capturing a sample server file system and "generalizing" it for replication as needed. It's reversible, if you know your OS well.)
  3. Is supported by a community or business that is committed to supporting open standards wherever appropriate and will provide a transition path from any proprietary approach to the open approach when it is available.

I used to be concerned that customers would ask why they should convert their own infrastructure into a utility (if it was their goal to use utility computing technology to reduce their infrastructure footprint). I now feel comfortable that the answer is simply because there is no safe alternative for large enterprises at this time. Leaving aside the issue of security (e.g. can you trust your most sensitive data to S3?) and the fact that there is little or no automation available to actually reduce your cost of operations in such an environment, there are many risks to consider with respect to how deeply you are willing to commit to a nascent marketplace today.

I encourage all of you to get started with the basic concepts of utility computing. I want to talk next about ways to cost justify this activity with your business, and talk a little about the relationship between utility computing and data center efficiency.

Monday, October 15, 2007

Is your software ready for utility computing?

I've been seeing more thoughts on the effect of utility computing on software architectures lately, and one very well stated argument comes from Alistair Croll, Vice President of Product Management and co-founder of Coradiant, a performance tool company. Though clearly self-serving, his message is simple: if you are going to pay by the cycle--or even just share cycles between applications--you'd better make sure your software takes as few cycles as possible to do its job well.

This is one of the unforeseen effects of "paying for what you use", and I have to say it's an effect that should scare the heck out of most enterprise IT departments. Although I would argue part of that fear should come from the exposure of lousy coding in most custom applications, the worst part is the lack of control most organizations will have over the lousy coding in the packaged applications they purchased and installed. Suddenly, algorithms matter again in all phases of software development, not just computing intensive steps.

The worst offender here will probably be the user interface components: Java SWING, AJAX and even browser applications themselves. To the extent that these are hosted from centralized computing resources (and even most desktops fall into this category in some visionaries' eyes), the incredible amount of constant cycling, polling and unnecessary redrawing will be painfully obvious in the next 10 years or so.

I have always been a strong proponent for not over-engineering applications. If you can meet the business's ongoing service levels with an architecture that cost "just enough" to implement, you were golden in my book. However, utility computing changes the mathematics here significantly, and that key phrase of "meet the business's ongoing service levels" comes much more into play. Ongoing service levels now include optimizations to the cost of executing the software itself; something that could be masked in an underutilized, siloed-stack world.
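The change in mathematics is easy to see with a little arithmetic. The Python sketch below assumes a purely hypothetical price of $0.10 per CPU-hour and an application serving ten million requests a month; the numbers are illustrative, not taken from any provider's price list.

```python
def monthly_compute_cost(requests_per_month: int,
                         cpu_seconds_per_request: float,
                         price_per_cpu_hour: float) -> float:
    """Cost of the cycles an application burns when billed by usage."""
    cpu_hours = requests_per_month * cpu_seconds_per_request / 3600.0
    return cpu_hours * price_per_cpu_hour


# Hypothetical workload and price, chosen only to keep the arithmetic readable.
REQUESTS = 10_000_000
PRICE = 0.10  # dollars per CPU-hour (assumed)

baseline = monthly_compute_cost(REQUESTS, 0.36, PRICE)   # sloppy implementation
optimized = monthly_compute_cost(REQUESTS, 0.18, PRICE)  # half the cycles per request

print(f"baseline:  ${baseline:.2f} per month")
print(f"optimized: ${optimized:.2f} per month")
print(f"saved:     ${baseline - optimized:.2f} per month")
```

In a siloed stack the server costs the same whether the code is efficient or not; under usage billing, halving the cycles per request halves this line item.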

The performance/optimization guys must be loving this, because they now have a product that should see immediate increase in demand. If you are building a new business application today, you had better be:

  1. Building for a service-based, highly distributed, utility infrastructure world, and
  2. Making sure your software is as cheap to run as possible.

Number 2 above itself implies a few key things. Your software had better be:

  • as standards based as possible--making it possible for any computing provider to successfully deploy, integrate and monitor your application;
  • as simple to install, migrate and upgrade remotely as possible--to allow for cheap deployment into a competitive computing market;
  • as efficient to execute as possible--each function should take as few cycles as possible to do its job.

The cost dynamics will be interesting to note, especially their effects on the agile processes, SOA, and ITIL movements. I will keep a careful tab on this, and will share my ongoing thoughts in future posts.

Thursday, October 04, 2007

Links - 10/4/2007

A Classic Introduction to SOA (DanNorth.net): Thanks to Jack van Hoof, I was led to this brilliant article on modelling SOAs in business terms. (Check out the PDF, the graphics and layout make it an even more fun read.) Rather than spend a bunch of words "me too"-ing Jack and Dan, let me just say that this is exactly the technique I have always used to design service oriented architectures, ever since my days in the mid-90s designing early service oriented architectures at Forte Software.

Classic examples of where this led to better design were the frequent arguments that I would have with customers and partners about where to put the "hire" method in a distributed architecture. Most of the "object oriented architects" I worked with would immediately jump to the conclusion that the "hire" method should be on the Employee class. However, if you sat down and modelled the hiring process, the employee never hired himself or herself. What would happen is the hiring manager would send the information about the employee to the HR office, who would then receive more information, create a new employee file and declare the new employment to the tax authorities. Thus, the "hire" method needed to be on the HR service, with the call coming from the application (or service) initiating employment (i.e. the hiring manager in software form), passing the employee object (or a representation of that object) for processing.

Without exception, that approach led to better architectures than trying to map every method that had any relation to a class of objects directly on the class itself.
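A small Python sketch of that design follows. The `Employee`, `HRService` and `TaxAuthorityGateway` names, and the employee ID format, are hypothetical; the only point is where the `hire` method lives.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Employee:
    """Plain data about a person. Note that there is no hire() method here."""
    name: str
    role: str


@dataclass
class TaxAuthorityGateway:
    """Stand-in for whatever system declares new employment to the tax authorities."""
    declared: List[str] = field(default_factory=list)

    def declare_employment(self, employee: Employee) -> None:
        self.declared.append(employee.name)


@dataclass
class HRService:
    """The service that actually performs the hiring process."""
    tax_gateway: TaxAuthorityGateway
    files: Dict[str, Employee] = field(default_factory=dict)

    def hire(self, candidate: Employee) -> str:
        employee_id = f"E{len(self.files) + 1:04d}"     # create the employee file
        self.files[employee_id] = candidate
        self.tax_gateway.declare_employment(candidate)  # declare the new employment
        return employee_id


# The "hiring manager in software form" passes the employee object to HR.
hr = HRService(TaxAuthorityGateway())
new_id = hr.hire(Employee("Pat Example", "QA engineer"))
print(new_id, hr.files[new_id].name)
```

The caller hands over a representation of the employee and the HR service owns the process, which mirrors how the hiring manager and the HR office interact in the modelled business.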


Twilight of the CIO (RoughType: Nicholas Carr): Man, Nick is in rare form now that he is back from his blogging hiatus. His thesis here is that, with the advent of technologies that can be more easily managed outside of IT, and with IT departments doing less R&D and more shepherding outsourced and SaaS infrastructure, the need for the CIO role is diminishing--which I react to with mixed feelings.

On the one hand, there is no doubt that small and mid-sized non-high-tech businesses are going to have less need for a voice representing technical infrastructure issues on the executive board. There will still need to be management (as the first comment to Nick's post alludes to), but they will be a lot like the facilities guy in most businesses today--simply shepherding the services hired by the business.

(Perhaps the "centralized/decentralized pendulum" is shifting wildly, with decentralization this time actually resulting in business systems residing outside of IT entirely?)

On the other hand, I'm not seeing the "simplified" nature of technology happening yet in most mid- to large-sized businesses. Cassatt sells utility computing platform software--basically an operating system for your data center. Resources are pooled and distributed as needed to meet the business's needs (as defined in SLAs assigned to software). We make it easy to cut tremendous amounts of waste, rigidity and manual labor out of the basic data centers. CIOs love this vision, and drive technical changes in the customers we work with. However, implementations still take a long time. Why? Because most existing infrastructures are about 10 years behind the desired state of the art the IT department is trying to achieve. Also because it's not just a technical change, it's a cultural change. (By the way, so is SaaS.)

I fear that the lack of technical leadership on the executive team will actually hinder adoption of these critical new technologies and other technologies only being thought of now, or in the future. What I think ultimately needs to happen is that high level technical critical thinking skills need to be taught to the rest of the line-of-business executives, so that interesting new technologies will drive interesting new business opportunities in the years to come.

This goes to Marc Andreessen's recent post on how to prepare for a great career. Don't rest on your technical skills, or your business skills, but work hard to develop both. (Marc is another blogger who has been on a streak lately...read his career series and learn from someone who knows a little about success.)

Monday, September 24, 2007

Service-Oriented Everything...

Agility Principle: Service-Oriented Network Architecture (eBiz: Mark Milinkovich, Director, Service-Oriented Network Architecture, Cisco Systems): Cisco is touting the network as the center of the universe again, but this article is pretty close to the truth about software and infrastructure architectures we are moving to. Most importantly, Mark points out that there is a three layer stack that actually binds applications to infrastructure:

  • Applications layer - includes all software used for business purposes (e.g., enterprise resource planning) or collaboration (e.g., conferencing). As Web-based applications rely on the Extensible Markup Language (XML) schema and become tightly interwoven with routed messages, they become capable of supporting greater collaboration and more effective communications across an integrated networked environment.

  • Integrated network services layer - optimizes communications between applications and services by taking advantage of distributed network functions such as continuous data protection, multiprotocol message routing, embedded QoS, I/O virtualization, server load balancing, SSL VPN, identity, location and IPv6-based services. Consider how security can be enhanced with the interactive services layer. These intelligence network-centric services can be used by the application layer through either transparent or exposed interfaces presented by the network.

  • Network systems layer - supports a wide range of places in the network such as branch, campus and data center with a broad suite of collaborative connectivity functions, including peer-to-peer, client-to-server and storage-to-storage connectivity. Building on this resilient and secure platform provides an enterprise with the infrastructure on which services and applications can reliably and predictably ride.
Of course, he's missing a key layer:
Physical infrastructure layer - represents the body of physical (and possibly virtual) infrastructure components that support the applications, network services and network systems, not to mention the storage environment, management environment and, yes, Service Level Automation (SLAuto) environment.
It is important to note that, while the network may be becoming a computer in its own right, it still requires physical infrastructure to run. And all of these various application, integrated network, and network systems services that Mark mentions not only depend on this infrastructure, but can actually be loosely coupled to the physical layer in a way that augments the agility of all four layers.

For example, imagine a world where your software provisioning is completely decoupled from your hardware provisioning. In other words, adding an application to your production data center doesn't require you to predict exactly what load the application is going to add to the network, server or storage capacity. Rather, you simply load the application into the SLAuto engine, let traffic start to arrive, measure the stress on existing capacity, and order additional hardware as required. Or, better yet, order hardware at the end of a quarter based on trend analysis from the previous quarter. No need for the software teams and the hardware teams to even talk to each other.
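The "trend analysis" part can be as simple as fitting a straight line to observed peak usage and extending it. The Python sketch below uses ordinary least squares on made-up weekly figures; the data, the function name and the choice of a linear trend are all assumptions for illustration, and real capacity planning would want more history and some safety margin.

```python
from typing import Sequence


def linear_forecast(history: Sequence[float], periods_ahead: int) -> float:
    """Fit a least-squares line to evenly spaced observations and extrapolate it."""
    n = len(history)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
             / sum((x - mean_x) ** 2 for x in range(n)))
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + periods_ahead)


# Hypothetical peak server counts for the last four weeks of the quarter.
weekly_peak_servers = [40, 44, 48, 52]
owned_servers = 60  # assumed current inventory

forecast = linear_forecast(weekly_peak_servers, periods_ahead=13)  # one quarter out
to_order = max(0, round(forecast) - owned_servers)
print(f"forecast peak: {forecast:.0f} servers, order {to_order} more")
```

The input here comes entirely from measured load, which is what lets the hardware order happen without the software team predicting anything up front.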

I will admit that it is unlikely that many IT departments will ever get to that "pie-in-the-sky" scenario--for some the risk of not guessing high enough on capacity overwhelms the cost of predicting short to medium term load. However, SLAuto allows you to get past the problems of siloed systems, such as "hitting the ceiling" in allocated capacity. Even if the SLAuto environment runs out of excess physical capacity, it can borrow the capacity it needs for high priority systems from lower priority applications.
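Borrowing capacity from lower priority applications might look something like the following Python sketch. The `App` fields, the idea of a per-application minimum, and the rule of taking from the lowest priority first are assumptions chosen for the example, not a description of how any particular SLAuto product allocates servers.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class App:
    name: str
    priority: int     # higher number means more important
    servers: int      # servers currently allocated
    min_servers: int  # floor below which the app must not be shrunk


def borrow_capacity(apps: List[App], needy: str, wanted: int, free_pool: int) -> Dict[str, int]:
    """Give `needy` up to `wanted` servers: free pool first, then lower-priority apps.

    Returns a mapping of source name to number of servers taken from it.
    """
    target = next(a for a in apps if a.name == needy)
    taken: Dict[str, int] = {}

    from_pool = min(wanted, free_pool)
    if from_pool:
        taken["free-pool"] = from_pool
    remaining = wanted - from_pool

    # Only apps with strictly lower priority donate, lowest priority first.
    donors = sorted((a for a in apps if a.priority < target.priority),
                    key=lambda a: a.priority)
    for donor in donors:
        if remaining == 0:
            break
        give = min(donor.servers - donor.min_servers, remaining)
        if give > 0:
            donor.servers -= give
            taken[donor.name] = give
            remaining -= give

    target.servers += wanted - remaining
    return taken


apps = [App("orders", 3, 4, 2), App("reports", 1, 3, 1), App("wiki", 2, 2, 1)]
print(borrow_capacity(apps, "orders", wanted=4, free_pool=1))
```

In a siloed world the "orders" application would simply hit its ceiling; here it grows at the expense of reporting and the wiki, both of which keep their minimum footprint.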

The best part is that, since the SLAuto environment tracks every action it takes, there are easy ways to get reports showing everything from capacity utilization trend analysis to cost of infrastructure for a given application.

Back to Mark's article, though. It is good to see some consensus in the industry on where we are moving, even if each vendor is trying to spin it as if they are the heart of the new platform. In the end though, if the network is indeed the computer, the network and the data center will need operating systems. Mark has entire sections dedicated to designing for application awareness (this is where most data center automation technologies fall woefully short), and designing for virtualization (including all aspects of infrastructure virtualization). He is right on the money here, but there needs to be something that coordinates the utilization of all of these virtualized resources. This is where SLAuto comes in.

Most importantly, don't forget to integrate SLAuto into all four layers. Make sure that each "high" layer talks to the layers below it in a way that decouples the higher layer from the lower layer. Make sure that each lower layer uses that information to determine what adjustments it needs to make (including, possibly, to send the information to an even lower layer). And make sure your physical infrastructure layer is supported by an automation environment that can adjust capacity usage quickly and painlessly as applications, services and networks demand.

As you prepare your service oriented architecture of the future, don't forget the operations aspects. We are on the brink of an automated computing world that will change the cost of IT forever. However, it will only work for you if you take all of the components involved in meeting service levels/operation levels into account.

Tuesday, September 04, 2007

An easy way to get started with SLAuto

It's been an interesting week, leading up to the Labor Day weekend, but as of this morning I get to talk more openly about one project that has been taking a great deal of my time. As I have blogged about Service Level Automation ("SLAuto"), it may have dawned on some of you that achieving nirvana here means changing a lot about your current architecture and practices.

For example, decoupling software from hardware is easy to say, but requires significant planning and execution to implement (though this can be simplified somewhat with the right platform). Building the correct monitors, policies and interfaces is also time intensive work that requires the correct platform for success. However, as noted before, the biggest barriers to implementing SLAuto and utility computing are cultural.

There is an opportunity out there right now to introduce SLAuto without all of the trappings of utility computing, especially the difficult decoupling of software from hardware. It is an opportunity that the Silicon Valley is going ga-ga over, and it is a real problem with real dollar costs for every data center on the planet.

The opportunity is energy consumption management, aka the "green data center".

Rather than pitch Cassatt's solution directly, I prefer to talk about the technical opportunity as a whole. So let's evaluate what is going on in the "GDC" space these days. As I see it, there are three basic technical approaches to "green" right now:

  1. More efficient equipment, e.g. more power efficient chips, server architectures, power distribution systems, etc.
  2. More efficient cooling, e.g. hot/cold aisles, liquid cooling, outside air systems, etc.
  3. Consolidation, e.g. virtualization, mainframes, etc.

Still, there is something obvious missing here: no matter which of these technologies you consider, not one of them is actually going to turn off unused capacity. In other words, while everyone is working to build a better light bulb or to design your lighting so you need fewer bulbs, no one is turning off the lights when no one is in the room.

That's where SLAuto comes in. I contend that there are huge tracts of computing in any large enterprise where compute capacity runs idle for extended periods. Desktop systems are certainly one of the biggest offenders, as are grid computing environments that are not pushed to maximum capacity at all times. However, possibly the biggest offender in any organization that does in-house development, extensive packaged system customization or business system integration is the dev/test environment.

Imagine such a lab where capacity that will be unused each evening/weekend, or for all but two weeks of a typical development cycle, or at all times except when testing a patch to a three year old rev of product, was shut down until needed. Turned off. Non-operational. Idle, but not idling.

Of course, most lab administrators probably feel extremely uncomfortable with this proposition. How are you going to do this without affecting developer/QA productivity? How do you know it's OK to turn off a system? Why would my engineers even consider allowing their systems to be managed this way?

SLAuto addresses these concerns by simply applying intelligence to power management. A policy-based approach means a server can be scheduled for shutdown each evening (say, at 7PM), but be evaluated before shutdown against a set of policies that determine whether it is actually OK to complete the shut down.

Some example policies might be:

  • Are certain processes running that indicate a development/build/test task is still underway?
  • Is a specific user account logged in to the system right now?
  • Has disk activity been extremely low for the last four hours?
  • Did the owner of the server or one of his/her designated colleagues "opt-out" of the scheduled shutdown for that evening?

Once these policies are evaluated, we can see if the server meets the criteria to be shut down as requested. If not, keep it running. Such a system needs to also provide interfaces for both the data center administrators and the individual server owners/users to control the power state of their systems at all times, set policies and monitor power activities for managed servers.
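The evaluation step could be sketched in Python as below. Everything specific here is an assumption for illustration: the process names, the protected account, the disk activity cutoff, and the way each policy is expressed as a reason plus a predicate that blocks the shutdown when it returns true.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple


@dataclass
class ServerState:
    name: str
    running_processes: Set[str] = field(default_factory=set)
    logged_in_users: Set[str] = field(default_factory=set)
    disk_ops_last_4h: int = 0
    owner_opted_out: bool = False


# Hypothetical site-specific settings.
BUILD_PROCESSES = {"make", "javac", "junit"}
PROTECTED_USERS = {"release-eng"}
LOW_DISK_ACTIVITY = 100  # fewer operations than this in four hours counts as idle

# Each policy is (reason to keep the server running, predicate that blocks shutdown).
Policy = Tuple[str, Callable[[ServerState], bool]]
POLICIES: List[Policy] = [
    ("build or test task in progress", lambda s: bool(s.running_processes & BUILD_PROCESSES)),
    ("protected user logged in", lambda s: bool(s.logged_in_users & PROTECTED_USERS)),
    ("disk still active", lambda s: s.disk_ops_last_4h >= LOW_DISK_ACTIVITY),
    ("owner opted out for tonight", lambda s: s.owner_opted_out),
]


def shutdown_blockers(server: ServerState) -> List[str]:
    """Reasons the scheduled shutdown should be skipped; empty means go ahead."""
    return [reason for reason, blocks in POLICIES if blocks(server)]


def may_shut_down(server: ServerState) -> bool:
    return not shutdown_blockers(server)


idle = ServerState("lab-07", running_processes={"sshd"}, disk_ops_last_4h=12)
busy = ServerState("lab-08", running_processes={"sshd", "make"}, disk_ops_last_4h=5400)

for server in (idle, busy):
    print(server.name, may_shut_down(server), shutdown_blockers(server))
```

Returning the list of reasons, rather than a bare yes or no, gives administrators and server owners something to look at when they ask why a machine was left running overnight.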

I'll talk more about this in the coming week, but I welcome your input. Would you shut down servers in your lab? Your grid environment? Your production environment? What are your concerns with this approach? What policies come to mind that would be simple and/or difficult to implement?

Saturday, August 11, 2007

Links - 08/11/2007

Man vs machine, or, from SLA to SLAuto (Isabel Wang): Isabel was kind enough to provide her comments on SLAuto, and--no surprise--she gets it. In fact, her analogy at the bottom of this post is a wonderful one, and I hope she's OK if I use it (with appropriate credit, of course :):

"You don't need no SLAuto, you say, because you've got great customer service reps and data center techs? Well... 10 years ago I used to know people who prided themselves on their ability to dish out web space manually. They could charge credit cards and create customer folders faster than anyone else! Then competitors started using auto-provisioning tools and they went out of business. History will repeat itself."

Is the Tap Dry? (CXOtoday, India: Tahirih Gaur): Tahirih describes India's biggest apparent obstacles to utility computing: storage and inefficient management of outsourced IT. (Does anyone else see an irony in that? James McGovern?) She notes that many companies (banks for instance) have a problem with storing sensitive data on disks shared with competitors. She also quotes a Gartner statistic that 80% of all outsourcing deals are renegotiated within 3 years. I've posted on this before, and Nicholas Carr is writing extensively about it, but make no mistake that the move to utility computing is even more of a cultural shift than a technical shift. My employer is betting on the fact that a large number of organizations will not be comfortable outsourcing their utility computing entirely, and want to create a utility within their own infrastructure.

Utility computing's elusive definition (CNet news.com: David Becker): In searching for more coverage on utility computing, I came across this 2003 article covering a panel discussion on the topic at that year's Comdex. My first reaction in reading it is that the more things change, the more they stay the same. All of the issues presented here remain true today. I don't see one element of this article that doesn't ring true today (other than new marketing names for the vendor products).

I also love the proposition by Tony Siress (then Senior Director of Advanced Services for Sun) of transportation as a better analogy of utility computing than electricity:

Siress maintained that transportation is a better analogy, considering how people employ a combination of owned, leased and rented cars along with taxis to meet their changing transit needs. "Taxi cabs are a good example of a fully outsourced piece of infrastructure, and they're the right approach in some situations" he said. "The trick is understanding the mix of approaches that delivers the highest value and the least risk to you."

This actually highlights something that I have trouble remembering sometimes; that utility computing isn't a one-size-fits-all approach. Not every application is appropriate for managed hosting, nor does every one require a private IT utility. Some "trips" (analogous to either transactions or functions?) require multiple "modes of transportation": a little SaaS, a little hosted virtual server capacity, even a few bare metal servers in a closet thrown in for good measure. The challenge for SLAuto is to provide policy across all of these, or at least provide the building blocks to do so.

Functionality (or "service flow") is the electricity in utility computing; hardware and software are just the generators and transformers. The network is the power line, and SLAuto is the demand management system.

Monday, July 30, 2007

SLAuto vs. SLA

A while back, Eric Westerkamp over at eCloudM asked a simple question that got me thinking. Eric wondered aloud (in print?) whether "using the Term Service Level Automation (SLA) causes confusion when presenting the ideas and topics into the business community. I have most often seen SLA refer to Service Level Agreements. While similar in concept, they are very different in implementation."

So, to keep things clear in my blog, I will now use the acronym SLAuto for Service Level Automation, and retain the SLA moniker for Service Level Agreements. I hope this eliminates confusion and allows the market to talk more freely about Service Level Automation.

Speaking of SLA, though, Steve Jones posted a great example of the symbiotic relationship between the customer and the service provider in the SLA equation. To put it in the context of an enterprise data center, you could offer 100% uptime, 0.1-second response time and a 5-minute turnaround time, but it wouldn't be of any value if the customer's application was buggy, they were on a dial-up network and it took them six weeks to get the requirements right for a build-out.

Now, let's look at that in the context of SLAuto. To my eye, the service provider in an SLAuto environment is the infrastructure. The customer is any component or person that accesses or depends on any piece of that infrastructure. Thus, any SOA service can be a service provider in one context, and a customer in another. Even the policy engine(s) that automate the infrastructure can be thought of as a customer in the context of monitoring and management, and a service provider in terms of an interface for other customers to define service level parameters.

Steve's example hints that I could buy the "bestest", fastest, coolest high tech servers, switches and storage on the planet and it wouldn't increase my service levels if I couldn't deliver that infrastructure to the applications that required it quickly, efficiently and (most importantly) reliably. Or, for that matter, if those applications couldn't take advantage of it. So, if you're going to automate, your policy engine should be (you guessed it) quick, efficient and reliable. If it isn't, then your SLAs are limited by your SLAuto capabilities.

Not what you intended, I would think...

Wednesday, July 25, 2007

Why it all boils down to infrastructure management...

Digital Daily made my day today with the story of data center operator 365 Main's declaration of its amazing uptime record, followed hours later by the complete loss of their San Francisco data center for a good hour--a data center populated by a veritable "who's who" of Silicon Valley tech companies. As a result of the SF power outage, a good portion of the Web 2.0 social network world was also down.

What does this have to do with Utility Computing? Well, as Nicholas Carr points out, it has everything to do with the key problem of computing today: not software, not social networks, but infrastructure. 365 Main is (reportedly) one of the best colocation hosting facilities in the world, and despite several layers of power failure backup, including two sets of generators, they didn't get the job done. Worse yet, they were bitten by something they can do little about: old and tired power infrastructure.

When we talk about utility computing architectures and standards, we must remember that, despite infrastructure becoming a commodity, it is still the foundation on which all other computing services are reliant. If we can't provide the basic capabilities to keep servers running, or quickly replace them if they fail, then all of our pretty virtual data centers, software frameworks and business applications are worthless.

For 365 Main, their focus initially will probably be on why all that backup equipment didn't save their butts as it was designed (and purchased) to do. I mean, as much as I would like to make this a story about how great Service Level Automation software would have saved the day, I have to admit that even my employer's software can't make software run on a server with no electricity.

However, and perhaps more importantly, the operations teams of those Silicon Valley companies that found themselves without service for this hour or so must ask themselves some key questions:

  1. Why weren't their systems distributed geographically, so that some instances of every service would remain live even if an entire data center were lost? If the problem is the architecture of the applications, that is a huge red flag that they just aren't investing in good architecture the way they need to. If the problem is the cost/difficulty of replicating geographically, then I believe Service Level Automation/utility computing software can help here by decoupling software images from the underlying hardware (virtual or physical).
  2. Why couldn't they have failed over to a DR site in a matter of minutes? Using the same utility computing software and some basic data replication technologies, any one of these vendors could have kept their software and data images in sync between live and DR instances, and fired up the whole DR site with one command. Heck, they could even have used another live hosting facility/lab to act as the DR site and just repurposed as much equipment as they needed.
  3. How are they going to increase their adaptability to "environmental" events in the future, especially given the likelihood that they won't even own their infrastructure? Again, this is where various utility computing standards become important, including software and configuration portability.
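
The first question above can be turned into a simple audit. What follows is only a minimal sketch, assuming a hypothetical deployment map (service name to the data centers running it); the `services_lost` function, the service names and the site names are all invented for illustration and do not come from any particular product.

```python
from typing import Dict, List


def services_lost(deployments: Dict[str, List[str]], failed_site: str) -> List[str]:
    """Return the services with no live instance left once failed_site goes dark."""
    return sorted(
        service
        for service, sites in deployments.items()
        # A service survives only if it runs in at least one other site.
        if not (set(sites) - {failed_site})
    )


# Hypothetical deployment map: service name -> data centers running an instance.
deployments = {
    "web": ["sf", "nyc"],    # geographically distributed
    "auth": ["sf"],          # single site
    "search": ["sf", "sf"],  # two instances, but both in the same site
}

print(services_lost(deployments, "sf"))   # ['auth', 'search']
print(services_lost(deployments, "nyc"))  # []
```

Note that two instances in the same building count as one site here, which is the situation the outage exposed.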

I recommend reading Om Malik's piece (as referenced by Carr) to get an understanding of why this is more important than ever. According to Malik, hosting everywhere is a house of cards, if only because of the aging power infrastructure that it relies on. Some people are commenting that geographic redundancy is unnecessary when using a top-tier hosting provider, but I think yesterday's events prove otherwise.

What I believe the future will bring is commoditized computing capacity that, in turn, will make the cost of distributing your application across the globe almost the same as running it all in one data center. You may be charged extra for "high availability" service levels, but the cost won't be that significant, with the difference mostly going to the additional monitoring and service provided to guarantee those service levels.

Of course, that will require portability standards...

Thursday, March 22, 2007

Service Level Automation Deconstructed: Introduction

Service Level Automation starts with three simple premises:

* The factors contributing to software service quality can be measured electronically.

* Runtime targets indicating high quality of service can be defined for those measurements.

* Systems involved in delivering software functionality can be manipulated to keep those measurements within the runtime targets.
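
To make the three premises concrete, here is a minimal sketch of one pass through a measure/compare/act loop. Everything in it is an assumption made for illustration: the `Target` class, the `decide` and `control_step` functions, the response-time numbers and the toy node pool are hypothetical and do not describe any specific product.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Target:
    """Premise 2: a runtime target expressed as an acceptable range."""
    low: float
    high: float


def decide(measurement: float, target: Target) -> str:
    """Compare a measurement (premise 1) against its target."""
    if measurement > target.high:
        return "add_capacity"      # service level at risk
    if measurement < target.low:
        return "remove_capacity"   # more resources than needed
    return "no_action"


def control_step(measure: Callable[[], float], target: Target,
                 actions: Dict[str, Callable[[], None]]) -> str:
    """Premise 3: manipulate the system to pull the measurement back in range."""
    decision = decide(measure(), target)
    actions[decision]()
    return decision


# Hypothetical system: a pool of nodes that never shrinks below one.
pool = {"nodes": 2}
actions = {
    "add_capacity": lambda: pool.update(nodes=pool["nodes"] + 1),
    "remove_capacity": lambda: pool.update(nodes=max(1, pool["nodes"] - 1)),
    "no_action": lambda: None,
}

# Hypothetical target: keep response time between 50 ms and 200 ms.
response_time_ms = Target(low=50.0, high=200.0)
print(control_step(lambda: 350.0, response_time_ms, actions), pool["nodes"])
# add_capacity 3
```

A real policy engine would run this step continuously and would need damping so it does not add and remove capacity on every sample.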

I think the support for each of these premises should be explored more deeply, so I plan to begin a little survey of the technologies and academics over the next few weeks. The idea is to get a good sense of what standards/technologies/concepts/etc. can be used to meet the requirements of each premise. I also hope to discuss how a system smart enough to take advantage of them(*) can save a large datacenter money, both in direct costs and in losses due to service level failures.

Why Service Level Automation? I wrote about this earlier. However, as a quick reminder, think of service level automation as meeting this objective:

Delivering the quantity and quality of service flow required by the business using the minimum resources required to do so.

I've been quite busy both at work and at home, so I'm hoping to use this exercise as a way to increase my posting frequency. Stay tuned for more.

Monday, February 26, 2007

When CPU utilization is not enough...

Where have I been, you might ask?

Much to my delight, the last few weeks have been filled with customer activity, ranging from helping with a Service Level Automation-enabled appliance for a major software company, to assisting the financial wing of one of the world's largest manufacturers in experiencing firsthand the benefits of utility computing.

The latter runs an application that is highly dependent on Windows Terminal Services to deliver a client-server UI to thousands of retail outlets world-wide. Uptime is critical to this application, as customers will go elsewhere for financing if this application doesn't confirm credit within minutes of a purchase decision.

Unfortunately, WTS is also a very inefficient consumer of server payload. It is a session-based infrastructure, which means that a user will be attached to a specific server for their desktop access until they either log out or are kicked off. If 15 user sessions share a physical server, there is no way to predict the load on the system. All 15 sessions could be idle, or all could quickly start consuming cycles simultaneously.

This gives me my first really good example of when CPU and/or memory utilization are not good Service Level metrics on their own. Imagine an environment using WTS to support hundreds of users. These users use their Windows sessions to run a variety of tasks, much like any Windows user. Some tasks use a high level of CPU and memory, others very little. Quite often, the session will sit idle for several minutes.

Now, if you create lots of sessions because the CPU is idle, you could end up with problems if they all get active at the same time (say right after lunch). If you stop creating sessions on a server because CPU utilization is high, you may end up with a highly underutilized server when one user's game of Quake wraps up.

That's not to say that CPU or memory utilization aren't an important part of the Service Level "equation". The truth is, there are several metrics that apply to WTS capacity: sessions, CPU utilization, memory utilization, licenses, etc. Since determining Service Level compliance probably involves evaluating the relationships between several of these metrics at once, there will most likely be one or two compound metrics based on mathematical equations combining these "root" metrics in a way that reasonable thresholds can be set.
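
As a rough illustration of what such a compound metric might look like, the sketch below combines three of the root metrics into a single score by letting the most constrained resource dominate. The formula, the `ServerMetrics` fields, the 15-session cap and the 0.8 threshold are all hypothetical choices made for this example, not tried and true WTS metrics.

```python
from dataclasses import dataclass


@dataclass
class ServerMetrics:
    """Root metrics sampled from one terminal server (hypothetical fields)."""
    sessions: int
    max_sessions: int
    cpu_util: float  # fraction between 0.0 and 1.0
    mem_util: float  # fraction between 0.0 and 1.0


def load_score(m: ServerMetrics) -> float:
    """Compound metric: the most constrained resource sets the score."""
    session_util = m.sessions / m.max_sessions
    return max(session_util, m.cpu_util, m.mem_util)


def accepts_new_sessions(m: ServerMetrics, threshold: float = 0.8) -> bool:
    """A single threshold on the compound score replaces per-metric rules."""
    return load_score(m) < threshold


# Idle CPU, but the server already holds its full complement of sessions.
full_but_idle = ServerMetrics(sessions=15, max_sessions=15, cpu_util=0.05, mem_util=0.30)
# Few sessions and modest utilization.
light = ServerMetrics(sessions=5, max_sessions=15, cpu_util=0.20, mem_util=0.30)

print(accepts_new_sessions(full_but_idle))  # False
print(accepts_new_sessions(light))          # True
```

The first case is the "everyone comes back from lunch" risk: CPU alone says the server is free, while the compound score says it is full.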

Another interesting observation is that this is a lot like the Java EE Service Level Automation problem. (Thanks to Luis Cuyun for pointing this out.) While most horizontally scalable application tiers can be scaled up and down as a unit (i.e. "add a node/remove any node"), app servers, hypervisors and (now) WTS all must be monitored as a unit, but managed on a per-server basis (in this case, "add a node/remove this specific node"). This is because the "instances" that each of these software resources hosts are "sticky" to a server (VMotion notwithstanding). You don't want to shut down just any server when capacity is no longer needed; you want to remove only the specific servers with no live sessions remaining on them.
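
The "remove this specific node" rule could be sketched roughly as follows. This is only an illustration under stated assumptions: the `plan_scale_down` function, the server names and the idea of a "drain" list (servers that stop accepting new sessions until their existing ones end) are hypothetical and not taken from any particular policy engine.

```python
from typing import Dict, List, Tuple


def plan_scale_down(session_counts: Dict[str, int],
                    surplus: int) -> Tuple[List[str], List[str]]:
    """Pick `surplus` servers to retire, never killing a live session.

    Returns (remove_now, drain): servers that are already empty, and
    servers that should stop taking new sessions until they empty out.
    """
    # Prefer the least loaded servers; break ties by name for determinism.
    by_load = sorted(session_counts, key=lambda name: (session_counts[name], name))
    candidates = by_load[:surplus]
    remove_now = [s for s in candidates if session_counts[s] == 0]
    drain = [s for s in candidates if session_counts[s] > 0]
    return remove_now, drain


# Hypothetical pool: server name -> live sessions.
sessions = {"wts1": 0, "wts2": 4, "wts3": 12}
print(plan_scale_down(sessions, surplus=2))  # (['wts1'], ['wts2'])
```

The pool is measured as a unit to decide how much surplus exists, but the action is taken per server, which is the distinction drawn above.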

(Speaking of VMotion, one of the things that both Java EE and WTS will require to be really optimizable is the ability to move live services/sessions from one server to another in real-time. Anyone know of a technology addressing either of these?)

The good news for a good Service Level Automation environment (*ahem*) is that if one of these problems (Java EE, virtual servers or WTS) is solved correctly, the same basic technology can be applied to all of them. That's not to say that anyone is doing this for WTS today (to my knowledge, no one is), but I like the idea that the use cases that apply to Java EE SLA also apply to WTS SLA.

I'm hoping to have more to write about this as this pilot continues. In the meantime, anyone with Windows experience is welcome and encouraged to contribute their two cents to this discussion. In particular, are there any tried and true service level metrics for WTS that are being used out there? In general, there are so many moving parts here that I am sure there are many critical factors to Service Level Automation of WTS that I have not covered, or even considered.