Showing posts with label virtualization. Show all posts
Showing posts with label virtualization. Show all posts

Wednesday, July 16, 2008

Watch out for Cisco, kids!

What is the most important enabler of distributed computing architectures, such as cloud oriented architectures? What is the one thing that has to be in ample supply before the other elements of the data center come into play? Is it the number of servers or CPU power available for computing? Is it the size and speed of the disks and network storage devices? Is it the distributed software architectures themselves?

My answer? None of the above. It's network bandwidth, baby, all the way.

Why? Well, let's break down where the costs of distributed systems lay. We all know that CPU capabilities double roughly every couple of years, and we also know that disk I/O slows those CPUs down, but not at the rate that network I/O typically does. When designing distributed systems, you must first be aware of network latency and control traffic between components to have any chance in heck of meeting rigorous transaction rate demands. The old rule at Forte Software, for what it's worth, was:

  • First reduce the number of messages as much as possible
  • Then reduce the size of those messages as much as possible
Increase adherence to those rules, and your software would outperform less optimized applications every time. It was easy to look like a performance tuning genius in those days.

What is exciting about today's environment, however, is that network technology is changing rapidly. Bandwidth speeds are increasing quickly (though not as fast as CPU speeds), and this high speed bandwidth is becoming more ubiquitous world wide. Inter-data-center speeds are increasingly mind boggling, and WAN optimization apparently has removed much of the fear of moving real-time traffic between geographically disparate environments.

All of this is a huge positive to cloud oriented architectures. When you design for the cloud, you want to focus on a few key things:
  • Software fluidity - The ability of the software to run cleanly in a dynamic infrastructure, where the server, switch port, storage and possibly even the IP address changes day by day or minute by minute.

  • Software optimization - Because using a cloud service costs money, whether billed by the CPU hour, the transaction or the number of servers used, you want to be sure you are getting your money's worth when leveraging the cloud. That means both optimizing the execution profile of your software, and the use of external cloud services by the same software.

  • Scalability - This is well established, but clearly your software must be able to scale to your needs. Ideally, it should scale infinitely, especially in environments with highly unpredictable usage volume (such as the Internet).

Achieving any of these in an environment where your network bandwidth is constricting your options is nearly impossible.

Oh, and one more thing. The network is the first element of your data center that sees load, failure and service level compliance. Think about it--without the eyes of the network, all of your other data center elements become black boxes (though often physically with those annoying beeps and little blinking orange lights). What are the nerves in the data center nervous system? Network cables, I would say.

Today I saw two really good posts about possible network trends driven by the cloud, and how Cisco's new workhorse leverages "virtualized" bandwidth and opens the door to commodity cloud capacity. The first is a post by Douglas Gourlay of Cisco, which simply looks at the trends that got us to where we are today, and further trends that will grease the skids for commodity clouds. I am especially interested in the following observations:
"8) IP Addressing will move to IPv6 or have IPv4 RFCs standardized that allow for a global address device/VM ID within the addressing space and a location/provider sensitive ID that will allow for workload to be moved from one provider to another without changing the client’s host stack or known IP address ‘in flight’. Here’s an example from my friend Dino.

9) This will allow workload portability between Enterprise Clouds and Service Provider Clouds.

10) The SP community will embrace this and start aggressively trying to capture as much footprint as possible so they can fill their data centers to near capacity allowing for them to have the maximum efficiency within their operation. This holds to my rule that ‘The Value of Virtualization is compounded by the number of devices virtualized’.

11) Someone will write a DNS or a DNS Coupled Workload exchange. This will allow the enterprise to effectively automate the bidding of workload allocation against some number or pool of Service Providers who are offering them the compute, storage, and network capacity at a given price. The faster and more seamless the above technologies make the shift of workload from one provider to another the simpler it is in the end for an exchange or market-based system to be the controlling authority for the distribution of workload and thus $$$’s to the provider who is most capable of processing the workload."

The possibility that IP addresses could successfully travel with their software payloads is incredibly powerful to me, and I think would change everything for both "traditional" VM users, as well as the virtual appliance world. The possibility that my host name could travel with my workload, even as it is moved in real time from one vendor to another is, of course, cloud computing nirvana. To see someone who obviously knows something about networking and networking trends spell out this possibility got my attention.

(Those who see a fatal flaw in Doug's vision are welcome to point it out in the comments section below, or on Doug's blog.)

The second post is from Hurwitz analyst, Robin Bloor, who describes in brilliant detail why Cisco's Nexus 7000 series is different, and why it could very well take over the private cloud game. As an architecture, it essentially makes the network OS the policy engine for controlling provisioning and load balancing, though with bandwidth speeds that blow away today's standards (10G today, but room for 40G and 100G standards in the future). Get to those speeds, and all of a sudden something other than network bandwidth is your restricting function in scaling a distributed application.

I have been cautiously excited about the Nexus announcement from the start. Excited because the vision of what Nexus will be is so compelling to me, for all of the reasons I describe above. (John Chambers, CEO of Cisco, communicates that vision in a video that accompanied the Nexus 5000 series launch.) Cautious, because it reeks of old-school enterprise sales mentality, with Cisco hoping to "own" whole corporate IT departments by controlling both how software runs, and what hardware and virtualization can be bought to run it on. Lock-in galore, and something the modern, open source aware corporate world may be a little uneasy about.

That being said, as Robin put it, "In summary: The network is a computer. And if you think that’s just a smart-ass bit of word play: it’s not."

Robin further explains Cisco's vision as follows:

"Cisco’s vision, which can become reality with the Nexus, is of a data center that is no longer defined by computer architecture, but by network architecture. This makes sense on many levels. Let’s list them in the hope of making it easier to understand.

  1. Networks have become so fast that in many instances it is practical to send the the data to the program, or to send the program to the data, or to send both the program and the data somewhere else to execute. Software architecture has been about keeping data and process together to satisfy performance constraints. Well Moore’s Law reduced the performance issue and Metcalfe’s Law opened up the network. All the constraints of software architecture reduced and they continue to reduce. Distributing both software and data becomes easier by the year.
  2. Software is increasingly being delivered as a service that you connect to. And if it cannot deliver the right performance characteristics in the place where it lives, you move it to a place where it can.
  3. Increasingly there is more and more intelligence being placed on the switch or on the wire. Of course Cisco has been adding intelligence to the switch for years. Those Cisco firewalls and VPNs were exactly that. But also, in the last 5 years, agentless sotware (for example some Intrusion Detection products) has become prominent. Such applications simply listen to the network and initiate action if they “don’t like what they hear”. The point is that applications don’t have to live in server blade cabinets. You can put them on switches or you could put them onto server boards that sit in a big switch cabinet. They’re very portable.
  4. The network needs an OS (or NOS). Whether Cisco has the right OS is a point for debate, but the network definitely needs an OS and the OS needs to perform the functions that Cisco’s NX-OS carries out. It also needs to do other things to like optimize and load balance all the resources in a way that corresponds to the service level needs of the important business transactions and processes it supports. Personally, I do not see how that OS can do anything but span the whole network - including the switches."
Would all applications run this way? Probably not. But those mission critical, highly distributed, performance-is-everything apps you provide for your customers, or partners, or employees, or even large data sets, are extremely good candidates for this way of thinking.

Oh, and I wouldn't be surprised if Google, Microsoft, et al. agreed (though not necessarily as Cisco customers).

Does Nexus work? I have no idea. But I am betting that, as private clouds are built, the idea that servers are the center of the universe will be tested greatly, and the incredibly important role of the network will become more and more apparent. And when it does, Cisco may have positioned themselves to take advantage of the fun that follows.

Its just too bad that it is another single-vendor, closed source vendor offering that will take probably 5-7 years (minimum) to replicate in the open source world. At the very least, I hope Cisco is paying attention to Doug's observation that:
"[T]here will be a standardization of the hypervisor ‘interface’ between the VM and the hypervisor. This will allow a VM created on Xen to move to VMWare or Hyper-V and so on."
I hope they are openly seeking to partner with OVF or another virtualization/cloud standard to ensure portability to and from Nexus.

However, I would rather have this technology in a proprietary form than not at all, so way to go Cisco, and I will be watching you closely--via the network, of course.

Sunday, July 06, 2008

Which Sun Do You Orbit?

I love cloud computing. I love the concept, I love many of the implementations, and I love the opportunity that such a major disruption creates for entrepreneurs and tech giants alike. There is much to be excited about, though the market is in its infancy.

Or markets, if you look closely. Simon noted that at his Opscon presentation this year he ended up on the receiving end of an extended diatribe from a gentleman who was arguing determinately that software would never be portable between Amazon EC2 and Google AppEngine (which is probably very true). Simon's response was right on the money:

"I must admit I was somewhat perplexed at why this person ever thought they would and why they were talking to me about it. I explained my view but I also thought that I'd reiterate the same points here.

From the ideas of componentisation, the software stack contains three main stable layers of subsystems from the application to the framework to hardware. This entire software stack is shifting from a product to a service based economy (due to commoditisation of IT) and this will eventually lead to numerous competitive utility computing markets based upon open sourced standards at the various layers of this stack.

These markets will depend upon substitutability (which includes portability and interoperability) between providers. For example you might have multiple providers offering services which match the open SDK of Google App Engine or another market with providers matching Eucalyptus. What you won't get is substitutability from one layer of the stack (e.g. the hardware level where EC2 resides) to another (e.g. the framework level where GAE resides). They are totally different things: apples and pears."

I want to take Simon's "stack" theory and refine it further. Look at the layers of the stack, and note that there appears to be a relatively small number of companies in each that can actually drive a large following to their particular set of "standards". In the platform space, of course Google's python-focused (for now) restricted library set is where much of the focus is, but no one has counted out Bungie Labs yet, nor is anyone ignoring what Yahoo might do in this space. Each vendor has their framework (as Simon rightly calls the platform itself), but each has a few followers building tools, extensions, replications and other projects aimed at both benefiting from and extending the benefits of the platform. The diagram below identifies many of the current central players, or "suns", that exist in each technology stack today:

Credit: Kent Langley, ProductionScale

I call these communities of central players and satellites "solar systems" (though perhaps it would be more accurate to call them "nodes and edges", as we will see later).

In each solar system--say the Google AppEngine solar system--you will find an enthusiastic community of followers who thoroughly learn the platform, push its limits, and frequently (though not in every case) find economic and productivity benefits that keep them coming back. Furthermore, the most successful satellite projects will attract their own satellites, and an ever changing environment will form, though the original central players will likely maintain their role for decades (basically until the market is disrupted by an even better technical paradigm).

You already see a very strong Amazon system forming. RightScale, Enomalism and ELASTRA, are all key satellites to AWS's sun. Now you are even starting to hear about satellites of satellites in that space, such as GigaSpace's partnership with RightScale. However, if you look closely at this system, you begin to see the breakdown in the strict interpretation of this analogy, as several of the players (CohesiveFT's ElasticServer On-Demand, for example) starting to address multiple suns in a particular "stack". Thus my earlier comment that perhaps a nodal analogy is somewhat better.

The key here is that for some time from now, technologies created for the cloud will be attached to one or two so-called solar systems in the stack the technology addresses. Slowly standards will start to appear (as one solar system begins to dominate or subsume the others), and eventually the stack will play as a commodity market, though (I would argue) still centered around one key player. By the time this happens, some cross pollination of the stacks themselves will start happen (as has already happened with the prototype of GAE running in EC2), at which point new gaps in standards will be identified. This is going to take probably two decades to play out entirely, at which point the cloud market will probably already face a major disruptive alternative (or "reinvention").

I say this not to be cynical nor to pontificate for pontification's sake. I say this because I believe developers are already starting to choose their "solar system", and thus their technological options are already being dictated by which satellite technologies apply to their chosen sun. Recognizing this as OK, in fact natural to the process, and acknowledging that religious wars between platforms--or at least stacks--is kind of pointless, will make for a better climate to accelerate the consolidation of technical platforms into a small set of commodity markets. Then the real fun begins.

Of course, I'm a big fan of religious wars myself...

Tuesday, June 24, 2008

"Follow the Law" Meme Hits the Big Time

A few days ago, I checked in to my w3counter dashboard to see who was linking to my blog, and I discovered an very intelligent continuation of the "Follow the Law Computing" meme written by Greg Ness (also found on his blog). Greg's addition of the "spice trails" analogy was something new to me, and raised some interesting thoughts about what the historical significance of the cloud will be to world wide wealth distribution. There certainly has been a limited but significant wealth effect created by the Internet itself, but will the ability to physically move data and/or compute loads accelerate these trends?

Noting that I should blog about this on the plane at some point during my trip to Austin this week, I dutifully bookmarked the article for later. I had no chance to look at traffic on Monday, so it was with great shock that when I got on line this morning I saw a hockey stick graph. I investigated, and then my heart skipped a beat.

As of now, today, quotes from my "Follow the Law" post make up Nick Carr's latest post. Nick weaves together the work of Bill Thompson (which I also reference), myself and Greg to provide a clear, concise discussion of the concept of what he calls "itinerant computing". (Damn, he's good at coining these terms, isn't he?)

Ever since I discovered Nick's blog early in my career at Cassatt, I've wanted to get his attention. The Big Switch was an eye opening read--if only it served as a good counterpoint to Bill Coleman's optimistic vision. He made me look at utility computing and cloud computing with a more critical eye, and I wanted to add to his body of knowledge. I am honored to have done so in a small way.

Surprisingly, though, that wasn't whole the hockey stick trigger. Greg's post was picked up by a site called Seeking Alpha, a site I must admit I had never heard of before. Apparently a high traffic investment site (connected to Jim Cramer?), Seeking Alpha drove a record traffic load to my humble blog through a rebroadcast of Greg's post. Rereading that, I noticed that there is a very strong business message there that may in fact be actual historical significance of "itinerant computing": the flow of data and computing is simply an enabler of new business models and competitive advantages that change the face of global wealth. Being a resident of what is essentially a suburb of the Silicon Valley, I can't help but think there is more downside than upside to that story.

Finally, as I looked at the other referrers to this blog, I found an excellent summary of all of the "Follow" computing options: Follow the Sun, Follow the Moon and Follow the Law. Kevin Kelly gives very good basic definitions of each concept, and then makes the following observation:

"Most likely different industries adopt a different scenario. Maybe financial follows the moon, while commerce follows the sun, and entertainment follows the law. A single computing environment (One Machine) should not suggest homogeneity. A meadow is not homogeneous, but its does act as a coherent ecological system.

Another way to dissect the daily rhythm of the One Machine is to trace the three distinct waves of energy, data, and computation as they flow through the planetary "cloud." Each probably has its own pathways."

Amen, brother. I'll go even further. Maybe the customer server systems of a financial company follows the sun, the analytics systems follow the moon, and the trading systems follow the law. I do not mean to suggest at all that every distributed compute task will benefit from follow the law concepts. In fact, I would suggest that there are other "Follow" options that will be created over the coming decades.

All of this leads to the question of software fluidity...

Thursday, May 22, 2008

Cassatt Announces Active Response 5.1 with Demand Base Policies

Ken Oestreich blogged recently about the very cool, probably landmark release of Cassatt that just became available, Cassatt Active Response 5.1. He very eloquently runs down the biggest feature--demand based policies--so I won't repeat all of that here. What I thought I would do instead is relate my personal thoughts on monitoring based policies and how they are the key disruptive technology for data centers today.

To be sure, everyone is talking about server virtualization in the data center market today, and that's fine. It's core short-term benefit, physical system consolidation and increased utilization is key for cost-constrained IT departments, and features such as live motion and automatic backup are creating new opportunities that should be carefully considered. However, virtualization alone is limited in its applications, and does little to actually optimize a data center over time. (This is why VMWare is emphasizing management over just virtualizing servers these days.)

The technology that will make the long term difference is resource optimization: applying automation technologies to tuning how and when physical and virtual infrastructure is used to solve specific business needs. It is the automation software that will really change the "deploy and babysit" culture of most data centers and labs today. The new description will be more like "deploy and ignore".

To really optimize resource usage in real time, the automation software must use a combination of monitoring (aka "measure"), a policy engine or other logic system (aka "analyze") and interfaces to the control systems of the equipment and software it is managing (aka "respond"). It turns out that the "respond" part of the equation is actually pretty straight forward--lots of work, but straight forward. Just write "driver" like components that know how to talk to various data center equipment (e.g. Windows, DRAC, Cisco NX-OS, NetApp Data ONTAP, etc.), as well as handle error conditions by directly responding or forwarding the information to the policy engine.

The other two, however, require more immediate configuration by the end user. Measure and analyze, in fact, are where the entire set of Service Level Automation (SLAuto) parameters are defined and executed on. So, this is where the key user interface between the SLAuto system and end user has to happen.

What Cassatt has announced is a new user interface to define demand based policies as the end user sees fit. For example, what defines an idle server? Some systems use very little CPU while they wait for something to happen (at which point they get much busier), so simply measuring CPU isn't good enough in those cases. Ditto for memory in systems that are compute intensive but handle very little state.

What Cassatt did that is so brilliant (and so unique) is to allow the end user to leverage the full range of SNMP attributes for their OS, as well as JMX and even scripts running on the monitored system to create expressions that define an idle metric that is right for that system. For example, on a test system you may in fact say that a system is idle when the master test controller software indicates that no test is being run on that box. On another system, you may say its idle when no user accounts are currently active. Its up to you to define when to attempt to shut down a box, or reduce capacity for a scale-out application.

Even when such an "idle" system is identified, Cassatt gives you the ability to go further and write some "spot checks" to make sure they system is actually OK to shut down. For example, in the aforementioned test system, Cassatt may determine that its worth trying to power down a system, but a spot check could be run to determine if a given process is still running, or an administrator account is currently actively logged in to the box that would indicate to Cassatt that it should ignore that system for now.

I know of no one else that has this level of GUI configurable monitor/analyze/respond sophistication today. If anyone wants to challenge that, feel free. Now that I no longer work at Cassatt, I'd be happy to learn about (and write about) alternatives in the marketplace. Just remember that it has to be easy to configure and execute these policies, and scripting the policies themselves is not good enough.

It is clear from the rush to release resource optimization products for the cloud, such as RightScale, Scalr, and others, that this will be a key feature for distributed systems moving forward. In my opinion, Cassatt has launched itself into the lead spot for on premises enterprise utility computing. I can't wait to see who responds with the next great advancement.

Disclaimer: I am a Cassatt shareholder (or soon will be).

Wednesday, April 09, 2008

What Google App Engine is NOT

Simon Wardley wrote a post discussing the Google App Engine announcement as a "first step" for them in the "the web as an operating system space". Simon is right, but as I commented on the post:

As I just noted on my blog, perhaps it is critical to look at this from the perspective of web businesses, rather than from enterprise IT's perspective. From the former angle, this is disruptive and revolutionary; from the latter, its a no-op at this point, except perhaps for externally facing web apps.
Simon then wrote an interesting post in response, describing the opportunity that Google has created by open sourcing the App Engine SDK. His core premises can be summed up in the following quote:
Now, whilst Google hasn't provided their environment as open sourced, it has provided an open sourced SDK that "emulates all of the App Engine services on your local computer". This appears, though I'm not a python expert, to contain all the primitives and information needed to build a compatible environment to GoogleAppEngine. This allows for companies, vendors and ISPs to create competing but compatible systems. It's almost as if Google has offered a blueprint for a web operating environment and asked the rest of the community to come compete with them.
And here I have to say, "Well, true, as far as web application hosting goes. But we all know the enterprise is WAY more than that." I think if a commercial product came out that allowed anyone to build a high-scale web environment, with data storage, development tools and operations interfaces within their own infrastructure, that would be very cool. But, as someone who really understands the utility computing space, I want everyone to be clear that this wouldn't help scalability or optimizing resource usage in the following key IT areas:
  1. Portal Services - Yes, an archaic concept to some, but still a critical strategy for delivering work functionality and key information to most knowledge workers. Note that Google does not provide portal support, nor support ANY standard portal interfaces, though you may be able to hack that in Python.

  2. SOA architectures - While it is theoretically possible to build a REST service in App Engine, there is no mechanism to host any other form of services. Yes, you could theoretically leverage services external to the Python app, but this would probably require services and GUI to be located in the same network, to avoid latency issues. Not to mention the fact that there is nothing resembling a messaging infrastructure, or Enterprise Service Bus.

  3. Business Process Automation - This is one of key tactics for gaining business agility, in my opinion, and while I wouldn't doubt someone will write an app to do BPA/I in App Engine, it will be expensive from a resource usage perspective (lots of in/out traffic, storage for quiesced processes and so on).

  4. EAI - Enterprise integration is still the most customized element of IT today, and, as noted in the last two points, there is nothing provided by Google at this point to help with data or application level integration; no data transformation (ala Informatica), no messaging engine, no business process automation, etc., etc., etc.

  5. HPC - Yes, Google is amazingly scalable, but they went out of their way to insist that App Engine is not a grid. It is not designed to--nor do you have the quota to allow you to--send arbitrary compute intensive jobs to the engine for processing.

  6. Server and desktop virtualization - No one does desktop in the cloud today, as far as I know, but Google doesn't even provide virtual servers--useful for hosting and maintenance of legacy applications, if nothing else. I suppose you could run out and convert your productivity apps to Google Apps, your email to GMail, etc., but what about print services?
Not to mention the fact that Google provides no service level guarantees (though I think they will probably do something here when they go GA), no premium support, no integration services, no live customer support (that I know of); in other words, there is a distinct lack of a "throat to choke" here.

Thus, I think most enterprises need to look at Amazon and Google services as just that--services that can be leveraged within their own architectures when it makes sense, rather than wonder-tools that can replace their entire IT infrastructure expenditure. Again, there is probably more bang for the buck today in converting that existing infrastructure into a utility, unless your data center hosts only web-facing applications...but then there is the expense of rewriting them entirely in Python, which may cancel out a tremendous amount of the cost benefits of using App Engine.

So, Simon, I share your excitement about the future of scalable web applications, but my point remains--this is largely a no-op for most enterprise IT organizations.

Friday, April 04, 2008

John Willis Honors Me with Inaugural Cloud Cafe Podcast

I am the inaugural guest in John Willis's Cloud Cafe podcast series. I couldn't be more honored.

Those of you who have been following this whole "what is Cloud Computing" debate may have had the opportunity to see the conversations between several bloggers regarding how to define cloud computing and related technologies. John Willis, of the John Willis ESM Blog, is making a key contribution by taking on the challenge of classifying vendors in this space. As I had some issues with his classification of Cassatt, he thought the best way to resolve that was to invite me to launch his new series.

Two things were resolved in this podcast.

First, I learned first hand what a classy guy John is. He handled the interview very well, let me talk my butt off (a talent I got from my minister mother, I think) and had several observations over the course of the conversation that showed his tremendous experience in the enterprise systems management space. I feel quite sheepish that I ever hinted that he wasn't being forthright with his audience. Lesson gratefully learned; apology gladly offered.

Second, John and I were always much closer in our visions of cloud computing, utility computing and enterprise systems than it might have appeared at first. Our conversation raged from the aforementioned "what is cloud computing" question, to topics such as:

  • the relationship between cloud and utility computing,
  • the cultural challenge facing enterprises seeking the economic returns of these technologies,
  • how cloud and utility computing revolutionize performance and capacity planning, and
  • where Hadoop and CloudDB fit into all of this.
In the end, I think John and I agreed that cloud computing is more than just virtualization on the Internet. I very much enjoyed the conversation, and I hope you will take the time to listen to this podcast.

Got questions or comments? Post them here or on John's blog; I will check both.

Finally, I will be working to get Cassatt's entry in John's classifications updated as a result of the discussion.

Tuesday, January 29, 2008

One Step To Prepare For Cloud Computing

Some of you may be wondering why I am making such a big stink about software architecture on a blog about service level automation (SLAuto). Well, as Todd Biske points out, "the relationships (and potentially collisions) between the worlds of enterprise system management, business process management, web service management, business activity monitoring, and business intelligence" are easier to resolve if the appropriate access to metrics is provided for a software service. For SLAuto, this means the more feedback you can provide from the service, process, data and infrastructure levels of your software architecture, the easier it is to automate service level compliance.

Let's look at a few examples for each level:

  • Service/Application: From the end user's perspective, this is what service levels are all about. Key metrics such as transaction rates (how many orders/hour, etc.), response times, error rates, and availability are what the end users of a service (e.g. consumers, business stakeholders, etc.) really care about.
  • Business Process: Business process metrics can warn the SLAuto environment about cross-service issues, business rule violations or other extraordinary conditions in the process cycle that would warrant capacity changes at the BPM or service levels.
  • Data Storage/Management: Primarily, this layer can inform the SLAuto system about storage needs and storage provisioning, which in turn is critical to automated deployment of applications into a dynamic environment.
  • Infrastructure: This is the most common form of metric used to make SLAuto decisions today. Such metrics as CPU utilization, memory utilization and I/O rates are commonly used in both virtualized and non-virtualized automated environments.

As noted, digital measurement of these data points can feed an SLAuto policy engine to trigger capacity adjustment, failure recover or other applicable actions as necessary to remain within defined service thresholds. While most of the technology required to support SLAuto is available, the truth is that the monitoring/metrics side of things is the most uncharted territory. As an action item, I ask all of you to take Todd's words of wisdom into account, and design not only for functionality, but also manageability. This will aid you greatly in the quest to build fluid systems that can best take advantage of utility infrastructure today.

Wednesday, December 05, 2007

Oracle makes DBs fluid(?)

Well, Oracle is making a play at making databases portable via virtualization. This was a problem in the pure VMWare world, as no one was comfortable with running their production databases in a VM. I'm not saying that is instantly solved by any means, but "certified by Oracle" is a hell of pitch...

http://www.networkworld.com/community/node/21849

How fluid is your software?

I come from a software development background, and I can never quite get the itch to build the perfect architecture out of my system. That's partly why it is so hard for me to blog about power, even though it is an absolutely legitimate topic, and a problem that needs to be attacked from many fronts. However, power is not a software issue, it is a hardware and facilities issue, and my heart just isn't there when it comes to pontificating.

All is not lost, however, as Cassatt still plays in the utility computing infrastructure world, and
I get plenty of exposure to dynamic provisioning, service level automation (SLAuto) and the future of capacity on demand. And to that end, I've been giving a lot of thought to the question of what, if any, software architecture decisions should be made with utility computing in mind.

While a good infrastructure platform won't require wholesale changes to your software architecture (and none at all, if you are willing to live with consequences that will become obvious later in this discussion), the very concept of making software mobile--in the changing capacity sense, not the wireless device sense--must lead all software engineers and architects to contemplate what happens when their applications are moved from one server to another, or even one capacity provider to another. There are a myriad of issues to be considered, and I aim to cover just a few of them here.

The term I want to introduce to describe the ability of application components to migrate easily is "fluidity". The definition of the term fluidity includes "the ability of a substance to flow", and I don't think its much of a stretch to apply the term to software deployments. We talk about static and dynamic deployments today, and a fluid software system is simply one that can be moved easily without breaking the functionality of the system.

An ideally fluid system, in my opinion, would be one that could be moved in its entirety or in pieces from one provider to another without interruption. As far as I know, nobody does this. (3TERA claims they can move a data center, but as I understand it you must stop the applications to execute the move.) However, for purely academic reasons, let's analyze what it would take to do this:


  1. Software must be decoupled from physical hardware. There are currently two ways to do this that I know of:
    • Run the application on a virtual server platform
    • Boot the file system dynamically (via PXE or similar) on the appropriate capacity
  2. Software must loosely coupled from "external" dependencies. This means all software must be deployable without hard coded reference to "external" systems on which it is dependent. External systems could be other software processes on the same box, but the most critical elements to manage here are software processes running on other servers, such as services, data conduits, BPMs, etc.
  3. Software must always be able to find "external" dependencies. Loose coupling, as most of you know, is sometimes easier said than done, especially in a networked environment. Critical here is that the software can locate, access and negotiate communication with the external dependencies. Service registries, DNS and CMDB systems are all tools that can be used to help systems maintain or reestablish contact with "external" dependencies.
  4. System management and monitoring must "travel" with the software. Its not appropriate for a fluid environment to become a Schrodinger's box, where the state of the system becomes unknown until you can reestablish measurement of its function. I think this may be one of the hardest requirements to meet, but at first blush I see two approaches:
    • Keep management systems aware of how to locate and monitor systems as they move from one system to another.
    • Allow monitoring systems to dynamically "rediscover" systems when they move, if necessary. (Systems maintaining the same IP address, for instance, may not need to be rediscovered.)

This is just a really rough first cut of this stuff, but I wanted to put this out there partly to keep writing, and partly to get feedback from those of you with insights (or "incites") into the concept of software fluidity.

In future posts I'll try to cover what products, services and (perhaps most importantly) standards are important to software fluidity today. I also want to explore whether "standard" SOA and BPM architectures actually allow for fluidity. I suspect they generally do, but I would not be suprised to find some interesting implications when moving from static SOA to fluid SOA, for instance.

Respond. Let me know what you think.

Tuesday, November 06, 2007

Beating the Utility Computing Lockdown, Part 2

Well, not long after I posted part 1 of this series, Bert noted that he agreed with my assessment of lock-in, then preceded to note how his (competitive to my employer's) grid platform was the answer.

Now, Bert is just having fun cross promoting on a blog with ties to a competitor, but I think its only fair to note that no one has a platform that avoids vendor lock-in in utility computing today. The best that someone like 3TERA (or even Cassatt) can do is give you some leverage between the organizations that are utilizing their platform; however, to get the portability he speaks of, you have to lock your servers, (and possibly load balancers, storage, etc-etc-etc) into that platform. (Besides, as I understand it, 3TERA is really only portable at the "data center" level, not the individual server level. I suppose you could define a bunch of really small "data centers" for each application component, but in a SOA world, that just seems cumbersome to me.)

Again, what is needed is a truly open, portable, ubiquitous standard for defining virtual "components" and their operation level configurations that can be ported and run between a wide variety of virtualization, hardware and automation platforms. (Bert, I've been working on Cassatt--are you willing to push 3TERA to submit, cooperate on and/or agree to such a standard in the near future?) As I said once before, I believe the file system is the perfect place to start, as you can always PXE boot a properly defined image on any compatible physical or virtual machine, regardless of the vendor. (This is true for every platform except for Windows--c'mon Redmond, get with the program!) However, I think the community will have the final say here, and the Open Virtual Format is a hell of a start. (It still lacks any tracking of operation level configurations, such as "safe" CPU and memory utilization thresholds, SNMP traps to monitor for heartbeats, etc.)

Unfortunately, those standards aren't baked yet. So, here's what you can do today to avoid vendor lock-in with a capacity provider tomorrow. Begin with a utility computing platform that you can use in your existing environment today. Ideally, that platform:

  1. Does not require you to modify the execution stack of your application and server images (e.g.
    • no agentry of any kind that isn't already baked into the OS,
    • no requirement to run on virtualization if that isn't appropriate or cost effective,
  2. Uses a server/application/whatever imaging format that is open enough to "uncapture" or translate to a different format by hand if necessary--again, I like our approach of just capturing a sample server file system and "generalizing" it for replication as needed. It's reversible, if you know your OS well.)
  3. Is supported by a community or business that is committed to supporting open standards wherever appropriate and will provide a transition path form any proprietary approach to the open approach when it is available.

I used to be concerned that customers would ask why they should convert their own infrastructure into a utility (if it was their goal to use utility computing technology to reduce their infrastructure footprint). I now feel comfortable that the answer is simply because there is no safe alternative for large enterprises at this time. Leave alone the issue of security (e.g. can you trust your most sensitive data to S3), and the fact that there is little or no automation available to actually reduce your cost of operations in such an environment, there are many risks to consider with respect to how deeply you are willing to commit to a nascent marketplace today.

I encourage all of you to get started with the basic concepts of utility computing. I want to talk next about ways to cost justify this activity with your business, and talk little about the relationship between utility computing and data center efficiency.

Monday, October 15, 2007

Is your software ready for utility computing?

I've been seeing more thoughts on the effect of utility computing on software architectures lately, and one very well stated argument comes from Alistair Croll, Vice President of Product Management and co-founder of Coradiant, a performance tool company. Though clearly self-serving, his message is simple: if you are going to pay by the cycle--or even just share cycles between applications--you'd better make sure your software takes as few cycles as possible to do its job well.

This is one of the unforeseen effects of "paying for what you use", and I have to say its an effect that should scare the heck out of most enterprise IT departments. Although I would argue part of that fear should come from the exposure of lousy coding in most custom applications, the worst part is the lack of control most organizations will have over the lousy coding in the packaged applications they purchased and installed. Suddenly, algorithms matter again in all phases of software development, not just computing intensive steps.

The worst offender here will probably the the user interface components: Java SWING, AJAX and even browser applications themselves. To the extent that these are hosted from centralized computing resources (and even most desktops fall into this category in the some visionaries' eyes), then the incredible amount of constant cycling, polling and unnecessary redrawing will be painfully obvious in the next 10 years or so.

I have always been a strong proponent for not over-engineering applications. If you can meet the business's ongoing service levels with an architecture that cost "just enough" to implement, you were golden in my book. However, utility computing changes the mathematics here significantly, and that key phrase of "meet the business's ongoing service levels" comes much more into play. Ongoing service levels now include optimizations to the cost of executing the software itself; something that could be masked in a underutilized, siloed-stack world.

The performance/optimization guys must be loving this, because they now have a product that should see immediate increase in demand. If you are building a new business application today, you had better be:

  1. Building for a service-based, highly distributed, utility infrastructure world, and
  2. Making sure your software is a cheap to run as possible.

Number 2 above itself implies a few key things. Your software had better be:

  • as standards based as possible--making it possible for any computing provider to successfully deploy, integrate and monitor your application;
  • as simple to install, migrate and upgrade remotely as possible--to allow for cheap deployment into a competitive computing market;
  • as efficient to execute as possible--each function should take as few cycles as possible to do its job

The cost dynamics will be interesting to note, especially their effects on the agile processes, SOA, and ITIL movements. I will keep a careful tab on this, and will share my ongoing thoughts in future posts.

Monday, September 24, 2007

Service-Oriented Everything...

Agility Principle: Service-Oriented Network Architecture (eBiz: Mark Milinkovich, Director, Service-Oriented Network Architecture, Cisco Systems): Cisco is touting the network as the center of the universe again, but this article is pretty close to the truth about software and infrastructure architectures we are moving to. Most importantly, Mark points out that there is a three layer stack that actually binds applications to infrastructure:

  • Applications layer - includes all software used for business purposes (e.g., enterprise resource planning) or collaboration (e.g., conferencing). As Web-based applications rely on the Extensible Markup Language (XML) schema and become tightly interwoven with routed messages, they become capable of supporting greater collaboration and more effective communications across an integrated networked environment.

  • Integrated network services layer - optimizes communications between applications and services by taking advantage of distributed network functions such as continuous data protection, multiprotocol message routing, embedded QoS, I/O virtualization, server load balancing, SSL VPN, identity, location and IPv6-based services. Consider how security can be enhanced with the interactive services layer. These intelligence network-centric services can be used by the application layer through either transparent or exposed interfaces presented by the network.

  • Network systems layer - supports a wide range of places in the network such as branch, campus and data center with a broad suite of collaborative connectivity functions, including peer-to-peer, client-to-server and storage-to-storage connectivity. Building on this resilient and secure platform provides an enterprise with the infrastructure on which services and applications can reliably and predictably ride.
Of course, he's missing a key layer:
Physical infrastructure layer - represents the body of physical(and possibly virtual) infrastructure components that support the applications,network services and network systems, not to mention the storage environment,management environment and, yes, Service Level Automation (SLAuto) environment.
It is important to note that, while the network may becoming a computer in its own right, it still requires physical infrastructure to run. And all of these various application, integrated network, and network systems services that Mark mentions not only depend on this infrastructure, but can actually be loosely coupled to the physical layer in a way that augments the agility of all four layers.

For example, imagine a world where your software provisioning is completely decoupled from your hardware provisioning. In other words, adding an application to your production data center doesn't require you to predict exactly what load the application is going to add to the network, server or storage capacity. Rather, you simply load the application into the SLAuto engine, let traffic start to arrive, measure the stress on existing capacity, and order additional hardware as required. Or, better yet, order hardware at the end of a quarter based on trend analysis from the previous quarter. No need for the software teams and the hardware teams to even talk to each other.

I will admit that it is unlikely that many IT departments will ever get to that "pie-in-the-sky" scenario--for some the risk of not guessing high enough on capacity overwhelms the cost of predicting short to medium term load. However, SLAuto allows you to get past the problems of siloed systems, such as "hitting the ceiling" in allocated capacity. Even if the SLAuto environment runs out of excess physical capacity, it can borrow the capacity it needs for high priority systems from lower priority applications.

The best part is that, since the SLAuto environment tracks every action it takes, there are easy ways to get reports showing everything from capacity utilization trend analysis to cost of infrastructure for a given application.

Back to Mark's article, though. It is good to see some consensus in the industry on where we are moving, even if each vendor is trying to spin it as if they are the heart of the new platform. In the end though, if the network is indeed the computer, the network and the data center will need operating systems. Mark has entire sections dedicated to designing for application awareness (this is where most data center automation technologies fall woefully short), and designing for virtualization (including all aspects of infrastructure virtualization). He is right on the money here, but there needs to be something that coordinates the utilization of all of these virtualized resources. This is where SLAuto comes in.

Most importantly, don't forget to integrate SLAuto into all four layers. Make sure that each "high" layer talks to the layers below it in a way that decouples the higher layer from the lower layer. Make sure that each lower layer uses that information to determine what adjustments it needs to make (including, possibly, to send the information to an even lower layer). And make sure your physical infrastructure layer is supported by an automation environment that can adjust capacity usage quickly and painlessly as applications, services and networks demand.

As you prepare your service oriented architecture of the future, don't forget the operations aspects. We are on the brink of an automated computing world that will change the cost of IT forever. However, it will only work for you if you take all of the components involved in meeting service levels/operation levels into account.

Thursday, August 16, 2007

Links - 08/16/2007

Convergence of Virtualization and Green Data Center Trends Could Be Perfect Timing for Microsoft (ITBusinessEdge: Kachina Dunn): Kachina notes that Microsoft may gain the most from the way the virtualization market is setting up. Combine that with the comments that Chris Kanaracus made about Microsoft's "compute cloud" strategy, and you begin to get the feeling that Mr. Softy may out-engineer its rivals in utility computing technology. I hope not, because (like VMWare) they are still obsessed with locking people into their platform. We need that utility computing portability standard!

Ozzie Reveals More Details of Cloud Development Platform (RedmondDeveloper: Chris Kanaracus): Another good breakdown of Microsoft's "PaaS" (Platform as a Service) play. Ray's comments at the end of the article give me the feeling that Google, Amazon, Salesforce.com and others are not free and clear of MSFTs influence, yet.

"We believe we are the only company with the platform DNA that's necessary to viably deliver this highly leveragable platform approach to services. And we're certainly one of the few companies that has the financial capacity to capitalize on this sea change, this services transformation."

Grid computing: Term may fade, but features will live on (ComputerWorld: Barbara DePompa): Barbara discusses the view of many that the term "grid computing" may go extinct in the face of virtualization and utility computing. My own opinion is that "grid" has actually had its definition narrowed back to its roots: grid computing platforms provide resource allocation to job-based computing processes, like batch data processing, image rendering and HPC. Utility computing is the term that applies to all "computing on demand" applications, including grid applications.

Automation and going green in the data center (NetworkWorld: NewDataCenter): Someone at NetworkWorld saw the light at Next Generation Data Center in San Francisco earlier this month. I just wanted to welcome them aboard, and invite them to explore the importance of SLAuto to both automation and green practices.

Friday, July 20, 2007

Where's the standard, bub?

Simon Wardley's Bits and Pieces blog has an interesting post breaking down what he thinks are the three key markets for utility computing:

  • Saas: Software as a Service

  • FaaS: Frameworks as a Service

  • HaaS: Hardware as a Service
Such as it goes, this is OK, but I think his most interesting comments surrounded the need for Common Service Providers in each of these areas, and the need for portability across those providers, and the mechanisms that he thinks will drive the standards that will enable portability. To quote:

The issues and the needs of a competitive utility computing market are also the same at each level - portability, multi-providers and agreed standards and solves the same class of problems - disaster recovery, scalability, efficiency and exist costs.

In today's world the fastest way to achieve a standard is not through committee, conversation or whitepapers but through the release and adoption of not only a standard but also an operational means of achieving a standard.

Hence such utility computing standards will only be achieved through the use of open source, without any one CSP being strategically disadvantaged to any owner of the standard.

To be sure, this is controversial, but it aligns nicely with an observation that I and others have had about Xen, and why it may struggle to supersede VMWare, despite being freely distributed by just about every major OS vendor out there.

One of my colleagues put it best in an email:

As to the larger question of why Xen is failing miserably, I would like to profess this opinion -- Storage. The Xen / KVM / Linux / RedHat community botched storage. Hence, they are failing in the marketplace.

To elaborate:

Virtualization brings two important benefits:
  1. Seperates the OS+App stack from the underlying hardware

  2. Enables you to package the OS+App stack into a VM that you can fling around with ease....this is the storage angle.
Xen accomplished (1) reasonably well.

Xen failed miserably with (2), i.e Storage. VMware solved the storage issues admirably well. So long as Xen solutions do not work well in the "copy virtual disk files around and run them anywhere" model, Xen will not succeed.
In other words, Xen has issues with portability by not providing a file representation of virtual machine storage that can be moved between disparate physical systems with ease. I know there are some virtual appliance vendors out there that do Xen, so maybe the problem is solved with their technologies. However, there is no standard proposed by Xen, and thus there is no portable standard for Xen VMs as files.

Alas, VMWare has a nice portable file representation of a VM. Granted, there are portability issues there, as well, but by and large VMWare has a much better solution to portability--within VMWare hosts. Unfortunately, there is still no solution (that I know of) that will run the same file system on both VMWare and Xen. Thus, no universal portability is coming soon from the VM space.

Recently, I have been telling anyone who will listen that this nascent utility computing market is still searching for a standard for server (VM/framework/application/whatever) portability across disparate utility computing service providers. I like the concept of a virtual appliance, but we need a (non-proprietary) standard, or we need another portability mechanism besides VMs. (As a side note to my new friends at eCloudM--this is definitely an opportunity, though it may not meet your criteria.)

Otherwise, utility computing will be "choose your vendor and build your software accordingly", not "build your software as you like and choose any vendor you want".

Please, if I am way off, correct me with a comment or a blog post. I would love to find out I am wrong about this...

Monday, May 07, 2007

VMWare TSX and Reducing Complexity in the Data Center

I had a busy last few days of the week last week. On Wednesday, I attended VMWare TSX in Las Vegas, and on Friday, I had the chance to hear Bill Coleman speak about utility computing, and the events that have lead up to its sudden resurgence in the market place.

All in all, TSX was one of the most informative VMWare events I have ever attended. I only had the chance to attend three sessions--CPU scheduling, ESX networking and DRS/HA--but all three were packed with useful information. (The slides linked here are from a TSX conference in Nice, Italy, April 3-5, 2007. They are a little different from the slides I saw, but are similar enough to communicate the basic concepts.)

If you don't know much about VMWare CPU scheduling, check out that deck. Sure, its basic scheduling stuff, but it is very helpful when it comes to understanding how VMWare settings affect processor share. The networking deck is also critical if you must deploy network applications to virtual machines.

The DRS/HA deck has some helpful tips, but also clearly demonstrates the limited scale of DRS/HA. A 16 physical node limit per HA cluster, for example, is going to be problematic for most medium to large data centers. Furthermore, these are very server-centric technologies; the concept of Service Level Automation is clearly missing, as there is no concept at all of a service or application to be measured. They are hinting at a few new app-level monitors in a later release, but I just don't think monitoring service levels from a business perspective is very important to VMWare.

Bill's speech to the IT department of a large manufacturer was very interesting, if for no other reason than it clearly spelled out the argument for reducing complexity in the data center. (For a quick and dirty argument, see this article.) We are definitely at a cross roads now; IT can choose to attack complexity with people or technology. Most of us are betting technology will win. Furthermore, Bill told the assembled techies, the early adopters of any platform technology get the best jobs when that platform becomes mainstream. Almost nobody is predicting that utilty computing will fail in the long term, so now is the time to jump aboard and get involved.

I should have some time to complete the Service Level Automation Deconstructed series this week. Stay tuned for more.

Monday, April 16, 2007

Two articles mentioning Service Level Automation

I recently set up Google Alerts to inform me about references to Service Level Automation on the web. There were many articles returned this week, (many of which involved Cassatt), but I found two additional articles of note. Each makes mention of Service Level Automation, and represents the growing interest in this approach.

The first is from the March 2006 issue of ACM Queue, entitled "Under New Management". The article was written by Duncan Johnston-Watt, the founder of Enigmatec. Johnston-Watt does an excellent job of outlining basic issues around one possible architecture for an autonomic data center. As expected for Enigmatec, its a policy automation focused approach, and is, in fact, one of the few articles from a policy engine vendor that I have see where the term Service Level Automation is used correctly.

Unfortunately, I don't necessarily agree that Johnston-Watt's architecture is optimal enterprise data centers. (It requires development of process automation flows to "optimize operational processes"--a significant amount of work that is prone to introducing new inefficiencies. It is also agent based, which alters the footprint of the software stacks being run in the data center, and can negatively affect the execution and architecture of the applications being managed.) All in all, though, there is some excellent information here for those thinking about Service Level Automation holistically, across the entire data center.

The other article, entitled "Virtually Speaking: Xen Achieves Higher Enterprise Consciousness" was published April 6, 2007 on ServerWatch. In the last few paragraphs of the article, uXcomm's aquisition of Virtugo is covered. In it, uXcomm claims the combination their Xen management tools and Virtugo's VMWare tools "fills a gap not just in uXcomm's portfolio but in the virtual landscape as well. Until now [...] there was a gap between service-level automation offerings and performance management products."

Hmmm. Not sure how providing SLA for only virtual servers counts as filling gaps...but, even so, I hope uXcomm is aware that everyone in this space realizes the need for resource optimization includes VM performance management. I guess my question would be, what is uXcomm doing about marrying Service Level Automation to the rest of the data center?

As a side note, I know that I owe two more articles on my Service Level Automation Deconstructed series. I am working on the "Analyze" overview now, but have discovered some interesting technology to discuss here that I am reading up on now.

Monday, February 26, 2007

When CPU utilization is not enough...

Where have I been, you might ask?

Much to my delight, the last few weeks have been filled with customer activity, ranging from helping a Service Level Automation-enabled appliance for a major software company, to assisting the financial wing of one of the world's largest manufacturers to experience first hand the benefits of utility computing.

The latter runs an application that is highly dependent on Windows Terminal Services to deliver a client-server UI to thousands of retail outlets world-wide. Uptime is critical to this application, as customers will go elsewhere for financing if this application doesn't confirm credit within minutes of a purchase decision.

Unfortunately, WTS is also a very inefficient consumer of server payload. It is a session-based infrastructure, which means that a user will be attached to a specific server for their desktop access until they either log out or are kicked off. If 15 user sessions share a physical server, there is no way to predict the load on the system. All 15 sessions could be idle, or all could quickly start consuming cycles simultaneously.

This gives me my first really good example of when CPU and/or memory utilization are not good Service Level metrics on their own. Imagine an environment using WTS to support hundreds of users. These users use their Windows sessions to run a variety of tasks, much like any Windows user. Some tasks use a high level of CPU and memory, others very little. Quite often, the session will sit idle for several minutes.

Now, if you create lots of sessions because the CPU is idle, you could end up with problems if they all get active at the same time (say right after lunch). If you stop creating sessions on a server because CPU utilization is high, you may end up with a highly under utilized server when one user's game of Quake wraps up.

That's not to say that CPU or memory utilization aren't an important part of the Service Level "equation". The truth is, there are several metrics that apply to WTS capacity: sessions, CPU utilization, memory utilization, licenses, etc. Since determining Service Level compliance probably involves evaluating the relationships between several of these metrics at once, there will most likely be one or two compound metrics based on mathematical equations combining these "root" metrics in a way that reasonable thresholds can be set.

Another interesting observation is that this is a lot like the Java EE Service Level Automation problem. (Thanks to Luis Cuyun for pointing this out.) While most horizontally scalable application tiers can be scaled up and down as a unit (i.e. "add a node/remove any node"), app servers, hypervisors and (now) WTS all must be monitored as a unit, but managed on a per server basis (in this case, "add a node/remove this specific node"). This is because the "instances" that each of these software resources are hosting are "sticky" to a server (VMotion not withstanding), and you don't want to shut down any server when capacity is not longer needed, you want to only remove the specific servers with no live sessions remaining on them.

(Speaking of VMotion, one of the things that both Java EE and WTS will require to be really optimizable is the ability to move live services/sessions from one server to another in real-time. Anyone know of a technology addressing either of these?)

The good news for a good Service Level Automation environment (*ahem*) is that if one of these problems (Java EE, virtual servers or WTS) is solved correctly, the same basic technology can be applied to all of them. That's not to say that anyone is doing this for WTS today (to my knowledge, no one is), but I like the idea that the use cases that apply to Java EE SLA also apply to WTS SLA.

I'm hoping to have more to write about this as this pilot continues. In the meantime, anyone with Windows experience is welcome and encouraged to contribute their two cents to this discussion. In particular, are there any tried and true service level metrics for WTS that are being used out there? In general, there are so many moving parts here, that I am sure there are many critical factors to Service Level Automation of WTS that I have not covered, or even considered.