Showing posts with label service level automation. Show all posts
Showing posts with label service level automation. Show all posts

Wednesday, July 16, 2008

Watch out for Cisco, kids!

What is the most important enabler of distributed computing architectures, such as cloud oriented architectures? What is the one thing that has to be in ample supply before the other elements of the data center come into play? Is it the number of servers or CPU power available for computing? Is it the size and speed of the disks and network storage devices? Is it the distributed software architectures themselves?

My answer? None of the above. It's network bandwidth, baby, all the way.

Why? Well, let's break down where the costs of distributed systems lay. We all know that CPU capabilities double roughly every couple of years, and we also know that disk I/O slows those CPUs down, but not at the rate that network I/O typically does. When designing distributed systems, you must first be aware of network latency and control traffic between components to have any chance in heck of meeting rigorous transaction rate demands. The old rule at Forte Software, for what it's worth, was:

  • First reduce the number of messages as much as possible
  • Then reduce the size of those messages as much as possible
Increase adherence to those rules, and your software would outperform less optimized applications every time. It was easy to look like a performance tuning genius in those days.

What is exciting about today's environment, however, is that network technology is changing rapidly. Bandwidth speeds are increasing quickly (though not as fast as CPU speeds), and this high speed bandwidth is becoming more ubiquitous world wide. Inter-data-center speeds are increasingly mind boggling, and WAN optimization apparently has removed much of the fear of moving real-time traffic between geographically disparate environments.

All of this is a huge positive to cloud oriented architectures. When you design for the cloud, you want to focus on a few key things:
  • Software fluidity - The ability of the software to run cleanly in a dynamic infrastructure, where the server, switch port, storage and possibly even the IP address changes day by day or minute by minute.

  • Software optimization - Because using a cloud service costs money, whether billed by the CPU hour, the transaction or the number of servers used, you want to be sure you are getting your money's worth when leveraging the cloud. That means both optimizing the execution profile of your software, and the use of external cloud services by the same software.

  • Scalability - This is well established, but clearly your software must be able to scale to your needs. Ideally, it should scale infinitely, especially in environments with highly unpredictable usage volume (such as the Internet).

Achieving any of these in an environment where your network bandwidth is constricting your options is nearly impossible.

Oh, and one more thing. The network is the first element of your data center that sees load, failure and service level compliance. Think about it--without the eyes of the network, all of your other data center elements become black boxes (though often physically with those annoying beeps and little blinking orange lights). What are the nerves in the data center nervous system? Network cables, I would say.

Today I saw two really good posts about possible network trends driven by the cloud, and how Cisco's new workhorse leverages "virtualized" bandwidth and opens the door to commodity cloud capacity. The first is a post by Douglas Gourlay of Cisco, which simply looks at the trends that got us to where we are today, and further trends that will grease the skids for commodity clouds. I am especially interested in the following observations:
"8) IP Addressing will move to IPv6 or have IPv4 RFCs standardized that allow for a global address device/VM ID within the addressing space and a location/provider sensitive ID that will allow for workload to be moved from one provider to another without changing the client’s host stack or known IP address ‘in flight’. Here’s an example from my friend Dino.

9) This will allow workload portability between Enterprise Clouds and Service Provider Clouds.

10) The SP community will embrace this and start aggressively trying to capture as much footprint as possible so they can fill their data centers to near capacity allowing for them to have the maximum efficiency within their operation. This holds to my rule that ‘The Value of Virtualization is compounded by the number of devices virtualized’.

11) Someone will write a DNS or a DNS Coupled Workload exchange. This will allow the enterprise to effectively automate the bidding of workload allocation against some number or pool of Service Providers who are offering them the compute, storage, and network capacity at a given price. The faster and more seamless the above technologies make the shift of workload from one provider to another the simpler it is in the end for an exchange or market-based system to be the controlling authority for the distribution of workload and thus $$$’s to the provider who is most capable of processing the workload."

The possibility that IP addresses could successfully travel with their software payloads is incredibly powerful to me, and I think would change everything for both "traditional" VM users, as well as the virtual appliance world. The possibility that my host name could travel with my workload, even as it is moved in real time from one vendor to another is, of course, cloud computing nirvana. To see someone who obviously knows something about networking and networking trends spell out this possibility got my attention.

(Those who see a fatal flaw in Doug's vision are welcome to point it out in the comments section below, or on Doug's blog.)

The second post is from Hurwitz analyst, Robin Bloor, who describes in brilliant detail why Cisco's Nexus 7000 series is different, and why it could very well take over the private cloud game. As an architecture, it essentially makes the network OS the policy engine for controlling provisioning and load balancing, though with bandwidth speeds that blow away today's standards (10G today, but room for 40G and 100G standards in the future). Get to those speeds, and all of a sudden something other than network bandwidth is your restricting function in scaling a distributed application.

I have been cautiously excited about the Nexus announcement from the start. Excited because the vision of what Nexus will be is so compelling to me, for all of the reasons I describe above. (John Chambers, CEO of Cisco, communicates that vision in a video that accompanied the Nexus 5000 series launch.) Cautious, because it reeks of old-school enterprise sales mentality, with Cisco hoping to "own" whole corporate IT departments by controlling both how software runs, and what hardware and virtualization can be bought to run it on. Lock-in galore, and something the modern, open source aware corporate world may be a little uneasy about.

That being said, as Robin put it, "In summary: The network is a computer. And if you think that’s just a smart-ass bit of word play: it’s not."

Robin further explains Cisco's vision as follows:

"Cisco’s vision, which can become reality with the Nexus, is of a data center that is no longer defined by computer architecture, but by network architecture. This makes sense on many levels. Let’s list them in the hope of making it easier to understand.

  1. Networks have become so fast that in many instances it is practical to send the the data to the program, or to send the program to the data, or to send both the program and the data somewhere else to execute. Software architecture has been about keeping data and process together to satisfy performance constraints. Well Moore’s Law reduced the performance issue and Metcalfe’s Law opened up the network. All the constraints of software architecture reduced and they continue to reduce. Distributing both software and data becomes easier by the year.
  2. Software is increasingly being delivered as a service that you connect to. And if it cannot deliver the right performance characteristics in the place where it lives, you move it to a place where it can.
  3. Increasingly there is more and more intelligence being placed on the switch or on the wire. Of course Cisco has been adding intelligence to the switch for years. Those Cisco firewalls and VPNs were exactly that. But also, in the last 5 years, agentless sotware (for example some Intrusion Detection products) has become prominent. Such applications simply listen to the network and initiate action if they “don’t like what they hear”. The point is that applications don’t have to live in server blade cabinets. You can put them on switches or you could put them onto server boards that sit in a big switch cabinet. They’re very portable.
  4. The network needs an OS (or NOS). Whether Cisco has the right OS is a point for debate, but the network definitely needs an OS and the OS needs to perform the functions that Cisco’s NX-OS carries out. It also needs to do other things to like optimize and load balance all the resources in a way that corresponds to the service level needs of the important business transactions and processes it supports. Personally, I do not see how that OS can do anything but span the whole network - including the switches."
Would all applications run this way? Probably not. But those mission critical, highly distributed, performance-is-everything apps you provide for your customers, or partners, or employees, or even large data sets, are extremely good candidates for this way of thinking.

Oh, and I wouldn't be surprised if Google, Microsoft, et al. agreed (though not necessarily as Cisco customers).

Does Nexus work? I have no idea. But I am betting that, as private clouds are built, the idea that servers are the center of the universe will be tested greatly, and the incredibly important role of the network will become more and more apparent. And when it does, Cisco may have positioned themselves to take advantage of the fun that follows.

Its just too bad that it is another single-vendor, closed source vendor offering that will take probably 5-7 years (minimum) to replicate in the open source world. At the very least, I hope Cisco is paying attention to Doug's observation that:
"[T]here will be a standardization of the hypervisor ‘interface’ between the VM and the hypervisor. This will allow a VM created on Xen to move to VMWare or Hyper-V and so on."
I hope they are openly seeking to partner with OVF or another virtualization/cloud standard to ensure portability to and from Nexus.

However, I would rather have this technology in a proprietary form than not at all, so way to go Cisco, and I will be watching you closely--via the network, of course.

Thursday, June 12, 2008

"Follow the law" computing

A few days ago, Nick Carr worked his usual magic in analyzing Bill Thompson's keen observation that every element of "the cloud" eventually boils down to a physical element in a physical location with real geopolitical and legal influences. This problem was first brought to my attention in a blog post by Leslie Poston noting that the Canadian government has refused to allow public IT projects to use US-based hosting environments for fear of security breaches authorized via the Patriot Act. Nick added another example with the following:

Right before the manuscript of The Big Switch was shipped off to the printer ("manuscript" and "shipped off" are being used metaphorically here), I made one last edit, adding a paragraph about France's decision to ban government ministers from using Blackberrys since the messages sent by the popular devices are routinely stored on servers sitting in data centers in the US and the UK. "The risks of interception are real," a French intelligence official explained at the time.
I hadn't thought too much about the political consequences of the cloud since first reading Nick's book, but these stories triggered a vision that I just can't shake.

Let me explain. First, some setup...

One of the really cool visions that Bill Coleman used to talk about with respect to cloud computing was the concept of "follow the moon"; in other words, moving running applications globally over the course of an earth day to where processing power is cheapest--on the dark side of the planet. The idea was originally about operational costs in general, but these days Cassatt and others focus this vision around electricity costs.

The concept of "moving" servers around the world was greatly enhanced by the live motion technologies offered by all of the major virtualization infrastructure players (e.g. VMotion). With these technologies (as you all probably know by now), moving a server from one piece of hardware to another is as simple as clicking a button. Today, most of that convenience is limited to within a single network, but with upcoming SLAuto federation architectures and standards that inter-LAN motion will be greatly simplified over the coming years.

(It should be noted that "moving" software running on bare metal is possible, but it requires "rebooting" the server image on another physical box.)

The key piece of the puzzle is automation. Whether simple runbook-style automation (automating human-centric processes) or all-out SLAuto, automation allows for optimized decision making across hundreds, thousands or even tens of thousands of virtual machines. Today, most SLAuto is blissfully unaware of runtime cost factors, such as cost of electricity or cost of network bandwidth, but once the elementary SLAuto solutions are firmly established, this is naturally the next frontier to address.

But hold on...

As the articles I noted earlier suggest, early cloud computing users have discovered a hitch in the giddy-up: the borders and politics of the world DO matter when it comes to IT legislation.

If law will in fact have such an influence on cloud computing dynamics, it occurs to me that a new cost factor might outshine simple operations when it comes to choosing where to run systems; namely, legality itself. As businesses seek to optimize business processes to deliver the most competitive advantage at the lowest costs, it is quite likely that they will seek out ways to leverage legal loopholes around the world to get around barriers in any one country.

Now, this is just pie-in-the-sky thinking on my part, and there are 1000 holes here, but I think its worth going through the exercise of thinking this out. The problem is complicated, as there are different laws that apply to data and the processing being one on that data (as well as, in some jurisdictions, the record keeping about both the data and the processing). However, there are technical solutions available today for both data and processing that could allow a company to mix and match the geographies that give them the best legal leverage for the services they wish to offer:
  • Database Sharding/Replication

    Conceptually, the simplest way to keep from violating any one jurisdiction's data storage or privacy laws is to not put the data in the jurisdiction. This would be hard to do, if not for some really cool data base sharding frameworks being released to the community these days.

    Furthermore, replicate the data in multiple jurisdictions, but use the best-case instance of that data for processing happening in a given jurisdiction. In fact, by replicating a single data exchange into multiple jurisdictions at once, it becomes possible to move VMs from place to place without losing (read-only, at least) access to that data.

  • VMotion/LiveMotion

    From a processing perspective, once you solve legally accessing the data from each jurisdiction, you can now move your complete processing state from place to place as processing requires, without losing a beat. In fact, with networks getting as fast as they are, transfer times at the heart of the Internet may be almost as fast as on a LAN, and those times are usually measured in the low hundreds of milliseconds.

    So, run your registration process in the USA, your banking steps in Switzerland, and your gambling algorithms in the Bahamas. Or, market your child-focused alternative reality game in the US, but collect personal information exclusively on servers in Madagascar. It may still be technically illegal from a US perspective, but who do they prosecute?

Again, I know there are a million roadblocks here, but I also know both the corporate world and underworld have proven themselves determined and ingenious technologists when it comes to these kinds of problems.

As Leslie noted, our legislators must understand the economic impact of a law meant for a physical world on an online reality. As Nick noted, we seem to be treading into that mythical territory marked on maps with the words "Here Be Dragons", and the dragons are stirring.

Thursday, May 22, 2008

Cassatt Announces Active Response 5.1 with Demand Base Policies

Ken Oestreich blogged recently about the very cool, probably landmark release of Cassatt that just became available, Cassatt Active Response 5.1. He very eloquently runs down the biggest feature--demand based policies--so I won't repeat all of that here. What I thought I would do instead is relate my personal thoughts on monitoring based policies and how they are the key disruptive technology for data centers today.

To be sure, everyone is talking about server virtualization in the data center market today, and that's fine. It's core short-term benefit, physical system consolidation and increased utilization is key for cost-constrained IT departments, and features such as live motion and automatic backup are creating new opportunities that should be carefully considered. However, virtualization alone is limited in its applications, and does little to actually optimize a data center over time. (This is why VMWare is emphasizing management over just virtualizing servers these days.)

The technology that will make the long term difference is resource optimization: applying automation technologies to tuning how and when physical and virtual infrastructure is used to solve specific business needs. It is the automation software that will really change the "deploy and babysit" culture of most data centers and labs today. The new description will be more like "deploy and ignore".

To really optimize resource usage in real time, the automation software must use a combination of monitoring (aka "measure"), a policy engine or other logic system (aka "analyze") and interfaces to the control systems of the equipment and software it is managing (aka "respond"). It turns out that the "respond" part of the equation is actually pretty straight forward--lots of work, but straight forward. Just write "driver" like components that know how to talk to various data center equipment (e.g. Windows, DRAC, Cisco NX-OS, NetApp Data ONTAP, etc.), as well as handle error conditions by directly responding or forwarding the information to the policy engine.

The other two, however, require more immediate configuration by the end user. Measure and analyze, in fact, are where the entire set of Service Level Automation (SLAuto) parameters are defined and executed on. So, this is where the key user interface between the SLAuto system and end user has to happen.

What Cassatt has announced is a new user interface to define demand based policies as the end user sees fit. For example, what defines an idle server? Some systems use very little CPU while they wait for something to happen (at which point they get much busier), so simply measuring CPU isn't good enough in those cases. Ditto for memory in systems that are compute intensive but handle very little state.

What Cassatt did that is so brilliant (and so unique) is to allow the end user to leverage the full range of SNMP attributes for their OS, as well as JMX and even scripts running on the monitored system to create expressions that define an idle metric that is right for that system. For example, on a test system you may in fact say that a system is idle when the master test controller software indicates that no test is being run on that box. On another system, you may say its idle when no user accounts are currently active. Its up to you to define when to attempt to shut down a box, or reduce capacity for a scale-out application.

Even when such an "idle" system is identified, Cassatt gives you the ability to go further and write some "spot checks" to make sure they system is actually OK to shut down. For example, in the aforementioned test system, Cassatt may determine that its worth trying to power down a system, but a spot check could be run to determine if a given process is still running, or an administrator account is currently actively logged in to the box that would indicate to Cassatt that it should ignore that system for now.

I know of no one else that has this level of GUI configurable monitor/analyze/respond sophistication today. If anyone wants to challenge that, feel free. Now that I no longer work at Cassatt, I'd be happy to learn about (and write about) alternatives in the marketplace. Just remember that it has to be easy to configure and execute these policies, and scripting the policies themselves is not good enough.

It is clear from the rush to release resource optimization products for the cloud, such as RightScale, Scalr, and others, that this will be a key feature for distributed systems moving forward. In my opinion, Cassatt has launched itself into the lead spot for on premises enterprise utility computing. I can't wait to see who responds with the next great advancement.

Disclaimer: I am a Cassatt shareholder (or soon will be).

Tuesday, May 06, 2008

One advantage of utility computing infrastructure: Heisenberg's Uncertainty Principle applied to computing

I was casually browsing my Google Reader pages (which can also be followed on FriendFeed) when I came across a gem from Data Center Knowledge: apparently Peter Gabriel's web site servers were stolen from their hosting provider. All content and hardware gone, and fans left with nothing but an apology page.

Now, I'm a huge Gabriel fan, so this was interesting in part because I feel for the guy and hope nothing of great value was stored on those servers. However, my interest was peaked by the realization that this highlights one of the key values of decoupling software from hardware. To illustrate this advantage, I'd like to paraphrase Heisenberg's famous Uncertainty Principle:

In shared resource computing, you can locate the server, but you cannot firmly define what is running on the server (over time); conversely, you can define the software image, but it is difficult to firmly locate which server it is running on (over time).
Thus, if someone comes into a data center that is sharing server resources in a utility computing like model and steals a server, they will very likely get no data whatsoever. Conversely, if they want the data, they have to steal all of the storage associated with the server image, which in many environments is spread amongst several physical drives; is dependent on the network infrastructure in which it is running; and is useless without both a compatible server to execute it, and a compatible management system to deliver it to that server.

To me, this greatly enhances system security over dedicated server models. If Gabriel's stuff had been PXE booted on random servers around the hosting center, from distributed storage systems, he may have foiled his thief's plans. He certainly would have made it much more technically difficult for them.

The more I learn about decoupling software from hardware, whether through server virtualization or policy-based dynamic deployment, the more I think its a no-brainer for most computing applications. Plus, it makes SLAuto possible--which has its own benefits, of course.

Saturday, May 03, 2008

Thinking about SLAuto in a frenzied cloud

I've been quite silent for a week or two, mostly because of my responsibilities as a sales engineer; doing my part in closing key deals for my employer. I've spent this time sitting in meetings, installing and configuring software, and measuring power savings in large dev/test lab installations. (By large I mean hundreds approaching thousands of servers.) All in all, its been a successful couple of weeks, but its kept me from keeping too close an eye on the big news coming out of the cloud and utility computing markets.

However, as I thought about this more, I realized that I have drifted significantly from my core subject, Service Level Automation (or SLAuto), in the last six months or so--mostly due to the incredible burst of cloud computing innovation to be announced and/or delivered in that time frame. I still believe that there are two key components to an open cloud market that scales:

  • Portable platforms that allow customers to change vendors on a whim
  • Automation that takes action to acquire, release or replace services based on pre-determined service targets

The latter, simply said, is SLAuto.

Of course, what is happening is sort of the nascent birth of cloud computing technologies, where the DNA hasn't had a chance to recombine to build long term survivability into any given "species" yet. We all knew that AWS was doing cool things, but who knew that they would cross the chasm in terms of customer demand as completely as they did? Yet, there is no portability story for Amazon (at least not off of Amazon); and the market forming for SLAuto (see RightScale and others) is tightly tied to the Amazon platform.

The rest of the "big" announcements are worse: Microsoft has no concept of management in Live Mesh (other than synchronization) that I can see, and Google and Yahoo are both building platforms with developers in mind, where service levels are a business agreement, not a platform differentiator. I understand we are taking baby steps here, but I wonder how long it is before corporate IT realizes that they are both a) locked in (at least in an economic sense), and b) paying too much to operate software that doesn't even run in their data center.

Now, I say all of this, but truth be told, most corporate IT shops don't do SLAuto today. So, why should this change in the cloud? I hinted at it earlier: scale. Not scale of functional execution or data access, as we usually think of the term, but scale of market--the speed at which companies will need to respond to the ever evolving marketplace for cloud services and platforms. As self-professed "open" nature of Google and Yahoo's platforms become more of a reality, combined with true innovation in "industry" standard APIs (for capacity management, code platforms and feature integration), there is little doubt that pressure will be on the IT shop to optimize the cost of delivering business services to the rest of the company. Again, I argue that this cannot be done without SLAuto. Prove me wrong.

I am really concerned that SLAuto is still considered "bleeding edge" in most IT shops. Its not rocket science, and the future of IT cost management almost certainly has to be built around it. On the other hand, perhaps as some of these customers I worked with the last couple of weeks serve as references to the value of SLAuto--at least in terms of energy costs--more of them will understand its urgency.

Monday, April 07, 2008

Google announces ultimate cloud lock-in platform

I was about to write a long post about how all the big guys are starting with storage as a cloud service (based on the rumor that Google was going to announce BigTable as their first cloud service, and HP's new offering), when I took the time to watch Scoble's (unintentially) multi-part coverage [1] [2] [3] of the mysterious Google announcement (on Qik). And--just to screw with me--do they announce a data-only offering? Of course not, they announce Google App Engine.

Update: Here is a link to the official Google coverage of the announcement on YouTube.

What is Google App Engine? Well, detailed coverage is all over the web; see:

Mike Arrington (TechCrunch)
What this all means: Google App Engine is designed for developers who want to run their entire application stack, soup to nuts, on Google resources. Amazon, by contrast, offers more of an a la carte offering with which developers can pick and choose what resources they want to use.
Bob Warfield (SmoothSpan) [1] [2] [3]
However, the short-short version is it is a complete scalable and manageable runtime environment to build, test and run scalable web applications. (I don't say "highly scalable" for reasons that will be clear later.) This environment is made up of the following five core components (today):
  1. Scalable Serving Infrastructure - Basically the Google infrastructure, including everything but the Python code and web templates themselves

  2. Python Runtime - All of the infrastructure to deliver and execute your application in a distributed environment

  3. Software Development Kit - Allows you to code your application on your local system before deploying to Google.

  4. Web-based Admin Console - A web application including at least simplistic version management (including rollback), running system statistics and errors, access to the datastore (see below) and access to log files

  5. Datastore - BigTable storage (I don't know enough about BigTable yet to say more)
All of this delivered in a free (as of the beta) limited-scale package:

500MB storage
200 Megacycles CPU
10GB Bandwidth In/Out

Should be around 5 million page views a month for the average web application. This is a reasonable scale, but would not qualify as "highly scalable" in most large web properties' books.

What does this add up to, in my opinion? The ultimate cloud lock-in story. (As background, watch Scoble's first video from about 3:17-5:25.) Not a single thing in your web application will not be dependent on Google if you use this technology--not even your Python code. (For proof, check out the "includes" in the coding demo--at around 8:44 of the first video.) Everything you do will depend on a piece of Google intellectual property. You datastore is BigTable, your operations environment is Web Operations Center, etc., etc., etc.

This isn't cloud computing, its just a cool web app hosting tool. OK, I exaggerate. It is cloud, but its exactly the kind of cloud most enterprises should avoid. If you are building a web business, and this tickles your fancy, go for it. You can't beat the price, and you've got to love the feature set. If you are a Fortune 500 looking for where to launch your next CRM interface, forget it. There are safer ships to sail than this--e.g. Amazon EC2 (et. al.), Mosso, etc.; better yet, convert what you have.

If it sounds like I am being reactionary to this announcement, I suppose I am in a way. Unfortunately, I have spent a lot of time thinking about how today's high-scale business systems will move to the cloud, and I think the market needs more maturity before this can be done safely. You need flexibility of the type and architecture of your application, and which components you choose to leverage. There is no such choice with Google.

The best part of Scoble's coverage was when he talked to two developers at the end (~18:15). One (Michael Malone) notes the biggest problem is "lock-in". The woman standing next to him (Mia Culver) calls it a "proprietary platform".

I love it. There is no fooling this savvy, open source focused market. If you want to win hearts and minds, be open. When the hell are we going to get that application portability standard we've been demanding, eh?

(On a side note, the required demo for cloud application development is now to build a web app from scratch and deploy it so the audience can access it from their laptops in 5-8 minutes. Google did it tonight, and Heroku did it at the Cloud Demo Night earlier this month.)

Some more of my notes from the announcement:
Can't do:
  1. No write to file system. (Reads OK, so you can use props files, etc.)

  2. No direct web calls (instead utilizes "URL fetch" API)

  3. No threads (single thread only, but distributed across multiple systems)

  4. Python only first language, looking for input on next language to attack (must have runtime that can be "hardened")
Administration Console gives the ability to see and manipulate running app code (by version) and data

Is the identity environment for all hosted apps Google login? Is everyone comfortable with this?

The initial 10,000 beta accounts may already be gone.

Quota based, no ability to grow past above for now.

Also, no "offline processing" today, but looking into it for future. (Sounds like batch stuff, etc.)
I have an interesting experiment I wish I could get to. I want to marry Scalr, the open source Amazon EC2 automation environment with a policy-based SLAuto environment to get the ultimate in flexible, open and coding agnostic autonomic operations, both in the cloud and "at home". Anyone want to beat me to it? (Come to think of it, why is Google still hosting Scalr now that App Engine is live? Hmmmm....)

Friday, April 04, 2008

John Willis Honors Me with Inaugural Cloud Cafe Podcast

I am the inaugural guest in John Willis's Cloud Cafe podcast series. I couldn't be more honored.

Those of you who have been following this whole "what is Cloud Computing" debate may have had the opportunity to see the conversations between several bloggers regarding how to define cloud computing and related technologies. John Willis, of the John Willis ESM Blog, is making a key contribution by taking on the challenge of classifying vendors in this space. As I had some issues with his classification of Cassatt, he thought the best way to resolve that was to invite me to launch his new series.

Two things were resolved in this podcast.

First, I learned first hand what a classy guy John is. He handled the interview very well, let me talk my butt off (a talent I got from my minister mother, I think) and had several observations over the course of the conversation that showed his tremendous experience in the enterprise systems management space. I feel quite sheepish that I ever hinted that he wasn't being forthright with his audience. Lesson gratefully learned; apology gladly offered.

Second, John and I were always much closer in our visions of cloud computing, utility computing and enterprise systems than it might have appeared at first. Our conversation raged from the aforementioned "what is cloud computing" question, to topics such as:

  • the relationship between cloud and utility computing,
  • the cultural challenge facing enterprises seeking the economic returns of these technologies,
  • how cloud and utility computing revolutionize performance and capacity planning, and
  • where Hadoop and CloudDB fit into all of this.
In the end, I think John and I agreed that cloud computing is more than just virtualization on the Internet. I very much enjoyed the conversation, and I hope you will take the time to listen to this podcast.

Got questions or comments? Post them here or on John's blog; I will check both.

Finally, I will be working to get Cassatt's entry in John's classifications updated as a result of the discussion.

Friday, March 28, 2008

MapReduce reaches adolescence

I have to admit I find myself growing more impressed with the MapReduce (and related algorithm) community every day. I spend the better part of an hour watching Stu Hood of Rackspace/Mailtrust discussing MapReduce, Mailtrust's use of it for daily log processing, and comparing it to SQL. I'm a MapReduce newbie, so I was happy to find Stu's overview clear, careful and at a level I could grasp.

His overview of Hadoop (an open source implementation of a MapReduce framework) was equally enlightening, and I learned that Hadoop is more than the framework, but it includes a distributed file system as well. This is where I think SLAuto starts to become important, as it will be critical not only to monitor which systems in a Hadoop cluster are alive at any time (thus providing access to their storage), but also to correct failures by remounting disks on additional nodes, provisioning new nodes to meet increased data loads, etc. Granted, I know just enough to be dangerous here, but I would bet that I could sell the value of SLAuto in a MapReduce environment.

Another interesting overview of the MapReduce space comes from Greg Linden. (Damn, now I've mentioned Greg twice in a row...my groupie tendencies are really showing these days! -) Greg points us to notes taken at the Hadoop Summit by James Hamilton, an architect on the Windows Live Platform Services team. I haven't read through them all yet, but I like the breakdown of many of the big projects getting a lot of coverage among techies these days: Yahoo's PIG and HBase, as well as Microsoft's DRYAD. Missing is CouchDB, but I plan to watch Jan Lehnardt's talks [1][2] on that as soon as I get a moment.

Again, the reason MapReduce is being covered in a blog about Service Level Automation and utility computing is that as soon as I see "tens of thousands of nodes", I also see "no way human beings can meet the SLAs without automation". At least not without significant costs compared to automating. System provisioning, monitoring, autonomic scaling and fail-resistance are not built in to Hadoop, they are simply easy to support. Something else is needed to provide SLAuto support at the infrastructure layers.

Wednesday, February 27, 2008

Enterprise Architecture, Business Continuity and Integrating the Cloud

(Update: Throughout the original version of this post, I had misspelled Mr. Vambenepe's name. This is now corrected.)

William Vambenepe, a product architect at Oracle focusing on enterprise management of applications and middleware, pointed me to a blog by David Linthicum on IntelligentEnterprise that makes the case for why enterprise architects must plan for SaaS. In a very high level, but well reasoned post, Linthicum highlights why SaaS systems should be considered a part of enterprise architectures, not tangential to them.

As Vambenepe points out, perhaps the most interesting observation from Linthicum is the following:

Third, get in the mindset of SaaS-delivered systems being enterprise applications, knowing they have to be managed as such. In many instances, enterprise architects are in a state of denial when it comes to SaaS, despite the fact that these SaaS-delivered systems are becoming mission-critical. If you don't believe that, just see what happens if Salesforce.com has an outage.

I don't want to simply repeat Vambenepe's excellent analysis, and I absolutely agree with him. So let me just add something about SLAuto.

Take a look at Vambenepe's immediate response:
I very much agree with this view and the resulting requirements for us vendors of IT management tools.
Now add the comments from Microsoft's Gabriel Morgan that I discussed a couple of weeks ago.
Take for example Microsoft Word. Product Features such as Import/Export, Mail Merge, Rich Editing, HTML support, Charts and Graphs and Templates are the types of features that Customer 1.0 values most in a product. SaaS Products are much different because Customer 2.0 demands it. Not only must a product include traditional product features, it must also include operational features such as Configure Service, Manage Service SLA, Manage Add-On Features, Monitor Service Usage Statistics, Self-Service Incident Resolution as well.
Gabriel's point boiled down to the following equation:
Service Offering = (Product Features) + (Operational Features)
which I find to be entirely in agreement with Linthicum and Vambenepe.

As I am wont to do, let me push "Operational Features" as far as I think they can go.

In the end, what customers want from any service--software, infrastructure or otherwise--is control over the balance of quality, cost and time-to-market. Quality is measured through specific metrics, typically called service level metrics. Service level agreements (SLAs) are commitments to maintain service level metrics within commonly agreed boundaries and rules. In the end, all of these "operational features" are about allowing the end user to either

  1. define the service level metrics and/or their boundaries (e.g. define the SLA), or
  2. define how the system should respond if a metric fails to meet the SLA.

Item "2" is SLAuto.

I would argue that what you don't want is a closed loop SLAuto offering from any of your vendors. In fact, I propose right here, right now, that a standard (and, I am sure Simon Wardley would argue, open source) protocol or set of protocols for the following:
  1. Defining service level metrics (probably already exists?)
  2. Defining SLA bounds and rules (may also exist?)
  3. Defining alerts or complex events that indicate that an SLA was violated

Vendors could then use these protocols to build Operational Features that support a distributed SLAuto fabric, where the ultimate control over what to do in severe SLA violations can be controlled and managed outside of any individual provider's infrastructure, preferably at a site of the customer's choosing. This "customer advocate" SLAuto system would then coordinate with all of the customer's other business systems' individual SLAuto to become the automated enforcer of business continuity. In the end, that is the most fundamental role of IT, whether it is distributed or centralized, in any modern, information driven business.

"Nice, James," you say. "Very pretty 'pie-in-the-sky' stuff, but none of it exists today. So what are we supposed to do now?"

Implement SLAuto internally in your own data centers with your existing systems, that's what. Integrate SLAuto for SaaS as you understand the Operational Feature APIs from your vendors, and those vendors, your SLAuto vendor and/or your systems talent can develop interfaces into your own SLAuto infrastructure.

Evolve towards nirvana, don't try to reach it by taking tabs of vendor acid.

If you want more advice on how to do all of this, drop me a line (james dot urquhart at cassatt dot com) or comment below.

Wednesday, February 20, 2008

Data Goes SLAuto at Oracle

Thanks to Steve Jones, check out this presentation from David Chappell, Oracle VP and CTO of SOA, titled "Next-Generation Grid Enabled SOA". (A shorter written article can be found in at SOA Magazine's site.) Chappell outlines the work that Oracle is doing at turning the traditional model of application scalability on its head; instead of a fixed amount of database resources and scaling the applications/services horizontally, scale the database (using a cool complex adaptive systems approach) and alleviate much of the need to scale apps and services (except for CPU bound services). For someone like me, that's mind blowing.

Add to that the fact that the data management functions are relatively homogenous (though the infrastructure may not be), and aware of its resource utilization, and you can see why they are claiming a certain amount of hardware-metric based SLAuto.

(Hardware metric based SLAuto is based in measurements of hardware components, such as CPU utilization, memory utilization and so on. Software-based SLAuto usually uses business metrics such as transaction rates, active accounts, etc. to make scaling decisions.)

The catch? Well, everything must be written to use the "Data Grid" if its to take advantage of these capabilities. Legacy applications need not apply. (Could be the deal killer for David's "Not your MOM's Bus" concept.)

It seems to me that if Oracle wants this approach to catch on, it should open source a reference implementation as soon as possible. I'm not an expert at the most recent data processing approaches, but it would seem to me that Map-Reduce approaches would be complimentary to the Data Grid. However, Hadoop implementations would generally only be integrated with a data grid if there was an open source alternative. Otherwise, MySQL will continue to be the first choice. Open Source would also speed up integration between the data grid and infrastructure automation such as Cassatt and its competitors.

Dave hints at a URL for more info on the Oracle site, but I can't find it. If anyone tracks it down, I would appreciate any help I can get.

Thursday, February 07, 2008

The importance of operations to online services customers

I hadn't caught up on Gabriel Morgan's blog in a while, so I'm a week or so late in seeing his interesting post on the importance of operations features in a SaaS product offering. Gabriel works at Microsoft on the team that is looking at the Software plus Services offerings introduced by Ray Ozzie a few months ago. According to Gabriel, being a software product company, Microsoft has occasionally been slow to learn a key lesson in the online services game:

In the traditional packaged software business, product features define what a product is but Customer 2.0 expects to have direct access to operational features within the Service Offering itself.

Take for example Microsoft Word. Product Features such as Import/Export, Mail Merge, Rich Editing, HTML support, Charts and Graphs and Templates are the types of features that Customer 1.0 values most in a product. SaaS Products are much different because Customer 2.0 demands it. Not only must a product include traditional product features, it must also include operational features such as Configure Service, Manage Service SLA, Manage Add-On Features, Monitor Service Usage Statistics, Self-Service Incident Resolution as well. In traditional packaged software products, these features were either supported manually, didn't exist or were change requests to a supporting IT department.

In other words "Service Offering = (Product Features) + (Operational Features)".

Wow. What a simple way to state something I've been concerned about for some time now: as you move your enterprise into the cloud, will your service providers (be it SaaS, HaaS, PaaS or others) provide you with the tools and data you need to successfully operate your business? How will you be able to interact with both the service provider's software and personel to make sure those operations run a) according to your wishes, and b) with no negative impact on your business?

Gabriel goes on:
Guess who builds and supports these Operational Features? Your friendly neighborhood IT department in conjunction with the Operations and Service Offering product group. This raises the quality bar for your traditional IT shop.
Heck, yeah. And guess what? Should a business do something crazy--oh, say, select SaaS products from more than one vendor to integrate into their varied business processes--they will need not only to build solid operational ties with each vendor, but integrate those operational features across vendors. Think about that.

How best to do that? You shouldn't be surprised when I tell you that a key element of the solution is SLAuto under the control of the business. Managing SaaS systems to business-defined service levels will be a critical role of IT in the cloud-scape of tomorrow.

Wednesday, February 06, 2008

Cloud computing heats up

Today's reading has been especially interesting, as it has become clear that a) "cloud computing" is a concept that more and more IT people are beginning to understand and dissect, and b) there is the corresponding denial that comes with any disruptive change. Let me walk you through my reading to demonstrate.

I always start with Nick Carr, and today he did not disappoint. It seems that IBM has posited that a single (distributed) computer could be built that could run the entire Internet, and expand as needed to meet demand. Of course, this would require the use of Blue Gene, an IBM technology, but man does it feed right into Nick's vision of the World Wide Computer. To Nick's credit, he seems skeptical--I know I am. However, it is a worthy thought experiment to think how one would design distributed computing to be more efficient if one had control over the entire architecture from chip to system software. (Er, come to think of it, I could imagine Apple designing a compute cloud...)

I then came across an interesting breakdown of cloud computing by John M Willis, who appears to contribute to redmonk. He breaks down the cloud according to "capacity-on-demand" options, and is one of the few to include a "turn your own capacity into a utility" component. Unfortunately, he needs a little education of these particular options, but I did my best to set him straight. (I appreciate his kind response to my comment.) If you are trying to understand how to break down the "capacity-on-demand" market, this post (along with the comments) is an excellent starting place.

Next on the list was a GigaOm post by Nitin Borwankar stating his concept of "Data Property Rights" and expressing some skepticism about the "data portability" movement. At first I was concerned that he was going to make an argument reinforced certain cloud lock-in principles, but he actually makes a lot of sense. I still want to see Data Portability as an element of his basic rights list, but he is correct when he says if the other elements are handled correctly, data portability will be a largely moot issue (though I would argue it remains a "last resort" property right).

Dana Blankenhorn at ZDNet/open-source covers a concept being put forth by Etelos, a company I find difficult to describe, but that seems to be an "application-on-demand" company (interesting concept). "Opportunity computing", as described by Etelos CEO Danny Kolke describes the complete set of software and infrastructure required to meet a market opportunity on a moments notice. “Opportunity computing is really a superset of utility computing,” Kolke notes. Blankenhorn adds,


"It’s when you look at the tools Kolke is talking about that you begin to get the picture. He’s combining advertising, applications, the cash register, and all the relationships which go into those elements in his model. "

In other words, it seems like prebuilt ecommerce, CRM and other applications that can quickly be customized and deployed as needed, to the hosting solution of your choice. My experience with this kind of thing is that it is impossible to satisfy all of the people, all of the time, but I'm fascinated by the concept. Sort of Platform as a Service with a twist.

Finally, the denial. The blog "pupwhines" remains true to its name as its author whimpers about how Nick "has figured out that companies can write their own code and then run it in an outsourced data center." Those of you that have been following utility/cloud computing know that this misses the point entirely. Its not outsourcing capacity that is new, but its the way it is outsourced--no contracts for labor, no work-order charges for capacity changes, etc. In other words, just pay for the compute time.

With SLAuto, it gets even more interesting as you would just tell the cloud "run this software at these service levels", and the who, what, where and how would be completely hidden from you. To equate that with the old IBM/Accenture/{Insert Indian company here} mode of outsourcing is like comparing your electric utility to renting generators from your neighbors. (OK, not a great analogy, but you get the picture.)

Another interesting data point for measuring the booming interest in utility and cloud computing is the fact that my Google Alerts emails for both terms have grown from one or two links a day, to five or more links each and every day. People are talking about this stuff because the economics are so compelling its impossible not to. Just remember to think before you jump on in.

Tuesday, January 29, 2008

One Step To Prepare For Cloud Computing

Some of you may be wondering why I am making such a big stink about software architecture on a blog about service level automation (SLAuto). Well, as Todd Biske points out, "the relationships (and potentially collisions) between the worlds of enterprise system management, business process management, web service management, business activity monitoring, and business intelligence" are easier to resolve if the appropriate access to metrics is provided for a software service. For SLAuto, this means the more feedback you can provide from the service, process, data and infrastructure levels of your software architecture, the easier it is to automate service level compliance.

Let's look at a few examples for each level:

  • Service/Application: From the end user's perspective, this is what service levels are all about. Key metrics such as transaction rates (how many orders/hour, etc.), response times, error rates, and availability are what the end users of a service (e.g. consumers, business stakeholders, etc.) really care about.
  • Business Process: Business process metrics can warn the SLAuto environment about cross-service issues, business rule violations or other extraordinary conditions in the process cycle that would warrant capacity changes at the BPM or service levels.
  • Data Storage/Management: Primarily, this layer can inform the SLAuto system about storage needs and storage provisioning, which in turn is critical to automated deployment of applications into a dynamic environment.
  • Infrastructure: This is the most common form of metric used to make SLAuto decisions today. Such metrics as CPU utilization, memory utilization and I/O rates are commonly used in both virtualized and non-virtualized automated environments.

As noted, digital measurement of these data points can feed an SLAuto policy engine to trigger capacity adjustment, failure recover or other applicable actions as necessary to remain within defined service thresholds. While most of the technology required to support SLAuto is available, the truth is that the monitoring/metrics side of things is the most uncharted territory. As an action item, I ask all of you to take Todd's words of wisdom into account, and design not only for functionality, but also manageability. This will aid you greatly in the quest to build fluid systems that can best take advantage of utility infrastructure today.

Sunday, January 20, 2008

Evidence of pending doom and imminent salvation...

Two news articles that occurred as soon as I went offline for the birth of my daughter provide increasing evidence of the importance of service level automation and image portability between vendors:

  • Joyent, one of the most ambitious new "capacity on demand" managed hosting services, has experienced a multi-day outage that has affected two of their prime storage services. No failover path was available to users of the services, and there is no mention of functionality or services to assist customers with moving--temporarily or permanently--to another vendor's service. Odds are high that some of these customers have lost access to key data, or are flying without substantial backups to key systems. Any decision to move to a different servers (like Twitter will according to the post) is on the customer's own dime.

    A prime example of the dangers of vendor lock-in that Simon and I have been warning you about...


  • Oracle has announced its intention to build and sell "Grid 2.0" technology that will target--yes, you heard right--service level automation. Welcome to the SLAuto game boys. I hope you're ready to talk standards for image and policy portability; as well as policy platform interoperability. Otherwise, you're just creating a new DB grid "silo", and not helping anyone in the long run. Please, feel free to educate me if you think otherwise...

These events show the caution that users of cloud services must employ. Be ready to take on increased integration responsibilities as you deploy more and more elements of your datacenter to the cloud, automate more of the management of those elements, and find the product landscape one in which there (still) is no silver bullet. You may not be writing apps, but you sure as heck will be writing the orchestration that will tie the apps you employ into a cohesive business process ecosystem. You may also find yourself writing backup integration again, just in case you experience "Joyent 2.0"...

Wednesday, December 12, 2007

Software fluidity and system security

I came across this fascinating conversation tonight between Rich Miller (whom I've exchanged blogs with before) and Greg Ness regarding the intense relationship between network integrity, system security and VM-based software fluidity. What caught my attention about this conversation is the depth at which Rich, Greg and others have thought about this branch of system security--what Greg refers to as Virtsec. I learned a lot by reading both authors' posts.

I know nothing about network security or virtualization security, frankly, but I know a little about network virtualization and the issues that users have in replicating secure computing architectures across pooled computing resources. Rich makes a comment in the discussion that I want to comment on from a purely philosophical point of view:

Consider this: It's not only network security, but also network integrity that must be maintained when supporting the group migration of VMs. If one wants to move an N-tier application using VMware's VMotion, one wants a management framework that permits movement only when the requirements of the VM "flock" making up the application are met by the network that underpins the secondary (destination) location. By that, I mean:

  • First, the assemblage of VMs need to arrive intact.

    If, because of a change in the underpinning network, a migration "flight plan" no longer results in a successful move by all the piece parts, that's trouble. If disaster strikes, you don't want to find that out when invoking the data center's business continuity procedure. All the VMs that take off from the primary location, need to land at the secondary.


  • Second, the assemblage's internal connections as well as connections external to the "flock" must continue to be as resilient in their new location as they were in their original home.

    If the use of VMotion for an N-tier application results in the a new instance of the application that ostensibly runs as intended, but is susceptible to an undetected, single point of network failure in its new environment, someone in the IT group's network management team will be looking for a new job.
Here is exactly where I believe application architectures are suddenly critical to the problem of software fluidity. In a well contained multi-tier application (a very turn-of-the-millennium concept) it is valid to consider the migration of the "flock" as a network integrity problem. However, when it comes to the modern world of SOA, BPM and application virtualization, suddenly application integrity becomes a dynamic discovery issue which is only partly dependent on network access.

In other words, I believe most modern enterprise software systems can't rely on the "infrastructure" to keep their components running when they are moved around the cloud. Its not good enough to say "gee, if I get these running on one set of VMs, I shouldn't have to worry about what happens if those VMs get moved". Rich hints strongly at understanding this, so I don't mean to accuse him of "not getting it". However, I wonder what Replicate Technologies is prepared to tell their clients about how they need to review their application architectures to work in such a highly dynamic environment. I'd love to hear it from Rich.

Also, from Greg, I'd be interested in knowing if he's thought beyond the effects on network security of virtsec to the effects on application security. At the very least, I think an increasing dependency on dynamic discovery of required resources (e.g. services, data sources, EAI etc.) means an increased need for virtsec to be application aware as well as network aware. I apologize if I'm missing a virtsec 101 concept here, as I haven't yet read all that Greg has written about the subject, but I'm disturbed that the little I've read so far seems to assume that VMs can be managed purely as servers, a common mistake when considering end-to-end Service Level Automation (SLAuto) needs.

I intend to keep one eye on this discussion, as virtsec is clearly a key element of software fluidity in a SOA/BPM/VM world. It is even more critical in a true utility/cloud computing world, where your software would ideally move undetected between capacity offered by entirely disparate vendors. Its no good chasing the cheapest capacity in the world if it costs you your reputation as a security conscious IT organization.

By the way, Rich, I dropped the virtual football after your last post in our earlier conversation...I just found a blog entry I wrote about 6 months ago during that early discussion about VM portability standards that I never posted. Aaaargh... My apologies, because I liked what you were saying, though I was concerned about non-VM portability and application awareness at the time as well. I continue to follow your work.

Wednesday, December 05, 2007

How fluid is your software?

I come from a software development background, and I can never quite get the itch to build the perfect architecture out of my system. That's partly why it is so hard for me to blog about power, even though it is an absolutely legitimate topic, and a problem that needs to be attacked from many fronts. However, power is not a software issue, it is a hardware and facilities issue, and my heart just isn't there when it comes to pontificating.

All is not lost, however, as Cassatt still plays in the utility computing infrastructure world, and
I get plenty of exposure to dynamic provisioning, service level automation (SLAuto) and the future of capacity on demand. And to that end, I've been giving a lot of thought to the question of what, if any, software architecture decisions should be made with utility computing in mind.

While a good infrastructure platform won't require wholesale changes to your software architecture (and none at all, if you are willing to live with consequences that will become obvious later in this discussion), the very concept of making software mobile--in the changing capacity sense, not the wireless device sense--must lead all software engineers and architects to contemplate what happens when their applications are moved from one server to another, or even one capacity provider to another. There are a myriad of issues to be considered, and I aim to cover just a few of them here.

The term I want to introduce to describe the ability of application components to migrate easily is "fluidity". The definition of the term fluidity includes "the ability of a substance to flow", and I don't think its much of a stretch to apply the term to software deployments. We talk about static and dynamic deployments today, and a fluid software system is simply one that can be moved easily without breaking the functionality of the system.

An ideally fluid system, in my opinion, would be one that could be moved in its entirety or in pieces from one provider to another without interruption. As far as I know, nobody does this. (3TERA claims they can move a data center, but as I understand it you must stop the applications to execute the move.) However, for purely academic reasons, let's analyze what it would take to do this:


  1. Software must be decoupled from physical hardware. There are currently two ways to do this that I know of:
    • Run the application on a virtual server platform
    • Boot the file system dynamically (via PXE or similar) on the appropriate capacity
  2. Software must loosely coupled from "external" dependencies. This means all software must be deployable without hard coded reference to "external" systems on which it is dependent. External systems could be other software processes on the same box, but the most critical elements to manage here are software processes running on other servers, such as services, data conduits, BPMs, etc.
  3. Software must always be able to find "external" dependencies. Loose coupling, as most of you know, is sometimes easier said than done, especially in a networked environment. Critical here is that the software can locate, access and negotiate communication with the external dependencies. Service registries, DNS and CMDB systems are all tools that can be used to help systems maintain or reestablish contact with "external" dependencies.
  4. System management and monitoring must "travel" with the software. Its not appropriate for a fluid environment to become a Schrodinger's box, where the state of the system becomes unknown until you can reestablish measurement of its function. I think this may be one of the hardest requirements to meet, but at first blush I see two approaches:
    • Keep management systems aware of how to locate and monitor systems as they move from one system to another.
    • Allow monitoring systems to dynamically "rediscover" systems when they move, if necessary. (Systems maintaining the same IP address, for instance, may not need to be rediscovered.)

This is just a really rough first cut of this stuff, but I wanted to put this out there partly to keep writing, and partly to get feedback from those of you with insights (or "incites") into the concept of software fluidity.

In future posts I'll try to cover what products, services and (perhaps most importantly) standards are important to software fluidity today. I also want to explore whether "standard" SOA and BPM architectures actually allow for fluidity. I suspect they generally do, but I would not be suprised to find some interesting implications when moving from static SOA to fluid SOA, for instance.

Respond. Let me know what you think.

Tuesday, November 06, 2007