Friday, September 28, 2007

The IT Power Divide

The electric grid and the computing grid (RoughType: Nicholas Carr): Nicholas describes the incredible disconnect between IT's perception of power as an issue
...[O]nly 12% of respondents believe that the energy efficiency of IT equipment is a critical purchasing criterion.
and the actual scale of the issue in reality
...[A] journeyman researcher named David Sarokin has taken a crack at estimating the overall amount of energy required to power the country's computing grid...[which] amounts to about 350 billion kWh a year, representing a whopping 9.4% of total US electricity consumption.
Amen, brother. In fact, the reason you haven't heard from me as often in the last two to three weeks is that I have been steadfastly attending a variety of conferences and customer prospect meetings discussing Active Power Management and SLAuto. What I've learned is that there are deep divides between the IT and facility views of electrical efficiency:
  • IT doesn't see the electric bill, so they think power is mostly an upfront cost issue (building a data center with enough power to handle eventual needs) and an ongoing capacity issue (figuring out how to divide power capacity among competing needs). However, their bottom line remains meeting the service needs of the business.

  • Facilities doesn't see the business's constantly changing need for information technology, and sees electricity mostly as an upfront capacity issue (determining how much power to deliver to the data center based on square footage and proposed kW/sq ft) and an ongoing cost issue (managing the monthly electric bill). The bottom line in this case is value, not business revenue.

Thus, IT believes that once they get a 1 MW data center, they should figure out how to use that 1 MW efficiently--not how to squeeze efficiencies out of the equipment so it runs at some number measurably below 1 MW. Meanwhile, facilities gets excited about any technology that reduces overall power consumption and maintains excess power capacity, but lacks the insight into which approaches can be taken without impacting the business's bottom line.

With an SLAuto approach to managing power for data centers, both organizations can be satisfied--if they would only take the time to listen to each other's needs. IT gets a technical approach that minimizes (or entirely eliminates) the effect on system productivity, while facilities sees a more "optimal" power bill every month. Furthermore, facilities can finally integrate IT into the demand curtailment programs offered by their local power utilities, which can generate significant additional rebates for the company.
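To make that concrete, here is a minimal sketch (in Python, with entirely hypothetical server names and fields, not any particular product's API) of how a policy engine might respond to a utility demand-curtailment event by parking only the capacity that policies say is safe to idle:

```python
# Hypothetical sketch: responding to a utility demand-curtailment event by
# parking low-priority, SLA-safe servers until the requested load is shed.

from dataclasses import dataclass

@dataclass
class Server:
    name: str
    draw_kw: float          # measured power draw
    priority: int           # 1 = business critical ... 5 = expendable
    sla_safe_to_park: bool  # true if policies say idling it won't hurt service levels

def curtail(servers: list[Server], target_kw: float) -> list[str]:
    """Return the names of servers to park, lowest priority first."""
    shed = 0.0
    parked = []
    for s in sorted(servers, key=lambda s: -s.priority):
        if shed >= target_kw:
            break
        if s.sla_safe_to_park:
            parked.append(s.name)
            shed += s.draw_kw
    return parked

# Example: the utility asks for 50 kW of curtailment.
fleet = [Server("build-07", 0.4, 5, True), Server("erp-db-01", 0.6, 1, False)]
print(curtail(fleet, 50.0))  # only the expendable, SLA-safe server is parked
```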

Let me know what you think here. Am I off base? Do you speak regularly with your facilities/IT counterpart, and actively search for ways to reduce the cost of electricity while meeting service demand?

Monday, September 24, 2007

Service-Oriented Everything...

Agility Principle: Service-Oriented Network Architecture (eBiz: Mark Milinkovich, Director, Service-Oriented Network Architecture, Cisco Systems): Cisco is touting the network as the center of the universe again, but this article is pretty close to the truth about the software and infrastructure architectures we are moving toward. Most importantly, Mark points out that there is a three-layer stack that actually binds applications to infrastructure:
  • Applications layer - includes all software used for business purposes (e.g., enterprise resource planning) or collaboration (e.g., conferencing). As Web-based applications rely on the Extensible Markup Language (XML) schema and become tightly interwoven with routed messages, they become capable of supporting greater collaboration and more effective communications across an integrated networked environment.

  • Integrated network services layer - optimizes communications between applications and services by taking advantage of distributed network functions such as continuous data protection, multiprotocol message routing, embedded QoS, I/O virtualization, server load balancing, SSL VPN, identity, location and IPv6-based services. Consider how security can be enhanced with the integrated network services layer. These intelligent, network-centric services can be used by the application layer through either transparent or exposed interfaces presented by the network.

  • Network systems layer - supports a wide range of places in the network such as branch, campus and data center with a broad suite of collaborative connectivity functions, including peer-to-peer, client-to-server and storage-to-storage connectivity. Building on this resilient and secure platform provides an enterprise with the infrastructure on which services and applications can reliably and predictably ride.
Of course, he's missing a key layer:
Physical infrastructure layer - represents the body of physical (and possibly virtual) infrastructure components that support the applications, network services and network systems, not to mention the storage environment, management environment and, yes, Service Level Automation (SLAuto) environment.
It is important to note that, while the network may be becoming a computer in its own right, it still requires physical infrastructure to run. And all of the various application, integrated network, and network systems services that Mark mentions not only depend on this infrastructure, but can actually be loosely coupled to the physical layer in a way that augments the agility of all four layers.

For example, imagine a world where your software provisioning is completely decoupled from your hardware provisioning. In other words, adding an application to your production data center doesn't require you to predict exactly what load the application is going to add to the network, server or storage capacity. Rather, you simply load the application into the SLAuto engine, let traffic start to arrive, measure the stress on existing capacity, and order additional hardware as required. Or, better yet, order hardware at the end of a quarter based on trend analysis from the previous quarter. No need for the software teams and the hardware teams to even talk to each other.
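As a rough illustration of that trend-based ordering (a sketch only, assuming a simple linear fit and hypothetical numbers, not how any particular SLAuto engine does it), the math is not much more than this:

```python
# Hypothetical sketch: ordering hardware from last quarter's utilization trend
# instead of guessing per-application load up front.

def servers_to_order(weekly_peak_util: list[float], capacity_servers: int,
                     headroom: float = 0.2) -> int:
    """Fit a simple linear trend to weekly peak utilization (0.0-1.0) and
    estimate how many servers keep next quarter under (1 - headroom) utilization."""
    n = len(weekly_peak_util)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(weekly_peak_util) / n
    slope = sum((x - x_mean) * (y - y_mean)
                for x, y in zip(xs, weekly_peak_util)) / sum((x - x_mean) ** 2 for x in xs)
    projected = y_mean + slope * (n + 13 - x_mean)   # project ~13 weeks ahead
    needed = projected * capacity_servers / (1 - headroom)
    return max(0, round(needed) - capacity_servers)

# Example: 13 weeks of peak utilization on a 100-server pool.
history = [0.55, 0.56, 0.58, 0.60, 0.61, 0.63, 0.64, 0.66,
           0.68, 0.69, 0.71, 0.72, 0.74]
print(servers_to_order(history, 100))  # suggests roughly 20 additional servers
```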

I will admit that it is unlikely that many IT departments will ever get to that "pie-in-the-sky" scenario--for some, the risk of under-provisioning capacity outweighs the cost of predicting short- to medium-term load. However, SLAuto allows you to get past the problems of siloed systems, such as "hitting the ceiling" on allocated capacity. Even if the SLAuto environment runs out of excess physical capacity, it can borrow the capacity it needs for high-priority systems from lower-priority applications.
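Here is a minimal sketch of that borrowing logic, with hypothetical application names and a made-up priority scheme (1 = most critical), just to show the idea:

```python
# Hypothetical sketch: when the free pool is empty, borrow capacity for a
# high-priority application from the lowest-priority applications running.

def borrow_capacity(allocations: dict[str, dict], requester: str, needed: int) -> dict[str, int]:
    """allocations maps app -> {"priority": int, "servers": int}.
    Returns how many servers to reclaim from each lower-priority app."""
    req_priority = allocations[requester]["priority"]
    donors = sorted(
        (a for a in allocations if allocations[a]["priority"] > req_priority),
        key=lambda a: -allocations[a]["priority"])  # least critical donates first
    plan = {}
    for app in donors:
        if needed <= 0:
            break
        take = min(needed, allocations[app]["servers"])
        if take:
            plan[app] = take
            needed -= take
    return plan

apps = {
    "order-entry": {"priority": 1, "servers": 20},
    "batch-reporting": {"priority": 4, "servers": 10},
    "dev-sandbox": {"priority": 5, "servers": 8},
}
print(borrow_capacity(apps, "order-entry", 12))  # 8 from dev-sandbox, 4 from batch-reporting
```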

The best part is that, since the SLAuto environment tracks every action it takes, there are easy ways to get reports showing everything from capacity utilization trend analysis to cost of infrastructure for a given application.

Back to Mark's article, though. It is good to see some consensus in the industry on where we are moving, even if each vendor is trying to spin it as if they are the heart of the new platform. In the end though, if the network is indeed the computer, the network and the data center will need operating systems. Mark has entire sections dedicated to designing for application awareness (this is where most data center automation technologies fall woefully short), and designing for virtualization (including all aspects of infrastructure virtualization). He is right on the money here, but there needs to be something that coordinates the utilization of all of these virtualized resources. This is where SLAuto comes in.

Most importantly, don't forget to integrate SLAuto into all four layers. Make sure that each higher layer talks to the layers below it in a way that decouples the higher layer from the lower one. Make sure that each lower layer uses that information to determine what adjustments it needs to make (including, possibly, passing the information to an even lower layer). And make sure your physical infrastructure layer is supported by an automation environment that can adjust capacity usage quickly and painlessly as applications, services and networks demand.
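To illustrate the kind of decoupling I mean, here is a toy sketch (hypothetical class names, not any vendor's API) in which each layer only knows the narrow "adjust" interface of the layer directly below it, so a demand signal can cascade all the way down to physical infrastructure:

```python
# Hypothetical sketch: each layer exposes a narrow "adjust" interface to the
# layer above it, so demand signals cascade down without tight coupling.

from abc import ABC, abstractmethod

class Layer(ABC):
    def __init__(self, below: "Layer | None" = None):
        self.below = below

    @abstractmethod
    def adjust(self, demand: dict) -> None:
        """React to a demand signal, then pass along whatever the lower layer needs."""

class PhysicalInfrastructure(Layer):
    def adjust(self, demand: dict) -> None:
        print(f"powering {demand.get('servers', 0)} servers on/off")

class NetworkSystems(Layer):
    def adjust(self, demand: dict) -> None:
        print("re-balancing connectivity for", demand.get("sites", []))
        if self.below:
            self.below.adjust({"servers": demand.get("servers", 0)})

class IntegratedNetworkServices(Layer):
    def adjust(self, demand: dict) -> None:
        print("updating QoS/load-balancing policy")
        if self.below:
            self.below.adjust(demand)

class Applications(Layer):
    def adjust(self, demand: dict) -> None:
        if self.below:
            self.below.adjust(demand)

# Wire the stack top-down; each layer only knows the layer directly below it.
stack = Applications(IntegratedNetworkServices(NetworkSystems(PhysicalInfrastructure())))
stack.adjust({"servers": 4, "sites": ["campus", "data center"]})
```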

As you prepare your service-oriented architecture of the future, don't forget the operational aspects. We are on the brink of an automated computing world that will change the cost of IT forever. However, it will only work for you if you take into account all of the components involved in meeting service levels and operational levels.

Monday, September 10, 2007

Links - 09/10/2007

Brave New World (Isabel Wang): I can't begin to express how sorry I am to see Isabel Wang leave the discussion, as her voice has been one of the clearest expressions of the challenges before the MSP community. However, I understand her need to go where her heart takes her, and I wish her the best of luck in all of her endeavors.

(Let me also offer my condolences to Isabel and the entire 3TERA community for the loss of their leader and visionary, Vlad Miloushev. His understanding of the utility computing opportunity for MSPs will also be missed.)

MTBF: Fear and Loathing in the Datacenter (Aloof Architecture: Aloof Schipperke): Aloof discusses his mixed feelings about my earlier post on changing the mindset around power cycling servers. I understand his fears, and hear his concerns; MTBF (or more to the point, MTTF) isn't a great indicator of actual service experience. However, even by conservative standards, the quality and reliability of server components has improved vastly in the last decade. Does that mean perfection? Nope. But as Aloof notes, our bad experiences get ingrained in the culture, so we overcompensate.

CIOs Uncensored: Whither The Role Of The CIO? (InformationWeek: John Sloat): Nice generality, Bob! Seriously, does he really expect that *every* IT organization will shed its data centers for service providers? What about defense? Banking? Financial markets? While I believe that most IT shops are going to move to a general contractor/architect role, I think the market for enterprise data centers is still big enough that the ecosystem supporting them will go on for years to come.

That being said, most of you out there should look at your own future with a service-oriented computing (SOC?) world in mind.

Thursday, September 06, 2007

Fear and the Right Thing

An interesting thing about diving into the Active Power Management game is the incredible amount of FUD surrounding the simple act of turning a computer off. Considering the fact that:

  • server components manufactured in the last several years have impressive MTBF values (measured in hundreds of thousands or even millions of hours),
  • the servers you will buy starting soon will all turn off pieces of themselves automagically, and
  • no one thinks twice about turning off laptop or desktop computers or their components,

the myth lives on that turning servers off and on is bad.

I understand where this concern comes from. Older disk drives were notoriously susceptible to problems with spin-up and spin-down. Don't get me started on power supplies in the late eighties and early nineties. My first job was as a sys admin for a small college, where I regularly bought commodity 386 chassis power supplies and Seagate ST-220 disk drives. Even in the mid-nineties, older servers would all too frequently die mysteriously while we were restarting the system for an OS upgrade or after a system move.

Add to this the fact that enterprise computing went through its (first?) mainframe stage, where powering things off was contrary to the goal of using the machine as much as possible, and you get a cultural mentality in IT that uptime is king, even if system resources sit idle for long periods.

These days, though, the story has greatly changed. As Vinay documented, Cassatt starts and stops all of its QA and build servers every day. In over 18,826 power cycles, not a single system failed. In my interactions at customer sites, there have been zero failures in thousands of power cycles. Granted, that's not a scientific study, but it goes to the point that unexpected component failure is not a common occurrence during power cycles any more.

Of course, I'm looking for hard data about the effect of power cycling nodes to supplement Vinay's data and support my own anecdotal experience. If you have hard data about the effect of power cycling on system reliability, I would love to hear from you.

For the rest of us, let's pay attention to the realities of our equipment capabilities, and look at the real possibility that powering off a server is often the right thing to do.

Tuesday, September 04, 2007

An easy way to get started with SLAuto

It's been an interesting week, leading up to the Labor Day weekend, but as of this morning I get to talk more openly about one project that has been taking a great deal of my time. As I have blogged about Service Level Automation ("SLAuto"), it may have dawned on some of you that achieving nirvana here means changing a lot about your current architecture and practices.

For example, decoupling software from hardware is easy to say, but requires significant planning and execution to implement (though this can be simplified somewhat with the right platform). Building the correct monitors, policies and interfaces is also time intensive work that requires the correct platform for success. However, as noted before, the biggest barriers to implementing SLAuto and utility computing are cultural.

There is an opportunity out there right now to introduce SLAuto without all of the trappings of utility computing, especially the difficult decoupling of software from hardware. It is an opportunity that Silicon Valley is going ga-ga over, and it is a real problem with real dollar costs for every data center on the planet.

The opportunity is energy consumption management, aka the "green data center".

Rather than pitch Cassatt's solution directly, I prefer to talk about the technical opportunity as a whole. So let's evaluate what is going on in the "GDC" space these days. As I see it, there are three basic technical approaches to "green" right now:
  1. More efficient equipment, e.g. more power efficient chips, server architectures, power distribution systems, etc.
  2. More efficient cooling, e.g. hot/cold aisles, liquid cooling, outside air systems, etc.
  3. Consolidation, e.g. virtualization, mainframes, etc.

Still, there is something obviously missing here: no matter which of these technologies you consider, not one of them actually turns off unused capacity. In other words, while everyone is working to build a better light bulb or to design your lighting so you need fewer bulbs, no one is turning off the lights when no one is in the room.

That's where SLAuto comes in. I contend that there are huge tracts of computing in any large enterprise where compute capacity runs idle for extended periods. Desktop systems are certainly one of the biggest offenders, as are grid computing environments that are not pushed to maximum capacity at all times. However, possibly the biggest offender in any organization that does in-house development, extensive packaged-system customization or business system integration is the dev/test environment.

Imagine such a lab where capacity that will be unused each evening and weekend, or for all but two weeks of a typical development cycle, or at all times except when testing a patch to a three-year-old rev of a product, was shut down until needed. Turned off. Non-operational. Idle, but not idling.

Of course, most lab administrators probably feel extremely uncomfortable with this proposition. How are you going to do this without affecting developer/QA productivity? How do you know it's OK to turn off a system? Why would my engineers even consider allowing their systems to be managed this way?

SLAuto addresses these concerns by simply applying intelligence to power management. A policy-based approach means a server can be scheduled for shutdown each evening (say, at 7 PM), but be evaluated before shutdown against a set of policies that determine whether it is actually OK to complete the shutdown.

Some example policies might be:

  • Are certain processes running that indicate a development/build/test task is still underway?
  • Is a specific user account logged in to the system right now?
  • Has disk activity been extremely low for the last four hours?
  • Did the owner of the server or one of his/her designated colleagues "opt-out" of the scheduled shutdown for that evening?

Once these policies are evaluated, we can see if the server meets the criteria to be shut down as requested. If not, keep it running. Such a system also needs to provide interfaces for both the data center administrators and the individual server owners/users to control the power state of their systems at all times, set policies, and monitor power activities for managed servers.
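For illustration, a bare-bones sketch of that evaluation step might look like the following (hypothetical process names, user accounts and thresholds; every check must pass before the 7 PM shutdown proceeds):

```python
# Hypothetical sketch: evaluate a server's scheduled shutdown against the
# kinds of policies listed above. Any failed check keeps the server running.

from dataclasses import dataclass, field

@dataclass
class ServerState:
    running_processes: set[str] = field(default_factory=set)
    logged_in_users: set[str] = field(default_factory=set)
    disk_io_last_4h_mb: float = 0.0
    owner_opted_out: bool = False

BLOCKING_PROCESSES = {"make", "nightly_build", "regression_suite"}
BLOCKING_USERS = {"qa_lead", "build_bot"}
DISK_IDLE_THRESHOLD_MB = 50.0

def ok_to_shut_down(state: ServerState) -> bool:
    if state.owner_opted_out:                              # opt-out for tonight
        return False
    if state.running_processes & BLOCKING_PROCESSES:       # build/test still underway
        return False
    if state.logged_in_users & BLOCKING_USERS:             # specific users logged in
        return False
    if state.disk_io_last_4h_mb > DISK_IDLE_THRESHOLD_MB:  # recent disk activity
        return False
    return True

# Example: a quiet build server with no opt-out gets shut down tonight.
print(ok_to_shut_down(ServerState(running_processes={"sshd"}, disk_io_last_4h_mb=3.2)))
```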

I'll talk more about this in the coming week, but I welcome your input. Would you shut down servers in your lab? Your grid environment? Your production environment? What are your concerns with this approach? What policies come to mind that would be simple and/or difficult to implement?