Thursday, September 06, 2007

Fear and the Right Thing

An interesting thing about diving into the Active Power Management game is the incredible amount of FUD surrounding the simple act of turning a computer off. Considering the fact that:

  • server components manufactured in the last several years have very high MTBF values (measured in hundreds of thousands or even millions of hours),
  • the servers you will buy starting soon will all turn off pieces of themselves automagically, and
  • no one thinks twice about turning off laptop or desktop computers or their components,

the myth lives on that turning servers off and on is bad.
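To put those MTBF figures in perspective, here is a rough back-of-the-envelope sketch that converts an MTBF into an approximate annualized failure rate. It assumes a constant failure rate (an exponential lifetime model) and a component powered on 24x7. The two MTBF values are illustrative picks from the "hundreds of thousands or even millions of hours" range, not numbers from any particular data sheet.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours, assuming the component runs 24x7


def annualized_failure_rate(mtbf_hours: float) -> float:
    """Approximate chance that a component fails within one year.

    Assumes a constant failure rate, so time to failure is exponentially
    distributed with mean `mtbf_hours`. Real hardware only roughly
    follows this model.
    """
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


# Illustrative MTBF values only (assumptions, not vendor figures).
for mtbf in (300_000, 1_000_000):
    afr = annualized_failure_rate(mtbf)
    print(f"MTBF {mtbf:>9,} h -> roughly {afr:.2%} chance of failure per year")
```

Under these assumptions, a million-hour MTBF works out to a bit under a 1% chance of failure in a year of continuous operation. MTBF describes steady-state operation and says nothing directly about power cycling, which is why field data on power cycles matters too.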

I understand where this concern comes from. Older disk drives were notoriously susceptible to problems with spin-up and spin-down. Don't get me started on power supplies in the late eighties and early nineties. My first job was as a sys admin for a small college, where I regularly bought commodity 386 chassis power supplies and Seagate ST-220 disk drives. Even in the mid-nineties, older servers would all too often die mysteriously while we were restarting them for an OS upgrade or after a system move.

Add to this the fact that enterprise computing went through its (first?) mainframe stage, where powering a machine off ran contrary to the goal of using it as much as possible, and you get a cultural mentality in IT that uptime is king, even if system resources sit idle for long stretches.

These days, though, the story has changed greatly. As Vinay documented, Cassatt starts and stops all of its QA and build servers every day. In more than 18,826 power cycles, not a single system failed. In my interactions at customer sites, there have been zero failures in thousands of power cycles. Granted, that's not a scientific study, but it supports the point that unexpected component failure is no longer a common occurrence during power cycles.
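For a rough sense of what "zero failures in 18,826 cycles" can and cannot tell us, here is a minimal sketch of the standard zero-failure confidence bound (the exact form of the so-called "rule of three"). It assumes every power cycle is an independent trial with the same failure probability, which is a simplification for a mixed fleet of servers. The function name is my own and purely illustrative.

```python
def zero_failure_upper_bound(trials: int, confidence: float = 0.95) -> float:
    """One-sided upper bound on the per-trial failure probability,
    given zero observed failures in `trials` independent trials.

    Solves (1 - p) ** trials = 1 - confidence for p.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / trials)


cycles = 18_826  # power-cycle count quoted in the post
bound = zero_failure_upper_bound(cycles)
rule_of_three = 3 / cycles  # the usual quick approximation to the same bound

print(f"95% upper bound on failure probability per cycle: {bound:.2e}")
print(f"Rule-of-three approximation:                      {rule_of_three:.2e}")
```

Under those assumptions, the data is consistent with a per-cycle failure probability no worse than roughly 1.6 in 10,000 at 95% confidence. That is an upper bound rather than an estimate, so the true rate could be far lower.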

Of course, I'm looking for hard data about the effect of power cycling nodes to supplement Vinay's data and support my own anecdotal experience. If you have hard data about the effect of power cycling on system reliability, I would love to hear from you.

For the rest of us, let's pay attention to the realities of our equipment capabilities, and look at the real possibility that powering off a server is often the right thing to do.
