Monday, December 18, 2006

The Organic Cluster

Several weeks have gone by since my last post (OK, several months...), so lots to talk about in the next few days. I want to start the conversation again by talking about one of the really cool advantages of Service Level Automation in enterprise distributed application architectures.

We've been talking about distributed application architectures, and how their tight coupling to underlying hardware and middleware has left most production applications running in infrastructure "silos", where spare capacity is locked up and unavailable to other systems.

Traditionally, nowhere has this been more obvious than with your database servers. Because they were never designed to scale horizontally, these servers have relied on heavy overprovisioning to guarantee consistent performance regardless of actual demand. Do you need more capacity for your database than the current server provides? Then buy a bigger box and figure out how to migrate the database without crippling the business.

The exception to this story (right now) is Oracle 10g RAC (Real Application Clusters). RAC is a grid-based database engine that uses clustering technology to distribute the RDBMS across several servers. This greatly increases availability and eases the upgrade path when new hardware is required.

Unfortunately, as designed, the Oracle cluster still requires the administrator to allocate excess capacity in case of high demand. Each node of the cluster must be running at all times (in the default architecture), which means each server must be dedicated to RAC whether it is needed or not.

A good Service Level Automation environment gives the administrator an interesting new capability, however. Because an Oracle cluster can run with fewer than the maximum number of servers defined for it (which is how the database keeps running when a server node is lost), it is possible to capture the cluster in an image library and then allocate nodes only according to actual demand. No weird code or configuration changes to Oracle, and no spare capacity "dedicated" to RAC.
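To make that concrete, here is a minimal sketch of what a demand-based node policy might look like. Everything below is a hypothetical illustration of the idea, not Cassatt's actual product API or configuration format:

```python
# Hypothetical policy for running a RAC cluster under Service Level Automation.
# All field names are made up for illustration -- they capture the idea of
# drawing nodes from a shared pool based on demand, nothing more.
rac_policy = {
    "image": "oracle-10g-rac-node",  # captured node image in the image library
    "min_nodes": 2,                  # smallest set the cluster can serve from
    "max_nodes": 8,                  # upper bound on what RAC may draw from the pool
    "scale_up_threshold": 0.80,      # add a node above 80% average utilization
    "scale_down_threshold": 0.40,    # retire a node below 40% average utilization
}
```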

If more capacity is needed, the SLA environment grabs it from the pool of spare capacity available to all applications, not just RAC. When demand exceeds the safety margins of the current "live" set of servers, the SLA environment boots up a new node, and RAC "rediscovers" its "lost" node. When demand falls away, the SLA environment shuts down an unneeded node; RAC simply detects that a node went down and keeps on chugging.
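In pseudocode, the control loop behind that behavior is roughly the following. This is a sketch under the assumptions of the `rac_policy` dict above; `SimulatedCluster` and its methods are stand-ins I invented for whatever metrics and provisioning hooks the SLA environment actually provides:

```python
import random
import time

class SimulatedCluster:
    """Stand-in for the real cluster: tracks node count and fakes a demand signal."""

    def __init__(self, nodes):
        self.nodes = nodes

    def measure_utilization(self):
        # A real SLA environment would read actual service-level metrics here.
        return random.uniform(0.2, 1.0)

    def boot_node(self, image):
        self.nodes += 1
        print(f"booted {image}; cluster is now {self.nodes} nodes")

    def shutdown_node(self):
        self.nodes -= 1
        print(f"shut down a node; cluster is now {self.nodes} nodes")


def control_loop(policy, cluster, cycles=10):
    """Keep utilization inside the safety margins by growing/shrinking the node set."""
    for _ in range(cycles):
        utilization = cluster.measure_utilization()

        if utilization > policy["scale_up_threshold"] and cluster.nodes < policy["max_nodes"]:
            # Grab a server from the shared spare pool and boot the node image;
            # RAC "rediscovers" the node once it joins the cluster.
            cluster.boot_node(policy["image"])
        elif utilization < policy["scale_down_threshold"] and cluster.nodes > policy["min_nodes"]:
            # Release a server back to the pool; RAC treats it as a lost node
            # and keeps serving from the remaining members.
            cluster.shutdown_node()

        time.sleep(1)  # sample interval; a real loop would smooth the signal

control_loop(rac_policy, SimulatedCluster(nodes=2))
```

A production loop would obviously want hysteresis and smoothing so a momentary spike doesn't thrash nodes in and out of the pool, but the core idea really is this simple.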

Lest you think this is a pipe dream, my employer Cassatt has this running in its labs and has demonstrated it in proofs of concept for prospective customers. And they like it. Which is another reason why Service Level Automation is changing the way IT runs.
