Rather than trying to cover this topic (and the other topics in this series) exhaustively in a single post, what I am going to do is provide a "first look" post now, then use labels to tie follow-up posts with relevant information back to it. The label for this topic will be "measure".
In my next installment, I'll introduce analysis of those metrics against the service level objectives (SLOs) the business requires. That post and future related posts will be labeled "analyze".
In the final installment of the series, I'll describe the techniques and technologies available to automatically adjust these systems so that they run within SLO parameters. Posts related to that topic will be labeled "respond".
As noted earlier, my objective is to survey the technologies, academic research, and so on for each of these topics, in an attempt to shed light on the science and technology that enables service level automation.
How do we measure quality of service?
Measuring quality of service is a complex problem, though not because it is hard to take measurements from information systems and business functionality. I (and I bet you) could list dozens of technical measurements that can be made on an application or service, each reflecting some aspect of its current health. For example (I'll sketch a couple of these in code after the list):
- System statistics such as CPU utilization or free disk space, as reported by SNMP
- Response to a ping or HTTP request
- Checksum processing on network data transfers
- Any of dozens of Web Services standards
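To make a couple of those concrete, here is a minimal sketch in Python, using only the standard library, of two such checks: timing an HTTP request and reading free disk space. The URL, mount point, and output format are hypothetical placeholders, not values pulled from any particular tool.

```python
import shutil
import time
import urllib.request


def http_response_time(url: str, timeout: float = 5.0) -> float:
    """Time a simple HTTP GET; raises if the service does not answer."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read(1024)  # read a little to confirm the service is really responding
    return time.monotonic() - start


def free_disk_fraction(path: str = "/") -> float:
    """Fraction of disk space still free on the given mount point."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


if __name__ == "__main__":
    # Hypothetical endpoint and mount point, purely for illustration.
    latency = http_response_time("http://example.com/")
    free = free_disk_fraction("/")
    print(f"HTTP latency: {latency:.3f}s, free disk: {free:.1%}")
```

In a real environment these raw numbers would come from SNMP agents or a monitoring tool's own collectors; the point is simply that each check yields a single number that says something, but not everything, about health.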
The real problem is that human perception of quality of service isn't (typically) based on any one of these measurements, but on a combination of measurements, where the specific combination may change based on when and how a given business function is being used.
For example, how do you measure the utilization of a Citrix environment? Measuring sessions/instances is a good start, but (as noted before with WTS) what happens when all sessions consume a large amount of CPU at once? CPU utilization, in turn, could fluctuate wildly as sessions become more or less active. Then again, what about memory utilization or I/O throughput? Each of these could become critical completely independently of the others already mentioned.
No, what is needed is something more mathematical: one or more indexes generated from a combination of the base metrics retrieved from the managed system.
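As a strawman only (the metric names, weights, and normalization here are invented, not taken from any standard or product), here is roughly what I mean, using the kinds of base metrics from the Citrix example above. The score is just a weighted blend of normalized utilization figures, and the weight table could be swapped out depending on when and how the business function is being used.

```python
# Strawman composite health index; every weight and threshold here is invented.
BASELINE_WEIGHTS = {
    "cpu_utilization": 0.35,     # fraction of CPU in use, 0.0 - 1.0
    "memory_utilization": 0.25,  # fraction of memory in use
    "io_saturation": 0.20,       # how close I/O throughput is to its ceiling
    "session_load": 0.20,        # active sessions / provisioned session capacity
}


def health_index(metrics, weights=BASELINE_WEIGHTS):
    """Return a 0-100 score where 100 means fully healthy.

    Each metric is assumed to already be normalized to 0.0 (idle)
    through 1.0 (saturated); anything above 1.0 is clamped.
    """
    total_weight = sum(weights.values())
    pressure = sum(weights[name] * min(metrics.get(name, 0.0), 1.0)
                   for name in weights) / total_weight
    return round(100.0 * (1.0 - pressure), 1)


# Example: a busy Citrix-like host.
sample = {
    "cpu_utilization": 0.72,
    "memory_utilization": 0.64,
    "io_saturation": 0.30,
    "session_load": 0.85,
}
print(health_index(sample))  # 35.8 on this invented scale
```

The interesting design question is not the arithmetic but the weighting: during a period when I/O is the scarce resource, for instance, a different weight table would better reflect perceived quality of service.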
There are tools that do this. They range from the basic capabilities available in a good automation tool to the sophisticated evaluation and computation available in more specialized monitoring tools.
What I am still searching for are standard metrics being collected by these tools, especially industry-standard metrics and/or indexes that demonstrate the health of a datacenter or its individual components. I'll talk more about what I find in the future, but I welcome you to contribute here with links, comments, etc., to point me in the right direction.