How To Interpret Disk Performance Data?

chuckmaginessfabals

Hi. I'm logging performance counters on a 2008 R2 server, and I'm after a bit of advice on interpreting the results from the disk counters, if anyone can help. Sorry if what follows seems convoluted, but I'm confused by a lack of clear rules by which to interpret disk counters reliably.

Most of the MS stuff I've found just reiterates, and elaborates on, the 'Explain'/'Description' text, but one non-MS posting I found casts doubt on the reliability of seemingly key counters like % Disk Time, Current Disk Queue Length and Avg. Disk Queue Length. For example:

% Disk Time
===========
I've read that this counter is 'capped' and therefore does 'not actually measure disk utilization'.

I also find it returns figures of several hundred per cent where RAID is involved, and I'm not sure it's as simple as dividing that by the number of disks in an array to get a meaningful figure.

Current Disk Queue Length
=========================
This counter is, apparently, unreliable, because, 'If requests are queued in the hardware, which is usual for SCSI disks and RAID controllers, the Current Disk Queue Length Counter will show a value of 0, even though requests are queued.'

Avg. Disk Queue Length
======================
This counter, I read, is derived from Avg. Disk sec/Transfer and Disk Transfers/sec, and requires an 'equilibrium assumption' to be factored in, namely, 'that the arrival rate equals the completion rate over the measurement interval. Otherwise, the calculation is meaningless.'
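
For illustration, a derivation like the one described is usually an application of Little's Law (average number in the system = arrival rate x average time in the system). The sketch below assumes that is the relationship meant; the function name and the sample figures are hypothetical and do not come from any log.

```python
def avg_disk_queue_length(transfers_per_sec: float, sec_per_transfer: float) -> float:
    """Estimate the average queue length via Little's Law.

    transfers_per_sec: the Disk Transfers/sec counter value
    sec_per_transfer:  the Avg. Disk sec/Transfer counter value

    Only meaningful if arrivals and completions balance over the
    interval (the 'equilibrium assumption').
    """
    return transfers_per_sec * sec_per_transfer


# Hypothetical interval: 200 transfers/sec at 10 ms per transfer.
estimate = avg_disk_queue_length(200.0, 0.010)
print(f"Estimated average queue length: {estimate:.2f}")
```

The product counts requests that are waiting plus those being serviced, so it says nothing by itself about where in the stack (OS, controller, disk) the queuing happens.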

The corollary of this, apparently, is that the Avg. Disk Queue Length value should not be accepted as reliable except where the current value of Current Disk Queue Length is the same as the previous value of Current Disk Queue Length.

In a recent log, the only instances of this were where the current and previous values for Current Disk Queue Length were 0 (though other values were recorded at other times). Given that 0 is supposedly an unreliable value for Current Disk Queue Length, does this render the Avg. Disk Queue Length values for these intervals meaningless?
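
The rule I'm describing could be sketched like this: keep an Avg. Disk Queue Length reading only when Current Disk Queue Length is unchanged from the previous sample. The `Sample` type, the function name and the log values are all made up for illustration, and whether the rule itself is sound is exactly what I'm asking about.

```python
from typing import List, NamedTuple


class Sample(NamedTuple):
    current_queue: int  # Current Disk Queue Length at the end of the interval
    avg_queue: float    # Avg. Disk Queue Length reported for the interval


def equilibrium_intervals(samples: List[Sample]) -> List[Sample]:
    """Return samples whose Current Disk Queue Length equals the previous sample's.

    The first sample has no predecessor, so it is never returned.
    """
    kept = []
    for prev, cur in zip(samples, samples[1:]):
        if cur.current_queue == prev.current_queue:
            kept.append(cur)
    return kept


# Hypothetical log of five consecutive samples.
log = [Sample(0, 0.4), Sample(0, 0.3), Sample(3, 2.1), Sample(1, 1.7), Sample(1, 0.9)]
trusted = equilibrium_intervals(log)
for s in trusted:
    print(s)
```

Note that this filter has no way of telling a genuine 0 from a 0 that hides requests queued in hardware, which is the heart of my question.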

Any advice on how to interpret these (and any other) disk counters to get meaningful figures on disk performance (specifically, whether the disk is a likely bottleneck) would be greatly appreciated.
 
Hi there,

I investigated this years ago, and what I found was:

every counter is UNreliable.

The problem is quite easy to understand: let's say you have this CPU:

Intel Core 2 Extreme QX9770 - 59,455 MIPS at 3.2 GHz - 18.6 instructions per clock cycle

As you can see, it performs a certain number of calculations per second, and the maximum frequency is 3.2 GHz. If you open Task Manager, you can see that the display refreshes about once per second. It shows you the % of CPU utilization, but it's not really up to date and it's not really accurate (especially on multi-core systems). That's because the OS "talks" to the CPU, which provides the data to Task Manager. This is JUST A CALCULATION: the CPU doesn't count every clock cycle it executes, so the figure is more like an average. This is not a big problem, because usually you just need to know how busy your computer is. In other words, if you open Task Manager and see your CPU at 90%, you should investigate what's happening on the system.


Now let's talk about disks.

The discussion is almost the same. Access times are in milliseconds, but the display refreshes once per second at best. If you open Performance Monitor (which works like Task Manager), you can set the refresh interval.

The problem with % Disk Time is that it's ambiguous: what does it mean? % used, or % reserved for the next use? The answer is: BOTH! This % depends only on the amount of time needed to access the file. If you add RAID into the mix, the value is meaningless, because each controller has its own rules. THEORETICALLY, with a two-disk RAID 0 you could divide the value by two, but this doesn't always hold. With other RAID levels you can't simply divide the number by the number of disks (in RAID 5, for example, one disk's worth of capacity holds parity, and every write generates extra parity I/O).

Now the problem with AVERAGE disk queue length is clear: it is calculated from other counters which, as I said before, are NOT 100% correct, so if you start with "bad" data, the average will be incorrect too!

Now that many companies are moving to virtualization and shared storage, these counters become even less reliable. First you have a HAL (hardware abstraction layer), then the network, and finally the RAID of the SAN you are using!

Depending on what you really need, you can also use this tool:
http://technet.microsoft.com/en-us/sysinternals/bb896646

But honestly it doesn't do anything special...

I don't know if there's a better way to monitor disks, and I don't know of any commercial software that does it. I think the only way might be to use a device-specific program with its own driver. I also think this shouldn't be installed under Windows, but directly on the hardware. As I said, it's just an idea; I don't know if it would work.

Let me know if I've answered your questions.
 
Thanks very much for the reply - it corroborates what I've read elsewhere about the unreliability of perfmon disk counters. I'll have a look at DiskMon - thanks for the link - but if I understand you correctly, all or many of the same reliability caveats apply?
 
Yes, you're right, but MAYBE (I've never tried or tested it) it's a bit more precise... if you look at the time column, it's really precise. Anyway, it doesn't provide much information. You could try exporting the list to Excel but, as I said, you won't get much info out of it.
 