Modern disks have minds of their own. When combined with an intelligent disk array or RAID unit, their behavior can be quite confusing. To make it more challenging, the disk monitoring utilities print a seemingly random selection of metrics. But don't worry, this mess can be sorted out, and all you need are the standard utilities. (3,100 words)
Q:
I monitor my disks with iostat and sar, but these two tools don't print the same numbers. There seem to be many measurements available, but it's not clear what they all mean. Finally, what difference does a disk array make?
-- Diskless in Dodgeville
A:
There are several options to iostat, and there are indeed slightly different metrics reported by sar. To get to the bottom of this I'll start by describing how disks really work and the low-level measurements that the kernel collects. The numbers printed by iostat and sar make sense when you see how they are derived from the underlying measurements.
If you've ever tried to get the same numbers from two copies of the same performance command, let alone two different commands, you've discovered that it is impossible to synchronize the measurements. Every disk access updates the metrics, and it's too hard to start the commands at exactly the same time and keep them running in step.
What is a disk, really?
Let's start by trying to understand the things that make up a modern
disk. There is more to today's disks than you may think. For a
start, the disk drive itself contains a CPU and memory. The CPU does
more than copy data to and from the SCSI bus. It can handle perhaps
64 commands at the same time, and will figure out the best
order to perform them in. If there are two files being read from the
disk at once, the requests will be interleaved as they arrive at the
disk drive. The drive can sort them out, and read a larger amount from
one file before seeking to the other file. Since the overall number of
seeks is reduced, overall performance is better. From the point of view
of a single file, however, the access times are less predictable. Some
accesses will be faster than expected, and some will be slower.
The RAM on the drive is used to hold the data for all those commands. It may also be used to hold prefetched data, i.e., the on-disk CPU guesses what you will ask for next and tries to get it in advance if it is lightly loaded. When a disk gets busy, the prefetch "guess" could get in the way of a real request, so the prefetch is skipped, and you may find that a busy disk with several competing sequential access streams performs worse than you would expect.
With SCSI, a disk is accessed by block number only. There is no way you can tell which sector, cylinder, or head is involved -- you just ask for the block. For some time now, SCSI disks have taken advantage of this to get higher densities. Even though the operating system and the Unix file system still work in cylinders, the entries in /etc/format.dat that define the disk geometry are almost pure fiction. The total size in blocks is all that matters.
On the "spinning rust" itself, the number of sectors per track varies. If you have ever played around with an old-style open reel audio tape drive you know that the tape speed can be varied, and that the higher the tape speed, the better the sound. On a disk, the outermost track passes the head faster than the innermost track. They both take the same time, but the circumference is longer. Modern disks store more data on the higher quality outermost track than they do on the innermost track, by turning up the data rate and putting more sectors on it. Why does this matter? Well, the disk blocks are numbered from outside edge in. When you slice a disk for use, you should find slice 0 faster than slice 7 because at 0 there are more sectors per track and a higher data rate. The performance difference from start to end can be as much as 30 percent, but performance falls off quite slowly during the first two-thirds of the disk.
Every now and again the disk's CPU will ignore the commands you are sending it, and go through a thermal recalibration cycle. This ensures that the head and the tracks are all perfectly aligned, but if it occurs at the moment you are attempting to access it, you may wait a lot longer than usual for your data. Some disk vendors sell special "multimedia" disk drives that try to avoid this problem. They are useful if you are trying to do I/O in real-time or replay a video from disk smoothly.
If you don't do anything for a while, many newer models start to power themselves down. This is a result of their internal Energy Star modes, which turn off sections of circuitry when they are not in use. Also, a firmware bug sometimes reports an error on lightly-used disk drives. The reason? The disk drive's CPU would go to sleep and upon awakening forget which speed zone it was in. After a while it would reset, seek, and recover, but there would be a worrying console message about read retries.
I don't have space here to talk about the SCSI bus itself, and there are no direct SCSI bus-specific metrics in the current OS.
The SCSI Host Bus Adapter
You might think of this as the SCSI controller, but all devices on the
SCSI bus act as controllers. The name Host Bus Adapter (HBA) indicates
that it connects the SCSI bus to the host computer bus. For SPARC
computers this is the SBus.
The SCSI HBA can be quite a complex and intelligent device, or it can be quite simple-minded. The difference, from a performance perspective, is that a simple-minded device keeps interrupting the OS so that it can be told the next thing to do. The simplest (and most common) SBus SCSI HBA uses the "esp" device driver, and every SCSI command takes several interrupts to complete. The most intelligent (and expensive) SBus SCSI HBA uses the "isp" device driver; the SWI/S and DWI/S cards are "isp" based. It completes a whole SCSI operation on its own and interrupts only once, when it is finished, which saves a lot of system CPU time on the host. The latest SBus SCSI HBA uses the "fas" driver; it is found on UltraSPARC systems and obsoletes the "esp." It is an improvement on the "esp," but less complex and expensive than the "isp," which remains the high-end option.
The SPARC Storage Array (SSA) and Fibre Channel
The SSA contains six "isp"-style SCSI buses and a dedicated SPARC processor that keeps them fed, communicates with the host over Fibre Channel, and manages non-volatile storage. This processor performs the same kind of optimizations as the disk drive's own CPU: commands waiting to be completed are sorted into queues per device, and adjacent I/Os can be coalesced into a single I/O for the disk. If fast writes are enabled for a disk, writes are put into non-volatile storage (and written to disk later) and the host system is immediately told that the I/O has completed. This provides a dramatic speedup for many operations. The Fibre Channel's Serial Optical Controller (SOC) acts like an intelligent SCSI HBA with a limit of 256 outstanding commands.
The Solaris 2 operating system
There are two levels of driver in the OS: the generic SCSI disk driver, "sd" (or "ssd" for disks in a SPARCstorage Array), and the HBA-specific "esp," "fas," "isp," or "soc" driver that sends the SCSI commands to the device. At the generic level a read or write system call becomes an entry in a queue of commands waiting to be sent to a device. If the device's own queue is full, or the SCSI bus is very busy, the command may wait a long time.
When a read command is sent to the disk it becomes active, and inside the disk the queue of active commands is processed. When the disk has the data ready it is sent back to the HBA, which uses DMA to copy the data into memory. When the transfer is done the HBA interrupts the OS, which does some housekeeping work then returns from the read system call.
Solaris maintains a full set of counters and high-resolution timers
that are updated by each command. The initial arrival of a command
causes the wait queue length to be incremented and the time spent at
the previous queue length is accumulated as the product of length and
time. When a command is issued to the disk, another set of metrics count
the length and time spent in the active queue. The time that each queue
is empty is also noted, as is the size of the transfer. The data
structure maintained by the kernel is described in the
kstat(3)
manual page as follows:
typedef struct kstat_io {
        /*
         * Basic counters.
         */
        u_longlong_t    nread;          /* number of bytes read */
        u_longlong_t    nwritten;       /* number of bytes written */
        ulong_t         reads;          /* number of read operations */
        ulong_t         writes;         /* number of write operations */
        /*
         * Accumulated time and queue length statistics.
         *
         * Time statistics are kept as a running sum of "active" time.
         * Queue length statistics are kept as a running sum of the
         * product of queue length and elapsed time at that length --
         * i.e., a Riemann sum for queue length integrated against time.
         *
         *              ^
         *              |                       _________
         *              8                       | i4    |
         *              |                       |       |
         *      Queue   6                       |       |
         *      Length  |       _________       |       |
         *              4       | i2    |_______|       |
         *              |       |           i3          |
         *              2_______|                       |
         *              |   i1                          |
         *              |_______________________________|
         *              Time->  t1      t2      t3      t4
         *
         * At each change of state (entry or exit from the queue),
         * we add the elapsed time (since the previous state change)
         * to the active time if the queue length was non-zero during
         * that interval; and we add the product of the elapsed time
         * times the queue length to the running length*time sum.
         *
         * This method is generalizable to measuring residency
         * in any defined system: instead of queue lengths, think
         * of "outstanding RPC calls to server X."
         *
         * A large number of I/O subsystems have at least two basic
         * "lists" of transactions they manage: one for transactions
         * that have been accepted for processing but for which processing
         * has yet to begin, and one for transactions which are actively
         * being processed (but not done).  For this reason, two cumulative
         * time statistics are defined here: pre-service (wait) time,
         * and service (run) time.
         *
         * The units of cumulative busy time are accumulated nanoseconds.
         * The units of cumulative length*time products are elapsed time
         * multiplied by queue length.
         */
        hrtime_t        wtime;          /* cumulative wait (pre-service) time */
        hrtime_t        wlentime;       /* cumulative wait length*time product */
        hrtime_t        wlastupdate;    /* last time wait queue changed */
        hrtime_t        rtime;          /* cumulative run (service) time */
        hrtime_t        rlentime;       /* cumulative run length*time product */
        hrtime_t        rlastupdate;    /* last time run queue changed */
        ulong_t         wcnt;           /* count of elements in wait state */
        ulong_t         rcnt;           /* count of elements in run state */
} kstat_io_t;

From these measures it is possible to calculate all the numbers that are printed by iostat and sar -d. For some example code that does these calculations you could look at the per-disk iostat class in the SE toolkit. I'll summarize the math in the next section.
The basic requirement is to have two copies of the data, separated by a time interval. You can then work out the statistics for that time interval. Every disk on the system has its own copy of this data, so you need to store the data for every disk, wait a while, then read it again for every disk. Each measurement has its own high-resolution 64-bit timestamp, and the timestamp is guaranteed to be stable and monotonic.
Some older operating systems (e.g., SunOS 4) use the system's clock
tick counter. This is subject to change if someone sets the date, and
may be temporarily sped up or slowed down if network-wide time
synchronization is in use. When this occurs some performance metrics
can become inaccurate. The hires counter (which can be accessed by way of the
gethrtime(3C)
library call) solves this problem for Solaris 2.
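As a concrete illustration, here is a minimal sketch of taking those two snapshots with the Solaris kstat library (compile with -lkstat). The disk name sd3 and the one-second interval are examples only; a real tool would walk the kstat chain and handle every disk rather than looking one up by name.

/*
 * Sketch: take two kstat_io_t snapshots for one disk and print the
 * raw differences.
 */
#include <stdio.h>
#include <unistd.h>
#include <kstat.h>

int main(void)
{
    kstat_ctl_t *kc;
    kstat_t *ksp;
    kstat_io_t old_kio, new_kio;
    hrtime_t hr_etime;
    double etime;

    if ((kc = kstat_open()) == NULL) {
        perror("kstat_open");
        return 1;
    }
    /* NULL module and -1 instance act as wildcards: match by name only */
    if ((ksp = kstat_lookup(kc, NULL, -1, "sd3")) == NULL) {
        fprintf(stderr, "sd3: no such kstat\n");
        return 1;
    }

    kstat_read(kc, ksp, &old_kio);      /* first snapshot */
    sleep(1);                           /* the measurement interval */
    kstat_read(kc, ksp, &new_kio);      /* second snapshot */

    /*
     * The interval comes from the kstat's own high-resolution timestamp,
     * so it is immune to someone changing the system date.  On a totally
     * idle disk this timestamp may not advance, so a real tool would also
     * record gethrtime() at each snapshot.
     */
    hr_etime = new_kio.wlastupdate - old_kio.wlastupdate;
    etime = hr_etime / 1000000000.0;

    printf("interval %.3f s: %lu reads, %lu writes\n", etime,
        new_kio.reads - old_kio.reads, new_kio.writes - old_kio.writes);

    kstat_close(kc);
    return 0;
}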
Disk Statistics
% sar -d 1 1

SunOS hostname 5.5 Generic sun4u    05/23/96

11:34:56   device        %busy   avque   r+w/s   blks/s  avwait  avserv
11:34:57   sd3               0     0.0       0        0     0.0     0.0

% iostat
      tty          sd3           cpu
 tin tout Kps tps serv  us sy wt id
   0    0   2   0   80   1  1  1 98

% iostat -D
          sd3
rps wps util
  0   0  0.9

% iostat -x
                              extended disk statistics
disk      r/s  w/s   Kr/s   Kw/s  wait  actv  svc_t  %w  %b
sd3       0.2  0.1    1.3    1.2   0.0   0.0   79.6   0   1

I have shown sar -d and the three forms of iostat. As you can see, some of the headings match and some look as if they might be the same, but with different names. Let's go through them in turn.
hr_etime = disk.wlastupdate - old_wlastupdate;   /* interval in nanoseconds */
etime = hr_etime / 1000000000.0;                 /* the same interval in seconds */

/* %busy (sar), util (iostat -D), %b (iostat -x): percent of time the disk was active */
r_pct = busy = util = (disk.rtime - old_rtime) * 100.0 / hr_etime;

/* average queue lengths over the interval; avque (sar) is waiting plus active */
avwait_len = (disk.wlentime - old_wlentime) / hr_etime;
avrun_len  = (disk.rlentime - old_rlentime) / hr_etime;
avque = avwait_len + avrun_len;

/* operations per second: r/s, w/s, tps */
rps = (disk.reads  - old_reads)  / etime;
wps = (disk.writes - old_writes) / etime;
rwps = tps = rps + wps;

/* data rates: Kr/s, Kw/s, Kps, and blks/s (512-byte blocks, so two per Kbyte) */
krps = (disk.nread    - old_nread)    / etime / 1024.0;
kwps = (disk.nwritten - old_nwritten) / etime / 1024.0;
kps = krps + kwps;
blkps = kps * 2;

/* average times in milliseconds: avwait, avserv, and the combined svc_t */
avwait = avwait_time = tps > 0 ? avwait_len / tps * 1000.0 : 0.0;
avserv = avrun_time  = tps > 0 ? avrun_len  / tps * 1000.0 : 0.0;
svc_t = tps > 0 ? avque / tps * 1000.0 : 0.0;

/* %w (iostat -x): percent of time there was something in the wait queue */
w_pct = (disk.wtime - old_wtime) * 100.0 / hr_etime;
These calculations come from the per-disk iostat class, which is part of the SE toolkit. Of course, if you don't like the combinations of metrics that iostat or sar print, you can use SE to build your own custom utility. It really is a few minutes of work if you use the xiostat.se script that clones iostat -x as a starting point.
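To show what the core of such a custom utility might look like, here is a hedged sketch in C (rather than SE) of a function that turns two kstat_io_t snapshots into iostat -x style numbers using the formulas above; the function name and output format are my own invention. It could be called once per disk per interval, with the snapshots collected as in the earlier sketch.

/*
 * Sketch only: derive iostat -x style metrics from two kstat_io_t
 * snapshots, following the formulas shown earlier.
 */
#include <stdio.h>
#include <kstat.h>

static void
print_xdisk(const char *name, const kstat_io_t *o, const kstat_io_t *n)
{
    hrtime_t hr_etime = n->wlastupdate - o->wlastupdate;   /* nanoseconds */
    double etime = hr_etime / 1000000000.0;                /* seconds */
    if (hr_etime <= 0)
        return;                                            /* nothing happened */

    double rps  = (n->reads  - o->reads)  / etime;
    double wps  = (n->writes - o->writes) / etime;
    double krps = (n->nread    - o->nread)    / etime / 1024.0;
    double kwps = (n->nwritten - o->nwritten) / etime / 1024.0;

    double wait = (double)(n->wlentime - o->wlentime) / hr_etime;  /* avg wait queue */
    double actv = (double)(n->rlentime - o->rlentime) / hr_etime;  /* avg active queue */
    double tps  = rps + wps;
    double svc_t = tps > 0.0 ? (wait + actv) / tps * 1000.0 : 0.0; /* ms */
    double w_pct = (double)(n->wtime - o->wtime) * 100.0 / hr_etime;
    double b_pct = (double)(n->rtime - o->rtime) * 100.0 / hr_etime;

    printf("%-8s %5.1f %5.1f %7.1f %7.1f %5.1f %5.1f %7.1f %3.0f %3.0f\n",
        name, rps, wps, krps, kwps, wait, actv, svc_t, w_pct, b_pct);
}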
An extreme example
Here is a SPARCstorage Array disk being hit by an extreme
write load benchmark. I want to show a different set of metrics from
any of the above tools -- I'll leave the required SE script as an
exercise for the reader.
Disk      %busy    avque     await    avserv   rps   wps   krps   kwps
ssd0     100.00    907.1   17254.9    6784.1     0    38      0   4219

The disk is ssd0, which is the system disk. It is 100 percent busy, and there are 907 commands queued on it. On average a command waits for 17 seconds, then gets sent to the SPARCstorage Array, where it sits in the active queue for 6.8 seconds. 38 commands are processed per second, adding up to 4219 kilobytes per second being written.
This works out to 111 KB per write (4219/38) on average. The total time in the queues is 17254.9 + 6784.1 = 24039 ms (24 seconds); divided among the 907.1 queued commands, that is 26.5 ms per I/O, which is about the length of time you would expect a large I/O to take. The problem is that the benchmark keeps issuing writes faster than the disk can keep up. The 6784 ms spent in the SPARCstorage Array's active queue, at 26.5 ms per I/O, corresponds to exactly 256 commands, which is the limit on outstanding commands for the SOC interface, as I described earlier.
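For readers who like to see the arithmetic spelled out, this trivial sketch reproduces the back-of-the-envelope numbers from the table above.

/* Reproduce the back-of-envelope arithmetic for the extreme example. */
#include <stdio.h>

int main(void)
{
    double avque  = 907.1;     /* commands queued, from the table */
    double await  = 17254.9;   /* ms spent waiting in the host queue */
    double avserv = 6784.1;    /* ms spent in the SSA's active queue */
    double wps    = 38.0;      /* writes per second */
    double kwps   = 4219.0;    /* KB written per second */

    double per_io_ms    = (await + avserv) / avque;   /* ~26.5 ms per I/O */
    double kb_per_write = kwps / wps;                 /* ~111 KB per write */
    double soc_queue    = avserv / per_io_ms;         /* ~256 commands in the SOC */

    printf("%.1f ms per I/O, %.0f KB per write, %.0f commands queued in the SSA\n",
        per_io_ms, kb_per_write, soc_queue);
    return 0;
}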
The symptom reported to me as a problem was that the system locked up for the duration of the test. The reason it locked is that just about any command will need to page-in its code from the system disk, and that page-in is being squeezed out by the benchmark. When running the test on another disk the system was fine.
Next month
I'll take a look at compiler options for regular use and also some
extra options that improve floating point performance on UltraSPARC
systems.
About the author
Adrian Cockcroft joined Sun in
1988, and currently works as a performance specialist for the
Server Division of SMCC. He wrote Sun Performance and Tuning: SPARC and Solaris, published by
SunSoft Press PTR Prentice Hall. Reach him at
adrian.cockcroft@sunworld.com.
The answers to questions posed in this column are those of the author, and do not represent the views of Sun Microsystems Inc.