Some performance measurements
The most surprising thing about ZIO for new users is the idea of a
"block" and a "control" (the 512-byte structure) attached to every data
burst. This is often perceived as a serious overhead.
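As a concrete illustration, each channel exposes a data node and a
control node; the sketch below simply reads one 512-byte control.
The control node name is an assumption, mirroring the data node
(/dev/zzero-0-1-data) used later in this section; it is not part of
the measurements that follow.

   /* Minimal sketch: read one control structure from the per-channel
    * "ctrl" char device.  The node name is assumed, not verified. */
   #include <stdio.h>
   #include <unistd.h>
   #include <fcntl.h>

   #define ZIO_CTRL_SIZE 512        /* a ZIO control is 512 bytes */

   int main(void)
   {
       unsigned char ctrl[ZIO_CTRL_SIZE];
       ssize_t n;
       int fd = open("/dev/zzero-0-1-ctrl", O_RDONLY);  /* assumed name */

       if (fd < 0) {
           perror("open");
           return 1;
       }
       n = read(fd, ctrl, sizeof(ctrl));    /* one control per block */
       if (n < 0) {
           perror("read");
           return 1;
       }
       printf("got %zd bytes of control for the next data block\n", n);
       close(fd);
       return 0;
   }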
I have made some performance measurements on a recent PC-class computer.
In short, the overhead of bringing the block over the whole pipeline
(device, trigger, buffer, char device) is about half a microsecond per block.
Measuring the overhead
Using zio-zero.ko and the transparent buffer (called user, and now
the default at ZIO initialization), we can read or write huge amounts
of data.
We might compare with /dev/zero, but that would be unfair, because the
/dev/zero implementation does not use copy_to_user(): its optimization
is specific to that device, so it isn't a meaningful test.
Channel 1 of cset 0 of zio-zero returns random numbers. It uses the
same random generator /dev/urandom does, so this is a fair comparison.
Acquisition in ZIO is cset-wide, so we should disable the other channels,
to avoid the overhead of creating 3 blocks when we are only interested in 1 of them.
   echo 0 > /sys/zio/devices/zzero/cset0/chan0/enable
   echo 0 > /sys/zio/devices/zzero/cset0/chan2/enable
The sample size in zio-zero is 1 byte, and the default block size is
16 samples. Thus, we can read 1 million times 16 bytes from /dev/urandom:
   spusa.root# dd bs=16 count=1000000 if=/dev/urandom > /dev/null
   1000000+0 records in
   1000000+0 records out
   16000000 bytes (16 MB) copied, 2.11017 s, 7.6 MB/s
We can then do the same with the ZIO device:
   spusa.root# dd bs=16 count=1000000 if=/dev/zzero-0-1-data > /dev/null
   1000000+0 records in
   1000000+0 records out
   16000000 bytes (16 MB) copied, 2.46607 s, 6.5 MB/s
The difference is 0.355 seconds, which means 0.355 microseconds per block.
I repeated the test several times, and picked a result around the middle.
The oscillation between runs, on an unloaded machine, is within 1%.
The vmalloc buffer
The vmalloc buffer does even better: there is no need to
read() the data, just build a pointer to it through mmap(). The difference
is not very visible in a test based on /dev/urandom-like data, because
generating the data takes most of the processing time within the test.
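The user-space tool used in the test below (/tmp/zio-cat-file) is not
reproduced here; the following is only a rough sketch of the idea, under
the assumption that the data node supports mmap() when the vmalloc buffer
is selected: consume the control of the next block, then touch the samples
directly in the mapped area instead of read()ing them. The node names, the
mapping length and the offset-0 assumption are illustrative, not the actual
zio-cat-file implementation.

   /* Rough sketch of mmap-based access with the vmalloc buffer.
    * Names, mapping length and offsets are illustrative only. */
   #include <stdio.h>
   #include <unistd.h>
   #include <fcntl.h>
   #include <sys/mman.h>

   #define CTRL_SIZE   512              /* one ZIO control is 512 bytes */
   #define MAP_LEN     (1024 * 1024)    /* arbitrary window size */
   #define BLOCK_SIZE  16               /* default: 16 one-byte samples */

   int main(void)
   {
       unsigned char ctrl[CTRL_SIZE];
       unsigned char *area;
       unsigned sum = 0;
       int cfd, dfd, i;

       cfd = open("/dev/zzero-0-1-ctrl", O_RDONLY);     /* assumed name */
       dfd = open("/dev/zzero-0-1-data", O_RDONLY);
       if (cfd < 0 || dfd < 0) {
           perror("open");
           return 1;
       }
       area = mmap(NULL, MAP_LEN, PROT_READ, MAP_SHARED, dfd, 0);
       if (area == MAP_FAILED) {
           perror("mmap");
           return 1;
       }
       /* consume the control of the next block... */
       if (read(cfd, ctrl, sizeof(ctrl)) != sizeof(ctrl)) {
           perror("read ctrl");
           return 1;
       }
       /* ...then touch its samples in place, without read()ing them
        * (the real offset of the block within the mapping is carried
        * by the control; here we simply assume it starts at offset 0) */
       for (i = 0; i < BLOCK_SIZE; i++)
           sum += area[i];
       printf("sum of %d samples: %u\n", BLOCK_SIZE, sum);

       munmap(area, MAP_LEN);
       close(cfd);
       close(dfd);
       return 0;
   }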
The suggested test here is the following:
   size=16
   while [ $size -lt 64000 ]; do
       echo
       echo $size
       echo $size > /sys/zio/devices/zzero/cset0/trigger/nsamples
       n=$(expr 16 \* 1048576 / $size)
       dd bs=$size count=$n if=/dev/urandom of=/dev/null 2>&1 | grep copied
       /tmp/zio-cat-file /dev/zzero-0-1-data $n > /dev/null
       size=$(expr $size \* 2)
   done
This is done twice, using the two buffers available:
   echo vmalloc > /sys/zio/devices/zzero/cset0/current_buffer
   echo kmalloc > /sys/zio/devices/zzero/cset0/current_buffer
And the result is plotted in the following figure.
The first numbers obtained, for 1048576 reads of 16 bytes, are:

   dd:      2.213750
   kmalloc: 2.763513
   vmalloc: 2.603158
This means that the overhead of reading the full block (both
control and data) is 0.52 microseconds per block, while reading
the control and accessing the data with mmap costs 0.37 microseconds
more per block than doing a plain read() from /dev/urandom.
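To make the arithmetic explicit: the per-block figures are just the
difference of the total times above, divided by the 1048576 reads and
scaled to microseconds. A tiny check:

   /* Check of the per-block overhead arithmetic, using the totals above. */
   #include <stdio.h>

   static double per_block_us(double t_zio, double t_dd, double nreads)
   {
       return (t_zio - t_dd) / nreads * 1e6;   /* seconds -> microseconds */
   }

   int main(void)
   {
       double nreads = 1048576;

       printf("kmalloc (read): %.2f usec/block\n",
              per_block_us(2.763513, 2.213750, nreads));   /* ~0.52 */
       printf("vmalloc (mmap): %.2f usec/block\n",
              per_block_us(2.603158, 2.213750, nreads));   /* ~0.37 */
       return 0;
   }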
For large data sizes, the advantage of accessing data instead
of reading it will be more than this per-block overhead, since mmap
avoids copying the payload. However, I have taken no measurements so far.