# Some performance measures

The most surprising thing about ZIO for new users is the idea of a
"block" and a "control" (the 512-byte thing) attached to every data
burst. This is considered a serious overhead.

This is the block:

![](/uploads/0be03d422b12400f848c332b5c18daee/zio-block.png)

I've made some performance measurements on a recent PC-class computer.
In short, the overhead of bringing the block over the whole pipeline
(device, trigger, buffer, char device) is about 0.35 microseconds.
## Measuring the overhead

With `zio-zero.ko` and the transparent buffer (called `user`, and now
the default at ZIO initialization), we can read or write huge amounts
of data.

We might compare with `/dev/zero`, but that would be unfair: the
`/dev/zero` implementation uses `__clear_user()` rather than
`memset()` followed by `copy_to_user()`. This optimization is specific
to `/dev/zero`, so that device isn't a meaningful benchmark.

Channel 1 of cset 0 of `zio-zero` returns random numbers. It uses
`get_random_bytes()` like `/dev/urandom` does, so this is a fair
comparison.

Acquisition in ZIO is cset-wide, so we should disable the other
channels, to avoid the overhead of three blocks when we are only
interested in one of them:

    echo 0 > /sys/zio/devices/zzero/cset0/chan0/enable
    echo 0 > /sys/zio/devices/zzero/cset0/chan2/enable
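
If a cset has more channels, the same idea can be wrapped in a tiny
helper. This is only a sketch, assuming the sysfs layout shown above;
`disable_other_channels` is a hypothetical name, not part of ZIO:

```shell
# Hypothetical helper (not part of ZIO): disable every channel of a
# cset except the one under test, given the cset's sysfs directory.
disable_other_channels() {
    cset_dir=$1; keep=$2
    for enable in "$cset_dir"/chan*/enable; do
        case "$enable" in
        "$cset_dir/chan$keep/enable") echo 1 > "$enable" ;;
        *) [ -e "$enable" ] && echo 0 > "$enable" ;;
        esac
    done
}

# With zio-zero loaded, channel 1 would be kept enabled with:
#   disable_other_channels /sys/zio/devices/zzero/cset0 1
```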

The sample size in `zio-zero` is 1 byte, and the default block size is
16 samples (see `/sys/zio/devices/zzero/cset0/trigger/nsamples`). So
we can read 16 bytes one million times from `/dev/urandom`:

    spusa.root# dd bs=16 count=1000000 if=/dev/urandom > /dev/null
    1000000+0 records in
    1000000+0 records out
    16000000 bytes (16 MB) copied, 2.11017 s, 7.6 MB/s

We can then do the same with the ZIO device:

    spusa.root# dd bs=16 count=1000000 if=/dev/zzero-0-1-data > /dev/null
    1000000+0 records in
    1000000+0 records out
    16000000 bytes (16 MB) copied, 2.46607 s, 6.5 MB/s

The difference is 0.355 seconds over one million blocks, which means
0.355 microseconds per block. I repeated the test several times and
picked a run near the middle; the variation between runs, on an
unloaded machine, is within 1%.
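
The per-block figure is just the difference of the two timings spread
over the blocks transferred; a quick recomputation with `awk`, using
the numbers reported by `dd` above:

```shell
# Per-block overhead: difference of the two dd timings, divided by
# the one million blocks that were transferred.
awk 'BEGIN {
    urandom = 2.11017    # seconds, dd from /dev/urandom
    zio     = 2.46607    # seconds, dd from /dev/zzero-0-1-data
    blocks  = 1000000
    printf "%.4f us per block\n", (zio - urandom) / blocks * 1e6
}'
```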
## The vmalloc buffer

The new `vmalloc` buffer does even better: there is no need to
`read()` the data, just map a pointer to it. The difference is not
very big with `/dev/urandom` because generating the random data takes
most of the processing time in the test.

The suggested test here is the following:

    size=16; while [ $size -lt 64000 ]; do
        echo
        echo $size
        echo $size > /sys/zio/devices/zzero/cset0/trigger/nsamples
        n=$(expr 16 \* 1048576 / $size)
        dd bs=$size count=$n if=/dev/urandom of=/dev/null 2>&1 | grep copied
        /tmp/zio-cat-file /dev/zzero-0-1-data $n > /dev/null
        size=$(expr $size \* 2)
    done
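
To turn the loop's output into numbers that can be tabulated or
plotted, the elapsed seconds can be pulled out of dd's "copied" line.
A minimal sketch with `sed`; the sample line is the one obtained above
for `/dev/urandom`:

```shell
# Extract the elapsed seconds from a dd summary line.
line='16000000 bytes (16 MB) copied, 2.11017 s, 7.6 MB/s'
secs=$(printf '%s\n' "$line" | sed -n 's/.*copied, \([0-9.]*\) s.*/\1/p')
echo "$secs"
```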

This is done twice, using the two buffers available:

    echo vmalloc > /sys/zio/devices/zzero/cset0/current_buffer
    echo kmalloc > /sys/zio/devices/zzero/cset0/current_buffer

The result is plotted in the following figure. The first numbers
obtained, for 1048576 reads of 16 bytes, are:

    dd: 2.213750
    kmalloc: 2.763513
    vmalloc: 2.603158

This means that the overhead of reading the full block (both control
and data) is 0.52 microseconds per block, while reading the control
and accessing the data with `mmap` costs 0.37 microseconds per block
over a plain read.
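
Again the per-block figures come from dividing the extra time by the
1048576 reads performed; a quick recomputation:

```shell
# Per-block overheads for the 16-byte case: each total is compared
# with the plain dd baseline and divided by the 1048576 reads.
awk 'BEGIN {
    dd = 2.213750; kmalloc = 2.763513; vmalloc = 2.603158; n = 1048576
    printf "read (kmalloc): %.2f us per block\n", (kmalloc - dd) / n * 1e6
    printf "mmap (vmalloc): %.2f us per block\n", (vmalloc - dd) / n * 1e6
}'
```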

For large data sizes, the saving from accessing the data in place
instead of reading it will be larger than this per-block overhead,
because the copy itself is avoided. However, I have taken no
measurements so far.

### Files