I'm new to optimizing disk I/O performance. I compared the performance of reading a file with and without direct I/O enabled, using a chunk size of 512 KiB. Since direct I/O reads data from disk directly into a user-space buffer, I expected it to be faster than buffered I/O (the data is not cached before each measurement). However, the result is the opposite: buffered I/O is much faster than direct I/O. But if I change the chunk size to 2 MiB, the speeds are equal. Here are the test results:
ps@701083:/mnt/md0/cuda-learning$ sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
ps@701083:/mnt/md0/cuda-learning$ dd if=nodes-1G of=/dev/null iflag=direct bs=512K count=1024
1024+0 records in
1024+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 1.32862 s, 404 MB/s
ps@701083:/mnt/md0/cuda-learning$ sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
ps@701083:/mnt/md0/cuda-learning$ dd if=nodes-1G of=/dev/null bs=512K count=1024
1024+0 records in
1024+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 0.365581 s, 1.5 GB/s
ps@701083:/mnt/md0/cuda-learning$ sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
ps@701083:/mnt/md0/cuda-learning$ dd if=nodes-1G of=/dev/null bs=2M count=256
256+0 records in
256+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 0.370193 s, 1.5 GB/s
ps@701083:/mnt/md0/cuda-learning$ sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
ps@701083:/mnt/md0/cuda-learning$ dd if=nodes-1G of=/dev/null iflag=direct bs=2M count=256
256+0 records in
256+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 0.36575 s, 1.5 GB/s
Output of `df -h`:
ps@701083:/mnt/md0/cuda-learning$ df -h
Filesystem Size Used Avail Use% Mounted on
udev 32G 0 32G 0% /dev
tmpfs 6.3G 1.3M 6.3G 1% /run
/dev/mapper/ubuntu--vg-ubuntu--lv 117G 28G 83G 26% /
tmpfs 32G 0 32G 0% /dev/shm
tmpfs 5.0M 4.0K 5.0M 1% /run/lock
tmpfs 32G 0 32G 0% /sys/fs/cgroup
/dev/md0 3.5T 756G 2.6T 23% /mnt/md0
/dev/nvme0n1p2 976M 204M 705M 23% /boot
/dev/nvme0n1p1 511M 6.7M 505M 2% /boot/efi
tmpfs 6.3G 0 6.3G 0% /run/user/1000
ps@701083:/mnt/md0/cuda-learning$
Why is buffered I/O so much faster than direct I/O at 512 KiB, and why does the gap disappear at 2 MiB?