I have a database-like service that serves queries from an embedded key-value store (LMDB), where:

1. the data is orders of magnitude larger than main memory;
2. the data is rarely written to or updated (once per hour or less);
3. the data is heavily read by many concurrent client-connection query threads; and
4. different clients care about different parts of the data, so there's little spatial locality of queries, making the page cache fairly useless.
To optimize for IO throughput, the dataset served by this service is stored on a Linux MDRAID RAID0 array backed by NVMe disks. With this configuration, I would expect random-read IO performance, even with the most naive blocking read(2) implementation inside the storage engine, to at least scale with the number of concurrent connection threads, until the kernel+hardware IO queues of all disks become saturated. (And indeed, I get great random-read numbers from fio(1) for this setup.)
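For reference, the kind of baseline I checked with fio(1) can be approximated with a small threaded pread(2) tester along these lines (a rough sketch, not my actual harness; /dev/md0, the span and the block size are placeholders, and reading the raw device needs appropriate permissions):

```c
/* Rough sketch of a fio-like baseline: N threads issuing random 4 KiB
 * O_DIRECT pread(2) calls against the md device, so iostat's aqu-sz can be
 * watched while the thread count grows. Device path, span and sizes are
 * placeholders. Build with: cc -O2 -pthread randread.c -o randread */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define DEV   "/dev/md0"            /* placeholder RAID0 array device */
#define BLOCK 4096ULL               /* read size; keeps O_DIRECT alignment */
#define SPAN  (512ULL << 30)        /* bytes of the device to sample from */
#define READS 65536                 /* reads per thread */

static void *worker(void *arg)
{
    unsigned seed = (unsigned)(uintptr_t)arg;

    int fd = open(DEV, O_RDONLY | O_DIRECT);
    if (fd < 0) { perror("open"); return NULL; }

    void *buf;
    if (posix_memalign(&buf, BLOCK, BLOCK) != 0) { close(fd); return NULL; }

    for (int i = 0; i < READS; i++) {
        /* Pick a random block-aligned offset inside SPAN. */
        unsigned long long blk =
            (((unsigned long long)rand_r(&seed) << 31) | rand_r(&seed))
            % (SPAN / BLOCK);
        if (pread(fd, buf, BLOCK, (off_t)(blk * BLOCK)) != (ssize_t)BLOCK)
            perror("pread");
    }

    free(buf);
    close(fd);
    return NULL;
}

int main(int argc, char **argv)
{
    int nthreads = argc > 1 ? atoi(argv[1]) : 8;
    if (nthreads < 1) nthreads = 1;

    pthread_t *t = calloc(nthreads, sizeof *t);
    for (int i = 0; i < nthreads; i++)
        pthread_create(&t[i], NULL, worker, (void *)(uintptr_t)(i + 1));
    for (int i = 0; i < nthreads; i++)
        pthread_join(t[i], NULL);

    free(t);
    return 0;
}
```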
But this isn't happening: when running the actual service, the MDRAID device seems to bottleneck far below that point, with the IO queue size (aqu-sz in iostat(1)) hovering at or below 1.0 at all times. Under this application's workload, the array behaves as if it had no hardware IO queue at all and could only complete one operation, for one thread, at a time.
The CPU is basically idle, so this isn't a CPU bottleneck; and my test harness tells the service to discard the data it reads rather than serve it back over the wire, so network throughput is not the bottleneck either.
LMDB does its reads through mmap(2). Could this be the problem? Does the Linux kernel internally serialize IO triggered by faulting in mmap(2)ed disk pages, or is something else going on?
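For illustration, this is roughly what a read looks like from the application's point of view (a simplified sketch of the access pattern, not LMDB's actual code; the file name is a placeholder):

```c
/* Simplified sketch of an mmap-based read path: the data file is mapped once
 * and a "read" is just a copy out of the mapping, so any non-resident page
 * costs a major fault that the faulting thread blocks on. Not LMDB's code;
 * the file name is a placeholder. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "data.mdb";        /* placeholder data file */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    /* One shared, read-only mapping of the whole file. */
    uint8_t *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    /* "Reading" a 4 KiB page in the middle of the file: if it is not in the
     * page cache, this memcpy triggers a page fault and waits for disk IO. */
    uint8_t page[4096];
    off_t offset = (st.st_size / 2) & ~((off_t)4095);
    memcpy(page, map + offset, sizeof page);
    printf("first byte: %d\n", page[0]);

    munmap(map, st.st_size);
    close(fd);
    return 0;
}
```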
I think this is indeed the problem.
I have seen similar issues with memory-mapped random reads from NVMe on Linux. I haven't pinned down the root cause just yet, and it might be related to me using an older long-term-support kernel (5.4) - maybe newer kernels behave differently. It does not seem to scale beyond one thread: when I use multiple threads, the IO appears to be serialised.
This paper mentions scaling issues with mmap and a shared lock being involved. Even read-only IO needs to take that lock, because faulting a page in modifies the page tables.
I am still running tests, but so far switching to pread(2) for random reads seems to give more consistent and scalable latencies as more threads are added - especially on fast devices like NVMe.
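Roughly, the change I'm testing looks like the sketch below (assumed file name and offsets, not a drop-in LMDB patch): each reader thread keeps its own file descriptor and fetches the page it wants with an explicit pread(2) into a private buffer, instead of dereferencing a shared mapping and faulting.

```c
/* Sketch of the explicit-read variant: per-thread fd, explicit pread(2) into
 * a private buffer, no page fault on a shared mapping. File name and offset
 * are placeholders; this is test code, not an LMDB patch. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* One lookup: read the 4 KiB page containing `offset` from the data file.
 * Each reader thread can call this on its own fd, so reads from different
 * threads turn into independent requests in the device queue. */
static int read_page(int fd, off_t offset, char page[4096])
{
    off_t aligned = offset & ~((off_t)4095);
    ssize_t n = pread(fd, page, 4096, aligned);
    return n == 4096 ? 0 : -1;
}

int main(void)
{
    const char *path = "data.mdb";        /* placeholder data file */
    int fd = open(path, O_RDONLY);        /* in practice: one fd per thread */
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    char page[4096];
    if (read_page(fd, st.st_size / 2, page) != 0) { perror("pread"); return 1; }
    printf("first byte: %d\n", (unsigned char)page[0]);

    close(fd);
    return 0;
}
```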