I know that modern computers use DMA (over PCIe, and maybe other buses?) to transfer data between RAM and devices. When such a command is issued, the memory (or I/O) controller handles it, and the CPU is free to do other work until the controller reports that the operation has completed.
Now I wonder why we don't use DMA for regular memory-to-memory copies in userland programs. I know that memory is fast nowadays, but the CPU still spends a lot of time on buffer copies, which are just sequences of instructions (mostly in C's memcpy and its variants) looping over a source address and a destination address: load from the source, store to the destination, and increment both pointers, <buffer length> times. Not only does the data have to make a round trip between RAM and the CPU (first to be read from the source, then to be written to the destination), which may make the copy slower (as RAM is not on the CPU chip), but the CPU also spends cycles on that loop for no particular reason instead of doing something else. Since a copy does not transform the data, the RAM already holds all the necessary inputs and could, in principle, do it alone.
So my question is: why don't we use DMA for regular RAM-to-RAM buffer copies?
A memory copy such as this one (a naive implementation in C):
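void *memcpy(void *dest, const void *src, size_t n)
{
    // Copy one byte per iteration; the CPU performs every load and store itself.
    for (size_t i = 0; i < n; i++)
        ((char *)dest)[i] = ((const char *)src)[i];
    return dest; // the standard memcpy returns the destination pointer
}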
would become something like:
void *memcpy(void *dest, const void *src, size_t n)
{
    // Call a (hypothetical) kernel-provided function that sets up a DMA command.
    // It would likely have to translate virtual addresses to physical addresses
    // in the kernel, since the RAM controller is not aware of virtual memory mappings.
    int copy_token = dma_copy(dest, src, n); // Would be O(1): the RAM controller handles the copy, so the CPU is freed immediately
    /* ... do some work ... */
    dma_wait_finish(copy_token); // Block the thread until the DMA controller reports a 'completed' event for the copy
    /* do something else after the copy has finished */
    return dest;
}
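To make the intended call pattern concrete, here is a minimal runnable sketch in which a worker thread stands in for the DMA engine. Everything in it is an assumption made for illustration: `dma_copy`, `dma_wait_finish`, and `copy_handle` are hypothetical names rather than any real kernel or libc API, and the token is a pointer instead of the `int` used above. Because the worker thread still runs on a CPU, this only shows the "start, do other work, wait" shape of the idea, not actual offloading to a memory controller.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical handle for an in-flight copy. A worker thread stands in
 * for the DMA engine; none of these names belong to a real kernel API. */
typedef struct {
    pthread_t worker;
    void *dest;
    const void *src;
    size_t n;
} copy_handle;

/* Body of the stand-in "engine": performs the copy off the calling thread. */
static void *copy_worker(void *arg)
{
    copy_handle *h = arg;
    memcpy(h->dest, h->src, h->n);
    return NULL;
}

/* Start the copy and return immediately; returns NULL on failure. */
static copy_handle *dma_copy(void *dest, const void *src, size_t n)
{
    copy_handle *h = malloc(sizeof *h);
    if (h == NULL)
        return NULL;
    h->dest = dest;
    h->src = src;
    h->n = n;
    if (pthread_create(&h->worker, NULL, copy_worker, h) != 0) {
        free(h);
        return NULL;
    }
    return h;
}

/* Block until the copy has completed, then release the handle.
 * Returns 0 on success, like pthread_join. */
static int dma_wait_finish(copy_handle *h)
{
    int rc = pthread_join(h->worker, NULL);
    free(h);
    return rc;
}

/* Usage: start a copy, do unrelated work, then wait. Returns 1 on success. */
int demo_async_copy(void)
{
    char src[64] = "payload copied by the stand-in engine";
    char dest[64] = {0};

    copy_handle *h = dma_copy(dest, src, sizeof src);
    if (h == NULL)
        return 0;

    /* ... unrelated work that touches neither buffer ... */
    unsigned long sum = 0;
    for (unsigned long i = 0; i < 1000; i++)
        sum += i;

    if (dma_wait_finish(h) != 0)
        return 0;
    return sum == 499500UL && memcmp(dest, src, sizeof src) == 0;
}
```

Neither buffer may be read or written between `dma_copy` and `dma_wait_finish`, which is the same constraint the real scheme would impose on the caller. On older toolchains the sketch may need to be compiled with `-pthread`.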
I am not really into kernel programming, and I guess that most kernels (Linux, BSD, Windows) have a DMA API for I/O, but I don't understand why we don't do the same for memory-to-memory copies in userland programs. Here the DMA call would happen inside libc's memcpy function. Wouldn't a buffer copy handled by the RAM controller be better than the classic CPU-driven copy?