I recently learned (initially from here) how to use mmap to quickly read a file in C, as in this example code:
// main.c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#define INPUT_FILE "test.txt"
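int main(int argc, char* argv[]) {
    struct stat ss;
    if (stat(INPUT_FILE, &ss)) {
        fprintf(stderr, "stat err: %d (%s)\n", errno, strerror(errno));
        return -1;
    }
    int fd = open(INPUT_FILE, O_RDONLY);
    if (fd < 0) {
        fprintf(stderr, "open err: %d (%s)\n", errno, strerror(errno));
        return -1;
    }
    char* mapped = mmap(NULL, ss.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (mapped == MAP_FAILED) {
        fprintf(stderr, "mmap err: %d (%s)\n", errno, strerror(errno));
        close(fd);
        return -1;
    }
    close(fd); // the mapping stays valid after the descriptor is closed
    fprintf(stdout, "%s\n", mapped); // treats the mapping as a C string
    munmap(mapped, ss.st_size);
    return 0;
}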
My understanding is that this use of mmap returns a pointer to a heap-allocated region of length bytes (here, ss.st_size).
I've tested this on plain text files that are not explicitly null-terminated, e.g. a file containing the 13-byte ASCII string "hello, world!":
$ cat ./test.txt
hello, world!$
$ stat ./test.txt
File: ./test.txt
Size: 13 Blocks: 8 IO Block: 4096 regular file
Device: 810h/2064d Inode: 52441 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 1000/ user) Gid: ( 1000/ user)
Access: 2022-10-25 20:30:52.563772200 -0700
Modify: 2022-10-25 20:30:45.623772200 -0700
Change: 2022-10-25 20:30:45.623772200 -0700
Birth: -
When I run my compiled code, it never segfaults or spews garbage -- the classic symptoms of printing an unterminated C-string.
When I run my executable through gdb, mapped[13] is always '\0'.
Is this undefined behavior?
I can't see how the bytes that are memory-mapped from the input file could be reliably null-terminated.
For a 13-byte string, the "equivalent" that I would have normally done with malloc and read would be to allocate a 14-byte array, read from file to memory, then explicitly set byte 13 (0-based) to '\0'.
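For comparison, the malloc-and-read approach described above might look roughly like the sketch below. The helper name read_file_to_string is hypothetical, and the sketch assumes a regular file whose size does not change while it is being read; it also does not retry on EINTR.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read the whole file at `path` into a freshly
 * malloc'ed buffer that is one byte larger than the file, and terminate
 * it explicitly. Returns NULL on error; the caller frees the result. */
char *read_file_to_string(const char *path, size_t *len_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }

    size_t size = (size_t)st.st_size;
    char *buf = malloc(size + 1);          /* +1 for the terminator */
    if (buf == NULL) { close(fd); return NULL; }

    size_t total = 0;
    while (total < size) {                 /* read() may return short counts */
        ssize_t n = read(fd, buf + total, size - total);
        if (n < 0) { free(buf); close(fd); return NULL; }
        if (n == 0) break;                 /* unexpected early end of file */
        total += (size_t)n;
    }
    close(fd);

    buf[total] = '\0';                     /* byte 13 for a 13-byte file */
    if (len_out != NULL) *len_out = total;
    return buf;
}
```

Calling read_file_to_string("test.txt", &len) on the 13-byte file would give a 14-byte buffer whose last byte is the explicit '\0', so it is safe to pass to printf with %s.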
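mmap returns a pointer to whole pages mapped by the kernel. It doesn't go through malloc. Pages are usually 4096 bytes each, and the kernel fills the part of the last page that lies beyond the end of the file with zeroes, not with garbage; POSIX requires this zero-fill for a partial final page. That is why mapped[13] reads as '\0' here. Don't rely on it, though: if the file size is an exact multiple of the page size, there is no zero-filled tail, and the byte after the data lies outside the mapping, so reading it may crash or return unrelated data.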
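A minimal sketch of how to avoid depending on the zero-filled tail: write exactly st_size bytes instead of treating the mapping as a C string. The helper names write_mapped_file and has_zero_filled_tail are hypothetical and only illustrate the idea; the page size would normally come from sysconf(_SC_PAGESIZE).

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if a file of `size` bytes ends in a
 * partial page (so the mapping has a zero-filled tail), else 0. */
int has_zero_filled_tail(size_t size, size_t page_size) {
    return size % page_size != 0;
}

/* Hypothetical helper: map `path` read-only and write exactly its
 * st_size bytes to `out`. Returns the byte count, or -1 on error. */
long write_mapped_file(const char *path, FILE *out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }   /* mmap rejects length 0 */

    size_t len = (size_t)st.st_size;
    char *mapped = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                       /* the mapping outlives the descriptor */
    if (mapped == MAP_FAILED) return -1;

    /* Length-bounded write: never reads past the end of the file data. */
    size_t written = fwrite(mapped, 1, len, out);
    munmap(mapped, len);
    return written == len ? (long)written : -1;
}
```

With this, write_mapped_file("test.txt", stdout) prints the 13 bytes of the file regardless of what follows them in memory. An equivalent one-liner inside the original program would be printf("%.*s\n", (int)ss.st_size, mapped).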