userfaultfd register fails with ENOMEM after execv


I have a program (parent.rs) that will:

  • Fork
  • Create a userfaultfd on the child, then transfer it to the parent (with pidfd_getfd).
  • Exec the child (child.rs)
  • Child allocates memory with mmap and "sends" it to the parent
  • Parent receives the memory pointer and tries to register it with the transferred userfaultfd object

Unfortunately this fails at the last step with ENOMEM.

I'm aware that, due to virtual memory, the child's pointer will not be valid in the parent. But I expect it to be valid for the userfaultfd handle I have, since that handle was created in the child before the exec and, because it was created without O_CLOEXEC, should remain open afterwards (I've verified this by reading /proc/{child_pid}/fd).
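For reference, the /proc/{child_pid}/fd check mentioned above can be done along these lines. This is only a minimal sketch: `open_fds` is a helper name made up for illustration and is not part of parent.rs. A userfaultfd normally shows up with a link target such as anon_inode:[userfaultfd].

```rust
use std::{fs, io, path::PathBuf};

/// Lists the open descriptors of process `pid` ("self" also works) as
/// (fd number, link target) pairs, read from /proc/<pid>/fd.
/// Hypothetical helper, not part of the program in the question.
fn open_fds(pid: &str) -> io::Result<Vec<(i32, PathBuf)>> {
    let mut fds = Vec::new();
    for entry in fs::read_dir(format!("/proc/{}/fd", pid))? {
        let entry = entry?;
        // Entry names are decimal fd numbers; anything else is ignored.
        let fd: Option<i32> = entry.file_name().to_str().and_then(|name| name.parse().ok());
        // A descriptor may be closed between listing and readlink; skip it then.
        if let (Some(fd), Ok(target)) = (fd, fs::read_link(entry.path())) {
            fds.push((fd, target));
        }
    }
    Ok(fds)
}
```

Calling `open_fds(&child_pid.to_string())` after the exec and looking for the fd number received over the pipe is enough to confirm that the descriptor survived.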

Cargo.toml:

[package]
name = "..."
version = "0.1.0"
edition = "2021"

[dependencies]
nix = "0.26.2"
pidfd = "0.2.4"
pidfd_getfd = { version = "0.2.1", features = ["nightly"] }
pipe-channel = "1.3.0"
rustix = { version = "0.37.3", features = ["mm"] }
userfaultfd = { version = "0.5.1", features = ["linux4_14", "linux5_7"] }

examples/parent.rs:

use {
    nix::unistd,
    rustix::fd::{AsRawFd, FromRawFd},
    std::{
        ffi::{self, CString},
        mem,
    },
    userfaultfd::Uffd,
};

fn main() {
    let child_name = std::env::args().nth(1).expect("Expected argument");

    // Fork and execute the child
    let (mut uffd_tx, mut uffd_rx) = pipe_channel::channel();
    let (mut ready_tx, mut ready_rx) = pipe_channel::channel();
    let child_pid = match unsafe { unistd::fork() }.expect("Unable to fork") {
        unistd::ForkResult::Parent { child } => child,
        unistd::ForkResult::Child => {
            // Open the uffd and send it to the parent
            // Note: We forget it so it doesn't get closed.
            let uffd = userfaultfd::UffdBuilder::new()
                .close_on_exec(false)
                .user_mode_only(true)
                .non_blocking(false)
                .create()
                .expect("Unable to create uffd");
            uffd_tx.send(uffd.as_raw_fd()).expect("Unable to send uffd");
            mem::forget(uffd);

            // Wait until the monitor process is ready
            ready_rx.recv().expect("Unable to wait for parent");

            // Then execute the child
            println!("Executing child");
            let path = CString::new(child_name.clone()).unwrap();
            let args = [CString::new(child_name).unwrap()];
            unistd::execv(&path, &args).expect("Unable to `execv`");

            unreachable!();
        },
    };

    // Open a pid_fd for the child process
    let child_pidfd =
        unsafe { pidfd::PidFd::open(child_pid.as_raw(), 0) }.expect("Unable to allocate pidfd for child process");

    // Receive the uffd from the child
    let child_uffd_fd = uffd_rx.recv().expect("Unable to receive uffd");
    let uffd_fd = unsafe { pidfd_getfd::pidfd_getfd(child_pidfd.as_raw_fd(), child_uffd_fd, 0) };
    let uffd = unsafe { Uffd::from_raw_fd(uffd_fd) };

    // Tell the child we're ready to execute
    ready_tx.send(()).expect("Unable to send parent event");

    // Then read the pointer it wrote
    std::thread::sleep(std::time::Duration::from_secs(1));
    let page = std::fs::read("ptr").expect("Unable to read pointer");
    let page = page.try_into().expect("File wasn't the right size");
    let page = usize::from_le_bytes(page);
    let page = page as *mut ffi::c_void;

    // Prove the pointer is on the process's maps
    let memory_map =
        std::fs::read_to_string(format!("/proc/{}/maps", child_pid.as_raw())).expect("Unable to read memory maps");
    assert!(memory_map.contains(&format!("{:x}", page as usize)));

    // Then try to register
    uffd.register(page, 4096).expect("Unable to register dummy pointer");
}
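The substring assert at the end of parent.rs is only a rough check: it passes whenever the hex digits of the pointer appear anywhere in the maps text. A stricter variant, sketched here with a made-up helper `maps_contains` (not part of the program above), parses the start-end range at the beginning of each line and tests whether the address falls inside it.

```rust
/// Returns true if `addr` lies inside one of the mappings listed in `maps`,
/// where `maps` is the text of /proc/<pid>/maps.
/// Hypothetical helper, not part of the program in the question.
fn maps_contains(maps: &str, addr: usize) -> bool {
    maps.lines()
        // The first whitespace-separated field of each line is "start-end".
        .filter_map(|line| line.split_whitespace().next())
        .filter_map(|range| range.split_once('-'))
        // Both bounds are hexadecimal without a 0x prefix; the end is exclusive.
        .filter_map(|(start, end)| {
            let start = usize::from_str_radix(start, 16).ok()?;
            let end = usize::from_str_radix(end, 16).ok()?;
            Some(start..end)
        })
        .any(|range| range.contains(&addr))
}
```

In parent.rs this would replace the assert with `assert!(maps_contains(&memory_map, page as usize));`. It still only shows that the address is mapped in the child at the time of the read, nothing more.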

examples/child.rs:

use {rustix::mm, std::ptr};

pub fn main() {
    // Allocate the page
    println!("Child: Allocating");
    let page = unsafe {
        mm::mmap_anonymous(
            ptr::null_mut(),
            4096,
            mm::ProtFlags::READ | mm::ProtFlags::WRITE,
            mm::MapFlags::PRIVATE,
        )
        .expect("Unable to allocate page")
    };

    // Write to file
    println!("Child: Writing to file");
    std::fs::write("ptr", (page as usize).to_le_bytes()).expect("Unable to write");

    println!("Child: Sleeping");
    loop {
        std::thread::park();
    }
}

This is run with cargo build --examples && ./target/debug/examples/parent ./target/debug/examples/child

(Note that the uffd handler isn't included here. It doesn't work even with the handler; I removed it to make the MCVE smaller.)

How can I make userfaultfd actually work here and register on the child?

If it helps, I'm using ptrace in my actual application, so I can control the child much more easily, if I need to do it at a specific time.

I have currently found a very hacky workaround using LD_PRELOAD: the uffd object is created after execve in a shared library loaded via LD_PRELOAD, then transferred to the parent through pipes (whose fds are passed to the library in environment variables). Unfortunately this is not great because:

  • I need to use the non-usermode + fork features of uffd, which require the cap_sys_ptrace capability. In turn, this capability disables the ability to LD_PRELOAD. I've heard that if the library has the setuid bit and is owned by root, it should still work, but I haven't been able to get that working. I've managed to fix this by setting the cap_sys_ptrace capability on the parent, then raising cap_sys_ptrace in the Inheritable and Ambient capability sets (still on the parent). This preserves the capability after execv without disabling the ability to load an LD_PRELOAD library.
  • It delays the creation of the uffd. In my use case, I'm interested in tracking allocations, and some do happen before LD_PRELOAD initialization. This isn't a major issue, but it does complicate the design of the program, as now we need to go back and check all allocations that happened before the uffd was created.
  • It's a very hacky and insecure method, which could easily break and/or have security vulnerabilities.
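To check that cap_sys_ptrace really is preserved across execv with the Inheritable/Ambient setup from the first point, the capability masks in /proc/{child_pid}/status can be inspected. The following is a minimal sketch; `cap_is_set` is a helper name made up for illustration, and the value 19 for CAP_SYS_PTRACE is taken from linux/capability.h.

```rust
/// Linux capability number of CAP_SYS_PTRACE (from <linux/capability.h>).
const CAP_SYS_PTRACE: u32 = 19;

/// Looks up a capability set such as "CapInh", "CapEff" or "CapAmb" in the
/// text of /proc/<pid>/status and reports whether bit `cap` is set.
/// Returns None if the line is missing or its value is not valid hex.
/// Hypothetical helper, not part of the program in the question.
fn cap_is_set(status: &str, set: &str, cap: u32) -> Option<bool> {
    let value = status.lines().find_map(|line| {
        // Lines look like "CapEff:\t0000000000080000".
        let (name, value) = line.split_once(':')?;
        (name == set).then(|| value.trim())
    })?;
    let mask = u64::from_str_radix(value, 16).ok()?;
    Some(mask & (1u64 << cap) != 0)
}
```

Reading /proc/{child_pid}/status after the exec and calling `cap_is_set(&status, "CapEff", CAP_SYS_PTRACE)` (and the same for "CapAmb") shows whether the capability made it into the new program.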