PyTorch RuntimeError: Shared memory manager connection has timed out

I'm currently training models with PyTorch, with the DataLoader's num_workers set greater than 0 and the multiprocessing sharing strategy set to `file_system`. The processor in use is an NPU.
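
For reference, a minimal sketch of the setup (the dataset, shapes, batch size, and worker count below are placeholders rather than the real training code):

import torch
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset

# Use the file_system sharing strategy instead of the default file_descriptor one.
mp.set_sharing_strategy('file_system')

def main():
    # Placeholder dataset; the real data is loaded from disk.
    dataset = TensorDataset(torch.randn(1000, 3, 32, 32), torch.randint(0, 10, (1000,)))

    loader = DataLoader(
        dataset,
        batch_size=32,
        shuffle=True,
        num_workers=8,  # > 0, so batches are pickled and passed through shared memory
    )

    for images, labels in loader:
        pass  # forward/backward pass happens here in the real code

if __name__ == "__main__":
    main()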

The models have been trained in both single-node multi-card and multi-node multi-card settings. After running for a certain period of time, the following error appears:

Traceback (most recent call last):
  File "/home/xxx/anaconda3/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/home/xxx/anaconda3/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/home/xxx/anaconda3/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 417, in reduce_storage
    metadata = storage._share_filename_cpu_()
  File "/home/xxx/anaconda3/lib/python3.10/site-packages/torch/storage.py", line 297, in wrapper
    return fn(self, *args, **kwargs)
  File "/home/xxx/anaconda3/lib/python3.10/site-packages/torch/storage.py", line 334, in _share_filename_cpu_
    return super()._share_filename_cpu_(*args, **kwargs)
RuntimeError: Shared memory manager connection has timed out

I searched the PyTorch source for this error and found that it is raised in the recv function in torch/lib/libshm/socket.h:

class Socket {
 public:
  int socket_fd;

 protected:
  void recv(void* _buffer, size_t num_bytes) {
    char* buffer = (char*)_buffer;
    size_t bytes_received = 0;
    ssize_t step_received;
    struct pollfd pfd = {0};
    pfd.fd = socket_fd;
    pfd.events = POLLIN;
    while (bytes_received < num_bytes) {
      SYSCHECK_ERR_RETURN_NEG1(poll(&pfd, 1, 1000)); // waits at most 1000 ms for data
      if (pfd.revents & POLLIN) {
        SYSCHECK_ERR_RETURN_NEG1(
            step_received =
                ::read(socket_fd, buffer, num_bytes - bytes_received));
        if (step_received == 0)
          throw std::runtime_error("Other end has closed the connection");
        bytes_received += step_received;
        buffer += step_received;
      } else if (pfd.revents & (POLLERR | POLLHUP)) {
        throw std::runtime_error(
            "An error occurred while waiting for the data");
      } else {
        throw std::runtime_error(
            "Shared memory manager connection has timed out");
      }
    }
  }
};

Under the file_system sharing strategy, PyTorch spawns torch_shm_manager daemon processes to clean up shared-memory files and prevent leaks. Judging from the code above, the client polls its socket to the manager with a 1000 ms timeout and raises this error when no data arrives in time, so my guess is that the client connection to the torch_shm_manager daemon is timing out, but it is still not clear why the manager fails to respond. I tried passing persistent_workers=True to the DataLoader; this seems to reduce how often the problem occurs, but I don't know whether it completely solves the issue.
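
For completeness, this is roughly the change I made (reusing the placeholder dataset from the sketch above):

from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,                  # same placeholder dataset as above
    batch_size=32,
    num_workers=8,
    persistent_workers=True,  # keep worker processes alive across epochs instead of respawning them
)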
