I'm currently training models in PyTorch with the DataLoader's `num_workers` set greater than 0. The shared memory strategy is set to `file_system`, and the processor in use is an NPU.
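Roughly, the relevant part of the setup looks like the sketch below (the dataset, batch size, and worker count are placeholders for my actual ones):

import torch
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset

# CPU tensors are shared between processes via named shared-memory files
# that are tracked by the torch_shm_manager daemon
mp.set_sharing_strategy('file_system')

dataset = TensorDataset(torch.randn(10000, 3, 224, 224))  # placeholder dataset
loader = DataLoader(
    dataset,
    batch_size=32,   # placeholder
    num_workers=8,   # > 0, so worker processes send batches back through shared memory
)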
The models have been trained in both single-node multi-card and multi-node multi-card settings. After running for a certain period of time, the following error appears:
Traceback (most recent call last):
  File "/home/xxx/anaconda3/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/home/xxx/anaconda3/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protocol).dump(obj)
  File "/home/xxx/anaconda3/lib/python3.10/site-packages/torch/multiprocessing/reductions.py", line 417, in reduce_storage
    metadata = storage._share_filename_cpu_()
  File "/home/xxx/anaconda3/lib/python3.10/site-packages/torch/storage.py", line 297, in wrapper
    return fn(self, *args, **kwargs)
  File "/home/xxx/anaconda3/lib/python3.10/site-packages/torch/storage.py", line 334, in _share_filename_cpu_
    return super()._share_filename_cpu_(*args, **kwargs)
RuntimeError: Shared memory manager connection has timed out
I searched the PyTorch source for this error and found that it is thrown in the recv function in torch/lib/libshm/socket.h:
class Socket {
 public:
  int socket_fd;

 protected:
  void recv(void* _buffer, size_t num_bytes) {
    char* buffer = (char*)_buffer;
    size_t bytes_received = 0;
    ssize_t step_received;
    struct pollfd pfd = {0};
    pfd.fd = socket_fd;
    pfd.events = POLLIN;
    while (bytes_received < num_bytes) {
      // poll waits at most 1000 ms for the manager to become readable
      SYSCHECK_ERR_RETURN_NEG1(poll(&pfd, 1, 1000));
      if (pfd.revents & POLLIN) {
        SYSCHECK_ERR_RETURN_NEG1(
            step_received =
                ::read(socket_fd, buffer, num_bytes - bytes_received));
        if (step_received == 0)
          throw std::runtime_error("Other end has closed the connection");
        bytes_received += step_received;
        buffer += step_received;
      } else if (pfd.revents & (POLLERR | POLLHUP)) {
        throw std::runtime_error(
            "An error occurred while waiting for the data");
      } else {
        // poll timed out with no data: this is the error in the traceback above
        throw std::runtime_error(
            "Shared memory manager connection has timed out");
      }
    }
  }
};
Under the `file_system` sharing strategy, PyTorch spawns torch_shm_manager daemon processes to clean up shared-memory files and prevent leaks. My guess is that the timeout occurs when a client communicates with the torch_shm_manager daemon, but it is still not clear why the connection times out. I tried passing `persistent_workers=True` to the DataLoader (see the sketch below); it seems to reduce the probability of the problem occurring, but I don't know whether it completely solves the issue.
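For reference, the mitigation I tried looks like this (same placeholder dataset and worker count as in the sketch above):

loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=8,
    persistent_workers=True,  # keep worker processes alive across epochs instead of respawning them
)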