I'm training a neural network on a remote server over SSH, using either PuTTY or Visual Studio Code. A few minutes after I start my Python training script, roughly one and a half epochs in, the connection drops with the following error:
packet_write_wait: Connection to xxx.xx.x.xxx port 22: Broken pipe
I've tried several solutions found online and made some changes to my SSH configuration, but so far, none of them seem to resolve the issue. Here are some of the solutions I've attempted:
Adding the following lines to my SSH configuration (not all at once):
Host *
    IPQoS=throughput

Host *
    ServerAliveInterval 20
    TCPKeepAlive no

Host *
    ServerAliveInterval 60
    ServerAliveCountMax 10
However, none of these changes has stopped the SSH connection from being terminated prematurely.
How can I resolve this problem without 'sudo' privileges?
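Assuming the training runs on a Linux machine, one alternative is to run the code inside a Screen session (see the screen man page). This does not stop the connection from dropping, but anything you start inside the session keeps running even if you get disconnected or log out of the machine. Running screen needs no sudo privileges, provided it is already installed on the server.
Start a named session with
screen -S training
and launch your training script inside it. You can then detach from the session by pressing Ctrl + A followed by Ctrl + D, and freely log out of the remote.
To go back to the Screen session, ssh to the machine again and run
screen -r training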
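The Screen workflow described above could be wrapped in two small shell helpers, as a rough sketch. The session name training matches the command above; the script name train.py, the log file train.log, and the function names are placeholders chosen for illustration, not anything from the original setup. This variant starts the session already detached, so a dropped connection cannot interrupt the launch itself.

```shell
#!/bin/sh
# Sketch only: train.py, train.log and the function names are placeholders.
SESSION=training

# Start the job in a new screen session that is detached from the outset.
# -d -m: create the session without attaching to it; -S: give it a name.
start_training() {
    screen -dmS "$SESSION" sh -c 'python train.py > train.log 2>&1'
}

# After reconnecting over SSH: list the sessions, then reattach by name.
resume_training() {
    screen -ls
    screen -r "$SESSION"
}
```

Source the file, call start_training once after logging in, and call resume_training after any later reconnect. Progress can also be followed without attaching by reading train.log.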