"packet_write_wait: Connection to xxx.xx.x.xxx port 22: Broken pipe" Error During Neural Network Training via SSH

I'm having trouble training a neural network on a remote server over SSH, using either PuTTY or Visual Studio Code. A few minutes into my Python training script, after roughly one and a half epochs, the connection drops with the following error:

packet_write_wait: Connection to xxx.xx.x.xxx port 22: Broken pipe

I've tried several fixes found online, adding each of the following blocks to my SSH configuration (one at a time, not all at once):

Host *
    IPQoS throughput

Host *
    ServerAliveInterval 20
    TCPKeepAlive no

Host *
    ServerAliveInterval 60
    ServerAliveCountMax 10

None of these changes stopped the SSH connection from dropping.

How can I resolve this without sudo privileges?

There is one answer below.

Assuming the training runs on a Linux machine, one option is to run the code inside a Screen session (see the Screen man page). This does not stop the connection from dropping, but anything you run inside the session keeps running even if you get disconnected or log out of the machine.

ssh you@remote
screen -S training # Start a screen session called "training"
python main.py # Start your training code

You can then detach from the Screen session by pressing Ctrl + A followed by Ctrl + D (or just D), and freely log out of the remote machine.

To return to the Screen session, SSH into the machine again and run screen -r training.
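
The same workflow can also be scripted so the training starts already detached and writes its output to a file. This is a minimal sketch, assuming screen is already installed on the server. The helper function names, the SESSION, SCRIPT and LOG variables, and the log file name training.log are placeholders chosen for illustration, not part of the original answer.

```shell
#!/bin/sh
# Sketch only: SESSION, SCRIPT and LOG are placeholder values (assumptions).
SESSION="training"
SCRIPT="main.py"
LOG="training.log"

# Start the training inside a detached Screen session.
# -d -m creates the session without attaching to it; -S gives it a name.
start_training() {
    screen -dmS "$SESSION" sh -c "python $SCRIPT > $LOG 2>&1"
}

# Exit status 0 while a session with that name appears in `screen -ls`.
training_running() {
    screen -ls | grep -q "[.]$SESSION[[:space:]]"
}

# Reattach to the session to watch the training interactively.
attach_training() {
    screen -r "$SESSION"
}
```

After logging in with ssh you@remote, source the file and call start_training once. If the connection breaks, log in again and use training_running to check that the session still exists, tail -f training.log to follow the output, or attach_training to reattach.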