Does it make sense to preserve the model with the lowest historical loss during neural network training?


I am a beginner with BP (backpropagation) neural networks, and I am trying to solve some physical and mechanical regression problems with them.

While learning about neural networks, I noticed that the loss value during training always decreases with oscillations. All of the learning materials I have encountered so far build the final model from the parameters of the last training iteration and use that model to evaluate the test set.

I tried storing the model with the lowest loss value reached during training and compared it with the final model on the test set. I found that the regression error of the stored "best" model was often smaller, sometimes by an order of magnitude.
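To make the question concrete, here is a minimal sketch of what I am doing, with toy data standing in for my physical/mechanical dataset (the network size and learning rate are just placeholders):

```python
import copy
import torch
import torch.nn as nn

# Toy regression data (hypothetical; stands in for the real dataset)
torch.manual_seed(0)
X = torch.rand(100, 3)
y = X.sum(dim=1, keepdim=True)

model = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

best_loss = float("inf")
best_state = copy.deepcopy(model.state_dict())  # snapshot of the best weights

for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
    # Keep a copy of the weights whenever the training loss reaches a new low
    if loss.item() < best_loss:
        best_loss = loss.item()
        best_state = copy.deepcopy(model.state_dict())

# Restore the best weights before evaluating on the test set
model.load_state_dict(best_state)
```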

Since I am new to BP neural networks and the method I used is very intuitive, yet almost none of the learning materials I have seen mention it as a way to improve final model performance, I would like to ask whether this "improvement" is spurious because I am overlooking some problem.

By the way, I am using Adam as the optimizer for my regression, and I find that no matter how many iterations I run, its loss is always smaller than that of SGD. Many papers I have read claim that SGD outperforms Adam at high iteration counts, but those papers generally deal with classification problems. For regression problems, which optimizer is more likely to perform better in most cases? It would also help if you could suggest optimizers other than Adam and SGD in the torch library that might be better suited to regression.

Thank you very much for your answers. Since I am not engaged in machine learning research, I only need to use a BP neural network for some regression work. There may be many problems with how I have expressed the questions above, and corrections are very welcome.



Answer by Muhammed Yunus:

"I tried storing the model with the lowest loss value reached during training and compared it with the final model on the test set. I found that the regression error of the stored 'best' model was often smaller, sometimes by an order of magnitude."

The method you describe is similar to early stopping: when the validation loss (or accuracy) stops improving, training is stopped, and you keep the model that achieved the best validation score. Because the score curves are rarely smooth, it is not always clear whether the model is actually getting worse, so you can define a "patience" term that allows the model some number of epochs to recover before stopping.
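A minimal sketch of early stopping with patience, tracked on a held-out validation split (the data, network size, and patience value below are hypothetical placeholders):

```python
import copy
import torch
import torch.nn as nn

# Hypothetical noisy regression data, split into train and validation sets
torch.manual_seed(0)
X = torch.rand(120, 3)
y = X.sum(dim=1, keepdim=True) + 0.05 * torch.randn(120, 1)
X_train, y_train = X[:90], y[:90]
X_val, y_val = X[90:], y[90:]

model = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()

patience = 20          # epochs to wait for improvement before stopping
best_val = float("inf")
best_state = copy.deepcopy(model.state_dict())
epochs_without_improvement = 0

for epoch in range(500):
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(X_train), y_train)
    loss.backward()
    optimizer.step()

    model.eval()
    with torch.no_grad():
        val_loss = loss_fn(model(X_val), y_val).item()

    if val_loss < best_val:
        best_val = val_loss
        best_state = copy.deepcopy(model.state_dict())
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement >= patience:
            break  # validation loss has not improved for `patience` epochs

# Use the best validation-scoring weights, not the last ones
model.load_state_dict(best_state)
```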

I think it can help to use a model that is a bit larger than you actually need when using early stopping. Early stopping helps prevent a large model from overfitting, but it cannot help a model that is too small to begin with.

"By the way, I am using Adam as the optimizer for my regression, and I find that no matter how many iterations I run, its loss is always smaller than that of SGD. Many papers I have read claim that SGD outperforms Adam at high iteration counts, but those papers generally deal with classification problems. For regression problems, which optimizer is more likely to perform better in most cases? It would also help if you could suggest optimizers other than Adam and SGD in the torch library that might be better suited to regression."

My impression is that Adam usually works well across a range of problems and is a good default choice. If your network has trouble learning, or you have time to experiment, it is easy to swap Adam for something else and see whether it does better.
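Swapping optimizers in torch is a one-line change, since they all share the same interface. A few commonly tried alternatives are sketched below (the learning rates are just illustrative defaults, not tuned values):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(3, 16), nn.ReLU(), nn.Linear(16, 1))

# Each of these is a drop-in replacement; only the constructor line changes.
optimizers = {
    "Adam":    torch.optim.Adam(model.parameters(), lr=1e-3),
    "AdamW":   torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2),
    "SGD":     torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9),
    "RMSprop": torch.optim.RMSprop(model.parameters(), lr=1e-3),
    # LBFGS is sometimes worth trying on small regression problems,
    # but note its step() requires a closure that recomputes the loss.
    "LBFGS":   torch.optim.LBFGS(model.parameters(), lr=0.1),
}

# Example of one training step with any of the first-order optimizers:
opt = optimizers["AdamW"]
x, y = torch.rand(8, 3), torch.rand(8, 1)
opt.zero_grad()
loss = nn.MSELoss()(model(x), y)
loss.backward()
opt.step()
```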

You'll often need to experiment to see what works with your data and for your specific task.