I am trying to train a CNN-based depth completion model (GitHub link) and am running into some general problems getting it to train.
My basic procedure is to downsample my depth and input, upsample the prediction bilinearly to the ground truth resolution, and calculate the MSE loss on pixels that have a depth value > 0 in the ground truth.
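For reference, the loss I compute looks roughly like this (a minimal sketch of my setup, not the exact code from the repository; the tensor names and shapes are assumptions):

```python
import torch
import torch.nn.functional as F

def masked_mse_loss(pred_lowres, gt_depth):
    """MSE on pixels with valid ground truth only.

    pred_lowres: (B, 1, h, w) network output at the downsampled resolution
    gt_depth:    (B, 1, H, W) sparse ground-truth depth, 0 where invalid
    """
    # Upsample the prediction bilinearly to the ground-truth resolution
    pred = F.interpolate(pred_lowres, size=gt_depth.shape[-2:],
                         mode='bilinear', align_corners=False)

    # Only pixels with a depth value > 0 contribute to the loss
    mask = gt_depth > 0
    return F.mse_loss(pred[mask], gt_depth[mask])
```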
Using the same model pre-trained on KITTI leads to reasonable predictions. My goal is to train the network from scratch on my own dataset.
Strangely, using more training data leads to even worse visual performance.
[Prediction after training on the smaller training set]
[Prediction after training on the larger training set]
My guess is that I am seeing some kind of convolution artifacts. Is that plausible?
One of my main concerns is that my ground truth is relatively sparse, so I don't have a value for every pixel in the target. Why would the network learn to predict smooth, complete output at all? Still, several papers seem to train successfully with exactly this kind of supervision.
How can I narrow down the problem with my training? In general, how do you tell whether the hyperparameters, the training implementation, the dataset, or the architecture is the problem?
I have also found that common metrics like RMSE, MAE, a1, a2, and a3 do not correlate well with the visual quality of the final depth maps. Are there better options?
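For completeness, I compute those metrics roughly like this (a sketch using the standard definitions on valid pixels; the 1.25^k thresholds for a1/a2/a3 are the usual convention, which I assume is also what the papers report):

```python
import torch

def depth_metrics(pred, gt):
    """RMSE, MAE and the delta thresholds a1/a2/a3 on valid pixels only."""
    valid = gt > 0
    pred, gt = pred[valid], gt[valid]

    rmse = torch.sqrt(torch.mean((pred - gt) ** 2))
    mae = torch.mean(torch.abs(pred - gt))

    # a_k = fraction of pixels with max(pred/gt, gt/pred) < 1.25**k
    ratio = torch.max(pred / gt, gt / pred)
    a1 = (ratio < 1.25).float().mean()
    a2 = (ratio < 1.25 ** 2).float().mean()
    a3 = (ratio < 1.25 ** 3).float().mean()

    return {'rmse': rmse, 'mae': mae, 'a1': a1, 'a2': a2, 'a3': a3}
```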
So far I have visually checked my data for some sequences, tried different hyperparameters (without a real strategy), tried different ways of normalizing my data, and tried training on inverse depth values.
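By "inverse depth values" I mean training on 1/d targets, roughly like this (a sketch; MAX_DEPTH is a dataset-specific constant I picked by hand):

```python
import torch

MAX_DEPTH = 80.0  # dataset-specific cap, chosen by hand

def to_inverse_depth(depth):
    """Map metric depth to inverse depth, keeping invalid pixels at 0."""
    inv = torch.zeros_like(depth)
    valid = depth > 0
    inv[valid] = 1.0 / depth[valid]
    return inv

def normalize_depth(depth):
    """Scale metric depth to roughly [0, 1] using a fixed maximum."""
    return depth.clamp(0, MAX_DEPTH) / MAX_DEPTH
```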
I would expect a model trained on my own data to give better results, both visually and metrically, than a model pre-trained on a different dataset.