Inconsistent MLPRegressor results across environments


I have trained an MLPRegressor with scikit-learn in Python. After training, the model is exported to ONNX format. Training happens locally on an ARM (Apple M1) processor; in production the model is deployed in a container on x86 and executed with a CPU-only ONNX runtime. In some cases the model gives wildly different results between environments, far beyond what round-off errors or small differences in floating-point implementations could explain. Some observations:
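
For reference, the export and inference steps look roughly like this. This is only a minimal sketch with placeholder data, hyperparameters, and file names, and it assumes the standard skl2onnx converter; the real pipeline differs only in those details.

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.neural_network import MLPRegressor
    from skl2onnx import convert_sklearn
    from skl2onnx.common.data_types import FloatTensorType
    import onnxruntime as ort

    # Placeholder data; the real training set is loaded elsewhere.
    X_train, y_train = make_regression(n_samples=500, n_features=10, random_state=0)

    # --- local training and export (ARM M1) ---
    model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0)
    model.fit(X_train, y_train)

    onnx_model = convert_sklearn(
        model,
        initial_types=[("input", FloatTensorType([None, X_train.shape[1]]))],
    )
    with open("model.onnx", "wb") as f:
        f.write(onnx_model.SerializeToString())

    # --- inference in the production container (x86, CPU-only onnxruntime) ---
    sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
    pred = sess.run(None, {"input": X_train[:5].astype(np.float32)})[0]
    print(pred)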

  • When the model is trained from scratch at runtime in production using the same training data, it produces the same results as the local model. Transferring the locally trained model to production as an ONNX file, however, produces wildly different results.
  • I have trained multiple instances of the model with different training data sets, and in a small number of cases (2 out of approx. 50) the model always gives the same results in all environments, whether training again at runtime in production or transferring the locally trained ONNX files.
  • There is no correlation between identical model files and identical results. For some instances, the model produces identical results in both environments even though the ONNX files differ (as determined by a simple SHA checksum). In other cases the model produces different results even though the ONNX files are identical between environments.
  • There is no correlation between the architecture on which the model is trained and run, and producing consistent results. In some cases, training on ARM and running on ARM produces the same result as training on ARM and running on x86. In other cases, training on ARM and running on ARM produces different results than training on ARM and running on x86.
  • I have replaced ONNX with plain Python pickle for serialization, but it makes no difference (see the comparison sketch after this list).
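
To pin down where the divergence enters, I run the following check on both machines with an identical, fixed batch of probe inputs. Again a sketch only: the file names, the input tensor name "input", and the tolerances are placeholders for whatever the exported graph actually uses.

    import pickle
    import numpy as np
    import onnxruntime as ort

    # A fixed batch of inputs saved once and shipped to both environments,
    # so both machines score exactly the same float32 values.
    X_probe = np.load("probe_inputs.npy").astype(np.float32)

    # Predictions from the pickled scikit-learn model.
    with open("model.pkl", "rb") as f:
        sk_model = pickle.load(f)
    sk_pred = np.asarray(sk_model.predict(X_probe), dtype=np.float64)

    # Predictions from the exported ONNX model on the same inputs.
    sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
    onnx_pred = sess.run(None, {"input": X_probe})[0].ravel().astype(np.float64)

    # Print both so the output can be diffed across machines; differences
    # much larger than ~1e-5 are more than float32 round-off.
    print("sklearn:", sk_pred[:5])
    print("onnx:   ", onnx_pred[:5])
    print("max |diff|:", np.max(np.abs(sk_pred - onnx_pred)))
    print("allclose:", np.allclose(sk_pred, onnx_pred, rtol=1e-5, atol=1e-6))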

What is going on here? Why can I not get consistent results with my models across environments? The only way to get consistent results is to train the model again at runtime, which is clearly not a practical or scalable solution. I have confirmed that the files being deployed in the container are indeed correct, i.e. they are the output of the local training process.
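
The file check mentioned above is nothing more than a streamed SHA digest computed on both sides (SHA-256 here; the path is a placeholder):

    import hashlib

    def sha256_of(path, chunk_size=1 << 20):
        """Stream the file and return its hex SHA-256 digest."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                h.update(chunk)
        return h.hexdigest()

    # Run locally and inside the container, then compare the two digests.
    print(sha256_of("model.onnx"))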
