I created the model and saved its weights using Google Colab. I have now written a prediction script that contains the model class, and I am trying to load the model weights using the following method:
Saving & Loading Model Across Devices
Save on GPU, Load on CPU
Save:
torch.save(model.state_dict(), PATH)
Load:
device = torch.device('cpu')
model = TheModelClass(*args, **kwargs)
model.load_state_dict(torch.load(PATH, map_location=device))
The above method should work, right? Yes.
But when I do so, the model has a different number of parameters in Google Colab (prediction, runtime type None, device = CPU) than on my local machine (prediction, device = CPU).
Model Params in Colab-
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'The model has {count_parameters(model):,} trainable parameters')
The model has 12,490,234 trainable parameters
+-------------------------------------------------------+------------+
| Modules | Parameters |
+-------------------------------------------------------+------------+
| encoder.tok_embedding.weight | 2053376 |
| encoder.pos_embedding.weight | 25600 |
| encoder.layers.0.self_attn_layer_norm.weight | 256 |
| encoder.layers.0.self_attn_layer_norm.bias | 256 |
| encoder.layers.0.ff_layer_norm.weight | 256 |
| encoder.layers.0.ff_layer_norm.bias | 256 |
| encoder.layers.0.self_attention.fc_q.weight | 65536 |
| encoder.layers.0.self_attention.fc_q.bias | 256 |
| encoder.layers.0.self_attention.fc_k.weight | 65536 |
| encoder.layers.0.self_attention.fc_k.bias | 256 |
| encoder.layers.0.self_attention.fc_v.weight | 65536 |
| encoder.layers.0.self_attention.fc_v.bias | 256 |
| encoder.layers.0.self_attention.fc_o.weight | 65536 |
| encoder.layers.0.self_attention.fc_o.bias | 256 |
| encoder.layers.0.positionwise_feedforward.fc_1.weight | 131072 |
| encoder.layers.0.positionwise_feedforward.fc_1.bias | 512 |
| encoder.layers.0.positionwise_feedforward.fc_2.weight | 131072 |
| encoder.layers.0.positionwise_feedforward.fc_2.bias | 256 |
| encoder.layers.1.self_attn_layer_norm.weight | 256 |
| encoder.layers.1.self_attn_layer_norm.bias | 256 |
| encoder.layers.1.ff_layer_norm.weight | 256 |
| encoder.layers.1.ff_layer_norm.bias | 256 |
| encoder.layers.1.self_attention.fc_q.weight | 65536 |
| encoder.layers.1.self_attention.fc_q.bias | 256 |
| encoder.layers.1.self_attention.fc_k.weight | 65536 |
| encoder.layers.1.self_attention.fc_k.bias | 256 |
| encoder.layers.1.self_attention.fc_v.weight | 65536 |
| encoder.layers.1.self_attention.fc_v.bias | 256 |
| encoder.layers.1.self_attention.fc_o.weight | 65536 |
| encoder.layers.1.self_attention.fc_o.bias | 256 |
| encoder.layers.1.positionwise_feedforward.fc_1.weight | 131072 |
| encoder.layers.1.positionwise_feedforward.fc_1.bias | 512 |
| encoder.layers.1.positionwise_feedforward.fc_2.weight | 131072 |
| encoder.layers.1.positionwise_feedforward.fc_2.bias | 256 |
| encoder.layers.2.self_attn_layer_norm.weight | 256 |
| encoder.layers.2.self_attn_layer_norm.bias | 256 |
| encoder.layers.2.ff_layer_norm.weight | 256 |
| encoder.layers.2.ff_layer_norm.bias | 256 |
| encoder.layers.2.self_attention.fc_q.weight | 65536 |
| encoder.layers.2.self_attention.fc_q.bias | 256 |
| encoder.layers.2.self_attention.fc_k.weight | 65536 |
| encoder.layers.2.self_attention.fc_k.bias | 256 |
| encoder.layers.2.self_attention.fc_v.weight | 65536 |
| encoder.layers.2.self_attention.fc_v.bias | 256 |
| encoder.layers.2.self_attention.fc_o.weight | 65536 |
| encoder.layers.2.self_attention.fc_o.bias | 256 |
| encoder.layers.2.positionwise_feedforward.fc_1.weight | 131072 |
| encoder.layers.2.positionwise_feedforward.fc_1.bias | 512 |
| encoder.layers.2.positionwise_feedforward.fc_2.weight | 131072 |
| encoder.layers.2.positionwise_feedforward.fc_2.bias | 256 |
| decoder.tok_embedding.weight | 3209728 |
| decoder.pos_embedding.weight | 25600 |
| decoder.layers.0.self_attn_layer_norm.weight | 256 |
| decoder.layers.0.self_attn_layer_norm.bias | 256 |
| decoder.layers.0.enc_attn_layer_norm.weight | 256 |
| decoder.layers.0.enc_attn_layer_norm.bias | 256 |
| decoder.layers.0.ff_layer_norm.weight | 256 |
| decoder.layers.0.ff_layer_norm.bias | 256 |
| decoder.layers.0.self_attention.fc_q.weight | 65536 |
| decoder.layers.0.self_attention.fc_q.bias | 256 |
| decoder.layers.0.self_attention.fc_k.weight | 65536 |
| decoder.layers.0.self_attention.fc_k.bias | 256 |
| decoder.layers.0.self_attention.fc_v.weight | 65536 |
| decoder.layers.0.self_attention.fc_v.bias | 256 |
| decoder.layers.0.self_attention.fc_o.weight | 65536 |
| decoder.layers.0.self_attention.fc_o.bias | 256 |
| decoder.layers.0.encoder_attention.fc_q.weight | 65536 |
| decoder.layers.0.encoder_attention.fc_q.bias | 256 |
| decoder.layers.0.encoder_attention.fc_k.weight | 65536 |
| decoder.layers.0.encoder_attention.fc_k.bias | 256 |
| decoder.layers.0.encoder_attention.fc_v.weight | 65536 |
| decoder.layers.0.encoder_attention.fc_v.bias | 256 |
| decoder.layers.0.encoder_attention.fc_o.weight | 65536 |
| decoder.layers.0.encoder_attention.fc_o.bias | 256 |
| decoder.layers.0.positionwise_feedforward.fc_1.weight | 131072 |
| decoder.layers.0.positionwise_feedforward.fc_1.bias | 512 |
| decoder.layers.0.positionwise_feedforward.fc_2.weight | 131072 |
| decoder.layers.0.positionwise_feedforward.fc_2.bias | 256 |
| decoder.layers.1.self_attn_layer_norm.weight | 256 |
| decoder.layers.1.self_attn_layer_norm.bias | 256 |
| decoder.layers.1.enc_attn_layer_norm.weight | 256 |
| decoder.layers.1.enc_attn_layer_norm.bias | 256 |
| decoder.layers.1.ff_layer_norm.weight | 256 |
| decoder.layers.1.ff_layer_norm.bias | 256 |
| decoder.layers.1.self_attention.fc_q.weight | 65536 |
| decoder.layers.1.self_attention.fc_q.bias | 256 |
| decoder.layers.1.self_attention.fc_k.weight | 65536 |
| decoder.layers.1.self_attention.fc_k.bias | 256 |
| decoder.layers.1.self_attention.fc_v.weight | 65536 |
| decoder.layers.1.self_attention.fc_v.bias | 256 |
| decoder.layers.1.self_attention.fc_o.weight | 65536 |
| decoder.layers.1.self_attention.fc_o.bias | 256 |
| decoder.layers.1.encoder_attention.fc_q.weight | 65536 |
| decoder.layers.1.encoder_attention.fc_q.bias | 256 |
| decoder.layers.1.encoder_attention.fc_k.weight | 65536 |
| decoder.layers.1.encoder_attention.fc_k.bias | 256 |
| decoder.layers.1.encoder_attention.fc_v.weight | 65536 |
| decoder.layers.1.encoder_attention.fc_v.bias | 256 |
| decoder.layers.1.encoder_attention.fc_o.weight | 65536 |
| decoder.layers.1.encoder_attention.fc_o.bias | 256 |
| decoder.layers.1.positionwise_feedforward.fc_1.weight | 131072 |
| decoder.layers.1.positionwise_feedforward.fc_1.bias | 512 |
| decoder.layers.1.positionwise_feedforward.fc_2.weight | 131072 |
| decoder.layers.1.positionwise_feedforward.fc_2.bias | 256 |
| decoder.layers.2.self_attn_layer_norm.weight | 256 |
| decoder.layers.2.self_attn_layer_norm.bias | 256 |
| decoder.layers.2.enc_attn_layer_norm.weight | 256 |
| decoder.layers.2.enc_attn_layer_norm.bias | 256 |
| decoder.layers.2.ff_layer_norm.weight | 256 |
| decoder.layers.2.ff_layer_norm.bias | 256 |
| decoder.layers.2.self_attention.fc_q.weight | 65536 |
| decoder.layers.2.self_attention.fc_q.bias | 256 |
| decoder.layers.2.self_attention.fc_k.weight | 65536 |
| decoder.layers.2.self_attention.fc_k.bias | 256 |
| decoder.layers.2.self_attention.fc_v.weight | 65536 |
| decoder.layers.2.self_attention.fc_v.bias | 256 |
| decoder.layers.2.self_attention.fc_o.weight | 65536 |
| decoder.layers.2.self_attention.fc_o.bias | 256 |
| decoder.layers.2.encoder_attention.fc_q.weight | 65536 |
| decoder.layers.2.encoder_attention.fc_q.bias | 256 |
| decoder.layers.2.encoder_attention.fc_k.weight | 65536 |
| decoder.layers.2.encoder_attention.fc_k.bias | 256 |
| decoder.layers.2.encoder_attention.fc_v.weight | 65536 |
| decoder.layers.2.encoder_attention.fc_v.bias | 256 |
| decoder.layers.2.encoder_attention.fc_o.weight | 65536 |
| decoder.layers.2.encoder_attention.fc_o.bias | 256 |
| decoder.layers.2.positionwise_feedforward.fc_1.weight | 131072 |
| decoder.layers.2.positionwise_feedforward.fc_1.bias | 512 |
| decoder.layers.2.positionwise_feedforward.fc_2.weight | 131072 |
| decoder.layers.2.positionwise_feedforward.fc_2.bias | 256 |
| decoder.fc_out.weight | 3209728 |
| decoder.fc_out.bias | 12538 |
+-------------------------------------------------------+------------+
Total Trainable Params: 12490234
Model Params in Local-
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f'The model has {count_parameters(model):,} trainable parameters')
The model has 12,506,137 trainable parameters
+-------------------------------------------------------+------------+
| Modules | Parameters |
+-------------------------------------------------------+------------+
| encoder.tok_embedding.weight | 2053376 |
| encoder.pos_embedding.weight | 25600 |
| encoder.layers.0.self_attn_layer_norm.weight | 256 |
| encoder.layers.0.self_attn_layer_norm.bias | 256 |
| encoder.layers.0.ff_layer_norm.weight | 256 |
| encoder.layers.0.ff_layer_norm.bias | 256 |
| encoder.layers.0.self_attention.fc_q.weight | 65536 |
| encoder.layers.0.self_attention.fc_q.bias | 256 |
| encoder.layers.0.self_attention.fc_k.weight | 65536 |
| encoder.layers.0.self_attention.fc_k.bias | 256 |
| encoder.layers.0.self_attention.fc_v.weight | 65536 |
| encoder.layers.0.self_attention.fc_v.bias | 256 |
| encoder.layers.0.self_attention.fc_o.weight | 65536 |
| encoder.layers.0.self_attention.fc_o.bias | 256 |
| encoder.layers.0.positionwise_feedforward.fc_1.weight | 131072 |
| encoder.layers.0.positionwise_feedforward.fc_1.bias | 512 |
| encoder.layers.0.positionwise_feedforward.fc_2.weight | 131072 |
| encoder.layers.0.positionwise_feedforward.fc_2.bias | 256 |
| encoder.layers.1.self_attn_layer_norm.weight | 256 |
| encoder.layers.1.self_attn_layer_norm.bias | 256 |
| encoder.layers.1.ff_layer_norm.weight | 256 |
| encoder.layers.1.ff_layer_norm.bias | 256 |
| encoder.layers.1.self_attention.fc_q.weight | 65536 |
| encoder.layers.1.self_attention.fc_q.bias | 256 |
| encoder.layers.1.self_attention.fc_k.weight | 65536 |
| encoder.layers.1.self_attention.fc_k.bias | 256 |
| encoder.layers.1.self_attention.fc_v.weight | 65536 |
| encoder.layers.1.self_attention.fc_v.bias | 256 |
| encoder.layers.1.self_attention.fc_o.weight | 65536 |
| encoder.layers.1.self_attention.fc_o.bias | 256 |
| encoder.layers.1.positionwise_feedforward.fc_1.weight | 131072 |
| encoder.layers.1.positionwise_feedforward.fc_1.bias | 512 |
| encoder.layers.1.positionwise_feedforward.fc_2.weight | 131072 |
| encoder.layers.1.positionwise_feedforward.fc_2.bias | 256 |
| encoder.layers.2.self_attn_layer_norm.weight | 256 |
| encoder.layers.2.self_attn_layer_norm.bias | 256 |
| encoder.layers.2.ff_layer_norm.weight | 256 |
| encoder.layers.2.ff_layer_norm.bias | 256 |
| encoder.layers.2.self_attention.fc_q.weight | 65536 |
| encoder.layers.2.self_attention.fc_q.bias | 256 |
| encoder.layers.2.self_attention.fc_k.weight | 65536 |
| encoder.layers.2.self_attention.fc_k.bias | 256 |
| encoder.layers.2.self_attention.fc_v.weight | 65536 |
| encoder.layers.2.self_attention.fc_v.bias | 256 |
| encoder.layers.2.self_attention.fc_o.weight | 65536 |
| encoder.layers.2.self_attention.fc_o.bias | 256 |
| encoder.layers.2.positionwise_feedforward.fc_1.weight | 131072 |
| encoder.layers.2.positionwise_feedforward.fc_1.bias | 512 |
| encoder.layers.2.positionwise_feedforward.fc_2.weight | 131072 |
| encoder.layers.2.positionwise_feedforward.fc_2.bias | 256 |
| decoder.tok_embedding.weight | 3217664 |
| decoder.pos_embedding.weight | 25600 |
| decoder.layers.0.self_attn_layer_norm.weight | 256 |
| decoder.layers.0.self_attn_layer_norm.bias | 256 |
| decoder.layers.0.enc_attn_layer_norm.weight | 256 |
| decoder.layers.0.enc_attn_layer_norm.bias | 256 |
| decoder.layers.0.ff_layer_norm.weight | 256 |
| decoder.layers.0.ff_layer_norm.bias | 256 |
| decoder.layers.0.self_attention.fc_q.weight | 65536 |
| decoder.layers.0.self_attention.fc_q.bias | 256 |
| decoder.layers.0.self_attention.fc_k.weight | 65536 |
| decoder.layers.0.self_attention.fc_k.bias | 256 |
| decoder.layers.0.self_attention.fc_v.weight | 65536 |
| decoder.layers.0.self_attention.fc_v.bias | 256 |
| decoder.layers.0.self_attention.fc_o.weight | 65536 |
| decoder.layers.0.self_attention.fc_o.bias | 256 |
| decoder.layers.0.encoder_attention.fc_q.weight | 65536 |
| decoder.layers.0.encoder_attention.fc_q.bias | 256 |
| decoder.layers.0.encoder_attention.fc_k.weight | 65536 |
| decoder.layers.0.encoder_attention.fc_k.bias | 256 |
| decoder.layers.0.encoder_attention.fc_v.weight | 65536 |
| decoder.layers.0.encoder_attention.fc_v.bias | 256 |
| decoder.layers.0.encoder_attention.fc_o.weight | 65536 |
| decoder.layers.0.encoder_attention.fc_o.bias | 256 |
| decoder.layers.0.positionwise_feedforward.fc_1.weight | 131072 |
| decoder.layers.0.positionwise_feedforward.fc_1.bias | 512 |
| decoder.layers.0.positionwise_feedforward.fc_2.weight | 131072 |
| decoder.layers.0.positionwise_feedforward.fc_2.bias | 256 |
| decoder.layers.1.self_attn_layer_norm.weight | 256 |
| decoder.layers.1.self_attn_layer_norm.bias | 256 |
| decoder.layers.1.enc_attn_layer_norm.weight | 256 |
| decoder.layers.1.enc_attn_layer_norm.bias | 256 |
| decoder.layers.1.ff_layer_norm.weight | 256 |
| decoder.layers.1.ff_layer_norm.bias | 256 |
| decoder.layers.1.self_attention.fc_q.weight | 65536 |
| decoder.layers.1.self_attention.fc_q.bias | 256 |
| decoder.layers.1.self_attention.fc_k.weight | 65536 |
| decoder.layers.1.self_attention.fc_k.bias | 256 |
| decoder.layers.1.self_attention.fc_v.weight | 65536 |
| decoder.layers.1.self_attention.fc_v.bias | 256 |
| decoder.layers.1.self_attention.fc_o.weight | 65536 |
| decoder.layers.1.self_attention.fc_o.bias | 256 |
| decoder.layers.1.encoder_attention.fc_q.weight | 65536 |
| decoder.layers.1.encoder_attention.fc_q.bias | 256 |
| decoder.layers.1.encoder_attention.fc_k.weight | 65536 |
| decoder.layers.1.encoder_attention.fc_k.bias | 256 |
| decoder.layers.1.encoder_attention.fc_v.weight | 65536 |
| decoder.layers.1.encoder_attention.fc_v.bias | 256 |
| decoder.layers.1.encoder_attention.fc_o.weight | 65536 |
| decoder.layers.1.encoder_attention.fc_o.bias | 256 |
| decoder.layers.1.positionwise_feedforward.fc_1.weight | 131072 |
| decoder.layers.1.positionwise_feedforward.fc_1.bias | 512 |
| decoder.layers.1.positionwise_feedforward.fc_2.weight | 131072 |
| decoder.layers.1.positionwise_feedforward.fc_2.bias | 256 |
| decoder.layers.2.self_attn_layer_norm.weight | 256 |
| decoder.layers.2.self_attn_layer_norm.bias | 256 |
| decoder.layers.2.enc_attn_layer_norm.weight | 256 |
| decoder.layers.2.enc_attn_layer_norm.bias | 256 |
| decoder.layers.2.ff_layer_norm.weight | 256 |
| decoder.layers.2.ff_layer_norm.bias | 256 |
| decoder.layers.2.self_attention.fc_q.weight | 65536 |
| decoder.layers.2.self_attention.fc_q.bias | 256 |
| decoder.layers.2.self_attention.fc_k.weight | 65536 |
| decoder.layers.2.self_attention.fc_k.bias | 256 |
| decoder.layers.2.self_attention.fc_v.weight | 65536 |
| decoder.layers.2.self_attention.fc_v.bias | 256 |
| decoder.layers.2.self_attention.fc_o.weight | 65536 |
| decoder.layers.2.self_attention.fc_o.bias | 256 |
| decoder.layers.2.encoder_attention.fc_q.weight | 65536 |
| decoder.layers.2.encoder_attention.fc_q.bias | 256 |
| decoder.layers.2.encoder_attention.fc_k.weight | 65536 |
| decoder.layers.2.encoder_attention.fc_k.bias | 256 |
| decoder.layers.2.encoder_attention.fc_v.weight | 65536 |
| decoder.layers.2.encoder_attention.fc_v.bias | 256 |
| decoder.layers.2.encoder_attention.fc_o.weight | 65536 |
| decoder.layers.2.encoder_attention.fc_o.bias | 256 |
| decoder.layers.2.positionwise_feedforward.fc_1.weight | 131072 |
| decoder.layers.2.positionwise_feedforward.fc_1.bias | 512 |
| decoder.layers.2.positionwise_feedforward.fc_2.weight | 131072 |
| decoder.layers.2.positionwise_feedforward.fc_2.bias | 256 |
| decoder.fc_out.weight | 3217664 |
| decoder.fc_out.bias | 12569 |
+-------------------------------------------------------+------------+
Total Trainable Params: 12506137
So that is why I am unable to load the model: it has a different number of parameters on my local machine.
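The two tables differ in only three rows, all of them in the decoder. A quick arithmetic check, using only numbers copied from the tables above and assuming that the 256 seen throughout is the hidden dimension, suggests that the first dimension of those tensors (the decoder's vocabulary size) is 12538 in Colab and 12569 locally:

```python
# Parameter counts copied from the two tables above (the only rows that differ).
colab = {
    "decoder.tok_embedding.weight": 3209728,
    "decoder.fc_out.weight": 3209728,
    "decoder.fc_out.bias": 12538,
}
local = {
    "decoder.tok_embedding.weight": 3217664,
    "decoder.fc_out.weight": 3217664,
    "decoder.fc_out.bias": 12569,
}

HID_DIM = 256  # assumption: hidden size, inferred from the 256-sized rows

# An embedding of shape [vocab, hid] has vocab * hid parameters.
colab_vocab = colab["decoder.tok_embedding.weight"] // HID_DIM
local_vocab = local["decoder.tok_embedding.weight"] // HID_DIM
print(colab_vocab, local_vocab)  # 12538 12569

# Per-row differences between the local and Colab models.
diffs = {name: local[name] - colab[name] for name in colab}
print(diffs)

# These three rows account for the whole gap between the reported totals.
assert sum(diffs.values()) == 12506137 - 12490234
```

So the local model appears to be built with 31 more decoder tokens than the Colab model, and nothing else differs.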
When I try to load the weights on my local machine, I get the following:
model.load_state_dict(torch.load(f"{model_name}.pt", map_location=device))
Error-
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-24-f5baac4441a5> in <module>
----> 1 model.load_state_dict(torch.load(f"{model_name}_2.pt", map_location=device))
c:\anaconda\envs\lang_trans\lib\site-packages\torch\nn\modules\module.py in load_state_dict(self, state_dict, strict)
845 if len(error_msgs) > 0:
846 raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
--> 847 self.__class__.__name__, "\n\t".join(error_msgs)))
848 return _IncompatibleKeys(missing_keys, unexpected_keys)
849
RuntimeError: Error(s) in loading state_dict for Seq2Seq:
    size mismatch for decoder.tok_embedding.weight: copying a param with shape torch.Size([12538, 256]) from checkpoint, the shape in current model is torch.Size([12569, 256]).
    size mismatch for decoder.fc_out.weight: copying a param with shape torch.Size([12538, 256]) from checkpoint, the shape in current model is torch.Size([12569, 256]).
    size mismatch for decoder.fc_out.bias: copying a param with shape torch.Size([12538]) from checkpoint, the shape in current model is torch.Size([12569]).
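The same mismatches can be listed before calling `load_state_dict` by comparing tensor shapes name by name. The helper below is a hypothetical sketch (`find_shape_mismatches` is not part of PyTorch). It works on plain name-to-shape dictionaries, which in PyTorch could be built with `{k: tuple(v.shape) for k, v in state_dict.items()}` for both the checkpoint and `model.state_dict()`. The decoder shapes are the ones quoted in the error message; the layer-norm entry is added only as a matching example.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def find_shape_mismatches(
    checkpoint: Dict[str, Shape], model: Dict[str, Shape]
) -> List[Tuple[str, Shape, Shape]]:
    """Return (name, checkpoint_shape, model_shape) for shared names whose shapes differ."""
    return [
        (name, shape, model[name])
        for name, shape in checkpoint.items()
        if name in model and model[name] != shape
    ]


# Shapes saved in Colab (from the error message, plus one matching entry).
checkpoint_shapes = {
    "encoder.layers.0.self_attn_layer_norm.bias": (256,),
    "decoder.tok_embedding.weight": (12538, 256),
    "decoder.fc_out.weight": (12538, 256),
    "decoder.fc_out.bias": (12538,),
}
# Shapes of the model constructed on the local machine.
model_shapes = {
    "encoder.layers.0.self_attn_layer_norm.bias": (256,),
    "decoder.tok_embedding.weight": (12569, 256),
    "decoder.fc_out.weight": (12569, 256),
    "decoder.fc_out.bias": (12569,),
}

mismatches = find_shape_mismatches(checkpoint_shapes, model_shapes)
for name, saved, current in mismatches:
    print(f"{name}: checkpoint {saved} vs model {current}")
```

Every mismatch involves the decoder's vocabulary-sized dimension, which points at how the model is constructed locally rather than at `torch.load` or `map_location`.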
The parameter count on the local machine must be wrong, because in Colab (device = CPU, runtime = None) I am able to load the weights after defining the model class. On the local machine, however, the parameter counts change, so I am unable to load the weights. I know it's weird; please help me find a solution.
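One possible explanation, which is an assumption because the vocabulary-building code is not shown in this post, is that the target vocabulary is rebuilt on each machine and comes out with 12538 entries in Colab and 12569 locally (for example because of a different data file, tokenizer version, or minimum-frequency setting). If that is the case, a common remedy is to save the token-to-index mapping next to the weights and reload it in the prediction script, so that the model is constructed with the same output dimension. The sketch below uses only the standard library; `save_vocab`, `load_vocab`, and the file name are hypothetical.

```python
import json
import os
import tempfile
from typing import Dict


def save_vocab(token_to_index: Dict[str, int], path: str) -> None:
    """Write the token-to-index mapping used at training time to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(token_to_index, f, ensure_ascii=False)


def load_vocab(path: str) -> Dict[str, int]:
    """Read the mapping back so the prediction script does not rebuild it."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Toy vocabulary standing in for the real one built during training.
train_vocab = {"<unk>": 0, "<pad>": 1, "<sos>": 2, "<eos>": 3, "hello": 4}

with tempfile.TemporaryDirectory() as tmp:
    vocab_path = os.path.join(tmp, "trg_vocab.json")  # hypothetical file name
    save_vocab(train_vocab, vocab_path)  # done once, where the model is trained
    restored = load_vocab(vocab_path)    # done in the prediction script

# The decoder's output dimension is taken from the restored vocabulary,
# not from a vocabulary rebuilt on the local machine.
output_dim = len(restored)
print(output_dim)  # 5
```

In the real script, `output_dim` would be the value passed to the decoder's constructor before `load_state_dict` is called, so that the embedding and output layers match the checkpoint.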
You can check the full code of the model here-
https://gist.github.com/Dipeshpal/90c715a7b7f00845e20ef998bda35835
After this, the model's parameters change.