PyTorch: Is a single forward pass with the full data identical to multiple forward passes with subsets of the data?


I am optimizing my neural network (an LSTM network) with PyTorch. For some reason, during training I cannot pass all the data at once and can only pass subsets of it (please see the code below). In both approaches, assume that I ONLY update the network weights after all of the x data (all mini-batches) have been passed through the model. My question is: for the optimization of the network weights, are the two approaches identical?

import torch

# Create input
x = {}
x["one"] = torch.rand(10,2)
x["two"] = torch.rand(7,2)

y = {}
y["one"] = torch.rand(10,1)
y["two"] = torch.rand(7,1)

# LSTM model
model = torch.nn.LSTM(input_size=2, hidden_size=1)

# Train the model
optim = torch.optim.Adam(model.parameters(), lr=0.001)
loss = torch.nn.L1Loss()

#------------------------------------------------------------------------------
#   First approach
#------------------------------------------------------------------------------
y_true = torch.cat((y["one"], y["two"]), dim = 0)
for epoch in range(3):
    y_predict = {}
    for key in x.keys():
        y_predict[key], _ = model(x[key])
    
    # reset gradient to zero
    optim.zero_grad()
    
    # convert y predict to torch tensor
    y_predict = torch.cat((y_predict["one"], y_predict["two"]), dim = 0)
    
    # calculating loss and update weights
    L1loss = loss(y_true, y_predict)
    L1loss.backward()
    optim.step()
    
#------------------------------------------------------------------------------
#   Second approach
#------------------------------------------------------------------------------

input = torch.cat((x["one"], x["two"]), dim = 0)
for epoch in range(3):
    y_predict, _ = model(input)

    # reset gradient to zero
    optim.zero_grad()

    # calculating loss and update weights
    L1loss = loss(y_true, y_predict)
    L1loss.backward()
    optim.step()

1 Answer

tamnva:

I am trying to answer my own question. My answer would be NO, because:

On every forward pass, PyTorch builds a computation graph and saves the intermediate activations (the outputs between layers), which are needed for the backward pass (.backward()). In the first approach (the one with two forward passes per epoch), I think the intermediate activations from the first forward pass will be overwritten by the ones created in the second forward pass, so the model will only use the intermediate activations of the second pass for .backward().
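For what it's worth, a quick empirical check is to compare the gradients that the two approaches actually produce. Below is a minimal sketch (not from the original post) that copies the LSTM so both approaches start from identical weights, runs one backward pass per approach, and compares the resulting gradients parameter by parameter.

#------------------------------------------------------------------------------
#   Gradient comparison (minimal sketch, assumes PyTorch >= 1.11 for unbatched LSTM input)
#------------------------------------------------------------------------------
import copy
import torch

torch.manual_seed(0)
x = {"one": torch.rand(10, 2), "two": torch.rand(7, 2)}
y = {"one": torch.rand(10, 1), "two": torch.rand(7, 1)}
y_true = torch.cat((y["one"], y["two"]), dim=0)

model_a = torch.nn.LSTM(input_size=2, hidden_size=1)
model_b = copy.deepcopy(model_a)   # identical initial weights
loss_fn = torch.nn.L1Loss()

# First approach: one forward pass per key, single backward on the concatenated output
y_parts = [model_a(x[key])[0] for key in ("one", "two")]
loss_fn(torch.cat(y_parts, dim=0), y_true).backward()

# Second approach: single forward pass on the concatenated input
y_full, _ = model_b(torch.cat((x["one"], x["two"]), dim=0))
loss_fn(y_full, y_true).backward()

# Compare the gradients of the two models parameter by parameter
for (name, p_a), (_, p_b) in zip(model_a.named_parameters(), model_b.named_parameters()):
    print(name, torch.allclose(p_a.grad, p_b.grad))

On this toy example most of these checks should print False. One reason the results differ for an LSTM in particular: concatenating x["one"] and x["two"] along dim 0 produces a single 17-step sequence, so in the second approach the hidden state at the end of x["one"] is carried into x["two"], whereas in the first approach each forward pass starts from the default zero hidden state.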