Why not backpropagate through time in an LSTM, as in an RNN?


I'm trying to implement an RNN and an LSTM with a many-to-many architecture. I have reasoned through why BPTT is necessary in RNNs, and it makes sense to me.

But what doesn't make sense to me is that most of the resources I went through online for LSTM backprop (like the attachment below, and this one: https://www.youtube.com/watch?v=8rQPJnyGLlY) seem to do backpropagation with respect to the current time step only, not through time. Variable conventions aside, just as in an RNN, I would assume that when calculating $\frac{\partial L}{\partial W_f}$, since both the current time step's gate and the previous hidden state involve $W_f$, the two contributions need to be added together; but most derivations I found are not doing that.

Specifically, using Coursera's notation, the forward prop is:

$$f_t = \sigma(W_f \cdot [a_{t-1}, x_t] + b_f)$$
$$i_t = \sigma(W_i \cdot [a_{t-1}, x_t] + b_i)$$
$$\tilde{c}_t = \tanh(W_c \cdot [a_{t-1}, x_t] + b_c)$$

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$
$$o_t = \sigma(W_o \cdot [a_{t-1}, x_t] + b_o)$$
$$a_t = o_t \odot \tanh(c_t)$$
$$\hat{y}_t = \mathrm{softmax}(W_y \cdot a_t + b_y)$$
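
To keep the notation concrete, here is a minimal NumPy sketch of this forward step as I understand it (the function name `lstm_step_forward`, the `params` dictionary and the shape conventions are my own, not Coursera's):

```python
# Minimal NumPy sketch of one LSTM forward step in the notation above.
# Shapes: x_t is (n_x, 1), a_prev and c_prev are (n_a, 1).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step_forward(x_t, a_prev, c_prev, params):
    """One time step: returns a_t, c_t, yhat_t and a cache for backprop."""
    concat = np.concatenate([a_prev, x_t], axis=0)        # [a_{t-1}, x_t]

    ft   = sigmoid(params["Wf"] @ concat + params["bf"])  # forget gate
    it   = sigmoid(params["Wi"] @ concat + params["bi"])  # update (input) gate
    ctil = np.tanh(params["Wc"] @ concat + params["bc"])  # candidate cell state
    ct   = ft * c_prev + it * ctil                        # new cell state
    ot   = sigmoid(params["Wo"] @ concat + params["bo"])  # output gate
    at   = ot * np.tanh(ct)                               # new hidden state

    z = params["Wy"] @ at + params["by"]                  # output logits
    e = np.exp(z - z.max(axis=0, keepdims=True))          # stable softmax
    yhat_t = e / e.sum(axis=0, keepdims=True)

    cache = (concat, ft, it, ctil, ct, ot, at, c_prev)
    return at, ct, yhat_t, cache
```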

Since both $a_{t-1}$ and $o_t$ have $W_o$ in their equations, I would assume the following derivation for $dW_o$:

$$\frac{\partial L}{\partial W_o} = \underbrace{\frac{\partial L}{\partial a_t} \odot \big(1-\tanh^2(c_t)\big) \odot o_t(1-o_t)\,[a_{t-1}, x_t]^T}_{\text{the } \partial o_t \text{ part}} \;+\; \underbrace{\frac{\partial o_t}{\partial a_{t-1}} \cdot \frac{\partial a_{t-1}}{\partial o_{t-1}} \cdot \frac{\partial o_{t-1}}{\partial W_o}}_{\text{the } \partial o_{t-1} \text{ part}}$$

The above derivation contains both a $\partial o_t$ and a $\partial o_{t-1}$ term, but the derivation given by Coursera only contains $\partial o_t$ and not $\partial o_{t-1}$.

And in the derivations of the update and forget gates, shouldn't the terms that follow the plus sign be at time $t-1$ rather than $t$?

So I'm assuming LSTM backprop doesn't involve BPTT; could someone please enlighten me on this?
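
To sanity-check my intuition numerically, here is a small sketch I put together (PyTorch is just my choice here; the shapes, the dummy per-step loss and the dropped biases are arbitrary). It compares the gradient of $W_o$ when the LSTM is unrolled normally against the gradient when the recurrent states are detached at every step, so that nothing flows back through time:

```python
import torch

torch.manual_seed(0)
n_a, n_x, T = 4, 3, 5

# Only Wo gets a gradient; the other parameters are fixed to keep things focused.
Wo = torch.randn(n_a, n_a + n_x, requires_grad=True)
Wf = torch.randn(n_a, n_a + n_x)
Wi = torch.randn(n_a, n_a + n_x)
Wc = torch.randn(n_a, n_a + n_x)

def unroll(xs, truncate=False):
    """Run the LSTM over a sequence and return a dummy scalar loss."""
    a = torch.zeros(n_a, 1)
    c = torch.zeros(n_a, 1)
    loss = 0.0
    for x in xs:
        if truncate:
            # Cut the recurrent paths: no gradient flows back through time,
            # so dL/dWo only collects the per-time-step ("do_t") terms.
            a, c = a.detach(), c.detach()
        concat = torch.cat([a, x], dim=0)          # [a_{t-1}, x_t]
        ft = torch.sigmoid(Wf @ concat)
        it = torch.sigmoid(Wi @ concat)
        ctil = torch.tanh(Wc @ concat)
        c = ft * c + it * ctil
        ot = torch.sigmoid(Wo @ concat)
        a = ot * torch.tanh(c)
        loss = loss + (a ** 2).sum()               # arbitrary per-step loss
    return loss

xs = [torch.randn(n_x, 1) for _ in range(T)]

unroll(xs, truncate=False).backward()
full_grad = Wo.grad.clone(); Wo.grad = None        # gradient with BPTT

unroll(xs, truncate=True).backward()
trunc_grad = Wo.grad.clone(); Wo.grad = None       # per-time-step terms only

print((full_grad - trunc_grad).abs().max())        # non-zero difference
```

In this setup the two gradients differ, and by construction the difference is exactly the through-time contribution that a purely per-time-step derivation leaves out.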

I'm specifically looking for the correct theoretical derivation of backpropagation in LSTMs, and whether or not it involves BPTT in theory.

Any help is very much appreciated.

Thanks!

[Attachment: Coursera LSTM Backprop Derivation]

1 Answer

Answer by mrk:

There is no single "right" theoretical backpropagation for LSTMs; they evolved and changed over time. I find this paper very useful for understanding LSTMs.

Following is a quote from there (in the paper you will also find the sources referenced there):

History of LSTM: The initial version of the LSTM block [14, 15] included (possibly multiple) cells, input and output gates, but no forget gate and no peephole connections. The output gate, unit biases, or input activation function were omitted for certain experiments. Training was done using a mixture of Real-Time Recurrent Learning (RTRL) [23, 24] and Backpropagation Through Time (BPTT) [24, 25].

Only the gradient of the cell was propagated back through time, and the gradient for the other recurrent connections was truncated. Thus, that study did not use the exact gradient for training. Another feature of that version was the use of full gate recurrence, which means that all the gates received recurrent inputs from all gates at the previous time-step in addition to the recurrent inputs from the block outputs. This feature did not appear in any of the later papers.
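
To make the truncation described in the quote a bit more concrete, here is a rough sketch of that pattern (PyTorch is my choice of framework; shapes, the dummy loss and the omitted biases are arbitrary): the cell state carries gradient across time steps, while the recurrent path through the hidden state is cut.

```python
import torch

torch.manual_seed(0)
n_a, n_x, T = 4, 3, 5
W = {name: torch.randn(n_a, n_a + n_x, requires_grad=True)
     for name in ("Wf", "Wi", "Wc", "Wo")}

a = torch.zeros(n_a, 1)
c = torch.zeros(n_a, 1)
loss = 0.0
for x in [torch.randn(n_x, 1) for _ in range(T)]:
    a = a.detach()                        # recurrent hidden-state path is truncated
    concat = torch.cat([a, x], dim=0)     # [a_{t-1}, x_t]
    ft = torch.sigmoid(W["Wf"] @ concat)
    it = torch.sigmoid(W["Wi"] @ concat)
    ctil = torch.tanh(W["Wc"] @ concat)
    c = ft * c + it * ctil                # c is NOT detached: its gradient
    ot = torch.sigmoid(W["Wo"] @ concat)  # still flows back through time
    a = ot * torch.tanh(c)
    loss = loss + (a ** 2).sum()          # arbitrary per-step loss

loss.backward()                           # gradients reflect cell-only BPTT
```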

Hope this helps. Otherwise I would also recommend looking for help in another community, or rephrasing your problem as a coding problem.