I'm studying temporal difference learning from this post. The update rule of TD(0) is clear to me, but in TD(λ) I don't understand how the utility values of all the previous states are updated in a single update.
Here is the diagram given for comparison of both updates:
The above diagram is explained as follows:
In TD(λ) the result is propagated back to all the previous states thanks to the eligibility traces.
My question is: how is the information propagated to all the previous states in a single update if we are using the following update rule with eligibility traces?
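(For reference, I believe the rule in question is the standard TD(λ) update with eligibility traces, something like:)

$$U(s) \leftarrow U(s) + \alpha\,\delta_t\,e_t(s)$$

$$\delta_t = r_{t+1} + \gamma\,U(s_{t+1}) - U(s_t), \qquad e_t(s) = \gamma\lambda\,e_{t-1}(s) + \mathbf{1}[s = s_t]$$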
Here, in a single update, we're only updating the utility of a single state U_t(s), so how are the utilities of all the previous states getting updated?
Edit
As per the answer, it is clear that this update is applied at every single step, and that is how the information is propagated. If that is the case, it still confuses me, because the only difference between the two update rules is the eligibility trace.
So even if the eligibility trace is non-zero for previous states, the value of delta will be zero in the above case (because the rewards and the utility function are initially zero). Then how is it possible for previous states to get utility values other than zero in the first update?
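For example, for the very first transition, with everything initialized to zero, it seems that

$$\delta_0 = r_1 + \gamma\,U(s_1) - U(s_0) = 0 + \gamma \cdot 0 - 0 = 0.$$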
Also, in the given Python implementation, the following output is given after a single iteration:
[[ 0.       0.04595  0.1      0.     ]
 [ 0.       0.       0.       0.     ]
 [ 0.       0.       0.       0.     ]]
Here only 2 values are updated, instead of all 5 previous states as shown in the figure. What am I missing here?


You are missing a small but important detail: the update rule is applied to all states, not only the current state. So, in practice, you are updating all the states whose e_t(s) is different from zero.
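A minimal sketch of that idea, assuming U and e are NumPy arrays covering the whole state space (the names and the 3x4 grid size here are illustrative, not taken from the blog's code):

```python
import numpy as np

alpha, gamma, lambda_ = 0.1, 0.999, 0.5   # illustrative hyperparameters

def td_lambda_update(U, e, s, s_next, reward):
    """One TD(lambda) backup: delta is computed from the current transition,
    but it is added to EVERY state in proportion to its eligibility trace."""
    delta = reward + gamma * U[s_next] - U[s]
    e[s] += 1.0                  # accumulating trace for the state just visited
    U += alpha * delta * e       # elementwise update: all states with e != 0 move
    e *= gamma * lambda_         # decay every trace for the next step
    return U, e

# Example: a 3x4 grid world stored as a flat array of 12 states
U = np.zeros(12)
e = np.zeros(12)
U, e = td_lambda_update(U, e, s=0, s_next=1, reward=0.0)
```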
Edit

The delta is not zero because it is computed for the current state, when the episode ends and the agent receives a reward of +1. Therefore, after computing a delta different from zero, you update all the states using that delta and the current eligibility traces.

I don't know why the Python implementation (I haven't checked it carefully) outputs only 2 updated values, but please verify that the eligibility traces for all 5 previous states are different from zero, and if that's not the case, try to understand why. Sometimes you are not interested in maintaining traces under a very small threshold (e.g., 10e-5), because they have a very small effect on the learning process and maintaining them wastes computational resources.
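To see the effect of the terminal reward described above, here is a small self-contained example (a made-up 5-state chain, not the blog's grid world) where all rewards are zero except a +1 on reaching the terminal state; delta stays at zero until that last step, and only then does every previously visited state with a non-negligible trace receive a nonzero utility:

```python
import numpy as np

alpha, gamma, lambda_ = 0.1, 0.999, 0.5
trace_cutoff = 1e-5                      # optional: drop negligible traces

n_states = 6                             # states 0..4 plus terminal state 5
U = np.zeros(n_states)                   # utilities start at zero
e = np.zeros(n_states)                   # eligibility traces start at zero

# One episode: (state, next_state, reward); only the last transition pays +1
episode = [(0, 1, 0.0), (1, 2, 0.0), (2, 3, 0.0), (3, 4, 0.0), (4, 5, 1.0)]

for s, s_next, reward in episode:
    delta = reward + gamma * U[s_next] - U[s]   # zero until the final step
    e[s] += 1.0                                  # accumulating trace
    U += alpha * delta * e                       # all states with e != 0 are updated
    e *= gamma * lambda_                         # decay all traces
    e[e < trace_cutoff] = 0.0                    # prune tiny traces (see remark above)

print(U)   # every visited state now has a small nonzero utility,
           # scaled by how recently it was visited (i.e., by its trace)
```

With a smaller lambda or a more aggressive trace cutoff, the traces of older states shrink quickly, so only the most recently visited states show a visible change, which may be what is happening in the output you pasted.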