I've been studying a simple and straightforward DQN implementation, but I'm having trouble understanding a core part of the training process.
Feeding in batches of 64, they compute the current Q value for each sample, the corresponding target Q value, and from those the TD error. How can we train the neural network with just a single value per sample? From my understanding of back-propagation, we need the complete outputs of both the current and target networks, and the loss difference for each output value. That is, with 4 actions (output neurons) we should have a total of 4 × 64 values to train the model with. How is it possible to train with one value per sample? How does the network know what to change if we give no indication of which outputs were far off from the target?
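For context, I believe the per-sample Q value is picked out with a one-hot mask over the actions actually taken, along these lines (my reconstruction; q_t, act_t_ph, and num_actions are my guesses at the surrounding graph):

# q_t has shape (64, num_actions); the mask keeps only the Q value of the action taken
q_t_selected = tf.reduce_sum(q_t * tf.one_hot(act_t_ph, num_actions), axis=1)  # shape (64,)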
On top of that, they take the mean of all 64 errors and train on that single scalar, which seems to reduce even further the information the network gets about how to adjust its parameters.
What am I missing?
Code for reference:
td_error = q_t_selected - tf.stop_gradient(q_t_selected_target)  # Q(s,a;θ_i) - (r + gamma * max_a' Q(s',a';θ_i^-))
errors = U.huber_loss(td_error)                                  # per-sample Huber loss of the TD error, shape (64,)
weighted_error = tf.reduce_mean(importance_weights_ph * errors)  # importance-weighted mean over the batch -> one scalar
optimizer.minimize(weighted_error, var_list=q_func_vars)         # back-propagate through the online network's variables only
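To make the question concrete, here is a minimal standalone sketch of the training step I'm puzzled by (TF2-style with toy shapes; the network, data, and Huber formula are my own illustration, not the original code, and I've dropped the importance weights):

import numpy as np
import tensorflow as tf

num_actions, batch_size = 4, 64

q_net = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(num_actions),
])
optimizer = tf.keras.optimizers.Adam(1e-3)

obs = np.random.randn(batch_size, 8).astype(np.float32)    # toy states
actions = np.random.randint(num_actions, size=batch_size)  # actions actually taken
targets = np.random.randn(batch_size).astype(np.float32)   # stand-in for r + gamma * max_a' Q(s',a';θ^-)

with tf.GradientTape() as tape:
    q_t = q_net(obs)                                        # shape (64, 4): all action values
    q_t_selected = tf.reduce_sum(
        q_t * tf.one_hot(actions, num_actions), axis=1)     # shape (64,): one value per sample
    td_error = q_t_selected - targets                       # shape (64,)
    errors = tf.where(tf.abs(td_error) < 1.0,
                      0.5 * tf.square(td_error),
                      tf.abs(td_error) - 0.5)               # per-sample Huber loss
    loss = tf.reduce_mean(errors)                           # single scalar
grads = tape.gradient(loss, q_net.trainable_variables)
optimizer.apply_gradients(zip(grads, q_net.trainable_variables))

Even in this stripped-down version, the optimizer only ever sees the scalar loss, which is exactly the part I don't get.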