Loss function doesn't converge after vectorization

For a reinforcement learning project I'm working on, based on Deep Q-learning from Demonstrations (https://arxiv.org/pdf/1704.03732.pdf), the training process took a long time because of a for loop in the following function, which is called from my loss function:

def QmaxExp(state, model, Expert_action, OutsideConditions, setpoint, Inputdf):
    maxValue = -1000000
    for i in range(len(Inputdf)):
        # Candidate action i (one row of Inputdf)
        Actions = np.array([Inputdf['col1'][i], Inputdf['col2'][i], Inputdf['col3'][i], Inputdf['col4'][i]])
        conditions = np.array(list(OutsideConditions.values()))
        Inputs = np.concatenate((state, conditions, [setpoint]))
        # Q-value of action i, plus the scaled distance to the expert action
        Qvalue = model(Inputs.reshape((1, 18)))[0, i]
        Value = Qvalue + Lfunction(Actions, Expert_action) * 0.01
        if Value > maxValue:
            maxValue = Value
    return maxValue

I changed the code to the following:

def QmaxExp(state, model, Expert_action, OutsideConditions, setpoint, Inputdf):
    conditions = np.array(list(OutsideConditions.values()))
    Inputs = np.concatenate((state, conditions, [setpoint]))
    # All Q-values for this state in one model call
    Qvalues = model(Inputs.reshape((1, 18)))[0]
    # All candidate actions as one (len(Inputdf), 4) array
    Actions = np.array(Inputdf[['col1', 'col2', 'col3', 'col4']])
    # Row-wise version of Lfunction (the old, looped function uses np.sum(...) instead)
    Lfunctie_value = (np.subtract(Actions, Expert_action) ** 2).sum(axis=1)
    Values = Qvalues + Lfunctie_value * 0.01
    return np.max(Values)

This version is about 40 times faster and, for the same inputs, it returns the same maxValue (and the same loss value). However, during training the original QmaxExp converged to a loss of roughly 20, while the vectorized one does not converge at all and stays around 330.
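To check the rewrite, the two versions can be run on the same inputs, roughly like this (QmaxExp_loop is a hypothetical name for the old loop version, and the variables are assumed to be the same ones used in the training step):

import numpy as np

old = QmaxExp_loop(state, model, Expert_action, OutsideConditions, setpoint, Inputdf)
new = QmaxExp(state, model, Expert_action, OutsideConditions, setpoint, Inputdf)

print(float(old), float(new))                 # the values agree
print(np.isclose(float(old), float(new)))     # True
print(type(old), type(new))                   # also worth comparing what kind of object comes back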

In the training step both losses are computed side by side (here the gradient step uses loss):

with tf.GradientTape() as tape:
    # ... rest of the code ...
    loss += custom_loss(model, modelTarget, Current_state, Next_state, actions, Expert_action, Reward, gamma, OutsideConditions, Setpoint, Inputdf, HeatingInput, OtherInput)
    lossNew += custom_lossNew(model, modelTarget, Current_state, Next_state, actions, Expert_action, Reward, gamma, OutsideConditions, Setpoint, Inputdf, HeatingInput, OtherInput)
    # loss = tf.convert_to_tensor(loss)
    loss = loss / Batch_size
    lossNew = lossNew / Batch_size
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

The custom loss is defined as:

def custom_loss(model, modelTarget, state, new_state, action, Expert_action, reward, gamma, OutsideConditions, setpoint, Inputdf, HeatingInput, OtherInput):
    # 1-step TD term
    JDQ = (reward + gamma * QmaxT1(modelTarget, new_state, OutsideConditions, setpoint) - Qvalue(state, action, model, OutsideConditions, setpoint, HeatingInput, OtherInput, Inputdf)) ** 2
    # Supervised (expert) term: max over actions of Q plus margin, minus Q of the expert action
    JE = QmaxExp(state, model, Expert_action, OutsideConditions, setpoint, Inputdf) - QExp(state, model, Expert_action, OutsideConditions, setpoint, HeatingInput, OtherInput, Inputdf)
    # Additional squared-error term
    JL2 = (reward - Qvalue(state, action, model, OutsideConditions, setpoint, HeatingInput, OtherInput, Inputdf)) ** 2

    lambda1 = 1
    lambda2 = 1
    lambda3 = 1

    Loss = lambda1 * JDQ + lambda2 * JE + lambda3 * JL2
    return Loss
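For reference, the JE term is my attempt at the paper's supervised large-margin loss (as I understand it),

    J_E(Q) = max_a [ Q(s, a) + l(a_E, a) ] - Q(s, a_E)

where a_E is the expert action and l(a_E, a) is a margin that is 0 for the expert action and positive otherwise (here Lfunction scaled by 0.01). QmaxExp computes the max term over all candidate actions in Inputdf, and QExp computes Q(s, a_E).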

And the L function is simply the summed squared difference between an action and the expert action:

def Lfunction(action, Expert_action):
    # Squared distance between one action and the expert action
    # (the vectorized QmaxExp computes this row-wise with .sum(axis=1) instead)
    return np.sum(np.subtract(action, Expert_action) ** 2)
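Just to illustrate how this part was vectorized, the row-wise .sum(axis=1) used in the new QmaxExp should give exactly the per-row results of Lfunction (toy numbers, purely for illustration):

import numpy as np

Actions = np.array([[1.0, 2.0, 3.0, 4.0],
                    [0.0, 1.0, 0.0, 1.0]])          # made-up action rows
Expert_action = np.array([1.0, 1.0, 1.0, 1.0])      # made-up expert action

vectorized = (np.subtract(Actions, Expert_action) ** 2).sum(axis=1)     # as in the new QmaxExp
looped = np.array([Lfunction(a, Expert_action) for a in Actions])       # as in the old QmaxExp

print(vectorized, looped)               # [14.  2.] [14.  2.]
print(np.allclose(vectorized, looped))  # True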

Does anyone know what could be going wrong in the vectorized QmaxExp?

The model architecture is an input layer, one hidden layer of 25 nodes, and an output layer of 625 nodes. I haven't worked on a proper network yet, since speed was the main problem to solve first.
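In code that is roughly the following (a minimal sketch; the activation is an assumption and the real model may differ):

import tensorflow as tf

# 18 inputs (state + outside conditions + setpoint), 25 hidden units,
# 625 outputs (one Q-value per candidate action in Inputdf).
inputs = tf.keras.Input(shape=(18,))
hidden = tf.keras.layers.Dense(25, activation="relu")(inputs)   # activation assumed
outputs = tf.keras.layers.Dense(625)(hidden)
model = tf.keras.Model(inputs, outputs)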

Thanks in advance :)
