For a reinforcement learning project I'm working on, based on Deep Q-learning from Demonstrations (https://arxiv.org/pdf/1704.03732.pdf), training was taking very long because of a for loop in the function below, which is called from my loss function:
def QmaxExp(state, model, Expert_action, OutsideConditions, setpoint, Inputdf):
    maxValue = -1000000
    for i in range(len(Inputdf)):
        Actions = np.array([Inputdf['col1'][i], Inputdf['col2'][i], Inputdf['col3'][i], Inputdf['col4'][i]])
        conditions = np.array(list(OutsideConditions.values()))
        Inputs = np.concatenate((state, conditions, [setpoint]))
        Qvalue = model(Inputs.reshape((1, 18)))[0, i]
        Value = Qvalue + Lfunction(Actions, Expert_action) * 0.01
        if Value > maxValue:
            maxValue = Value
    return maxValue
Since the loop calls model(...) on the same 18-dimensional input once per row of Inputdf, I vectorized it into a single forward pass:
def QmaxExp(state, model, Expert_action, OutsideConditions, setpoint, Inputdf):
    conditions = np.array(list(OutsideConditions.values()))
    Inputs = np.concatenate((state, conditions, [setpoint]))
    Qvalues = model(Inputs.reshape((1, 18)))[0]
    Actions = np.array(Inputdf[['col1', 'col2', 'col3', 'col4']])
    Lfunctie_value = (np.subtract(Actions, Expert_action)**2).sum(axis=1)  # use np.sum(...) when using the old Lfunction
    Values = Qvalues + Lfunctie_value * 0.01
    return np.max(Values)
This version is about 40 times faster and returns the same maxValue (and the same loss value), but during training the first QmaxExp converges to a loss of about 20, while the vectorized one does not converge at all and stays around 330.
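For reference, this is roughly how I checked that the two versions agree. The model and data here are dummy stand-ins with shapes guessed to match my real setup (a 13-dimensional state plus 4 outside conditions plus the setpoint gives the 18 inputs, and Inputdf holds the 625 action combinations of 4 values); the keys and values are made up, and I renamed the two versions above to QmaxExp_old and QmaxExp_new so both can exist at the same time (Lfunction is the one defined further down):

import numpy as np
import pandas as pd
import tensorflow as tf

# Dummy stand-ins for my real model and data, only for this comparison
model = tf.keras.Sequential([
    tf.keras.Input(shape=(18,)),
    tf.keras.layers.Dense(25, activation='relu'),
    tf.keras.layers.Dense(625),
])
Inputdf = pd.DataFrame(np.random.rand(625, 4), columns=['col1', 'col2', 'col3', 'col4'])
OutsideConditions = {'T_out': 5.0, 'sun': 0.3, 'wind': 2.0, 'humidity': 0.6}
state = np.random.rand(13)
setpoint = 21.0
Expert_action = np.array([0.1, 0.2, 0.3, 0.4])

# QmaxExp_old / QmaxExp_new = the loop version and the vectorized version above, renamed
with tf.GradientTape() as tape_old:
    value_old = QmaxExp_old(state, model, Expert_action, OutsideConditions, setpoint, Inputdf)
with tf.GradientTape() as tape_new:
    value_new = QmaxExp_new(state, model, Expert_action, OutsideConditions, setpoint, Inputdf)
print(float(value_old), float(value_new))  # these agree for me

# Also check whether gradients w.r.t. the model flow through each version,
# since both mix NumPy and TensorFlow operations
grads_old = tape_old.gradient(tf.convert_to_tensor(value_old), model.trainable_variables)
grads_new = tape_new.gradient(tf.convert_to_tensor(value_new), model.trainable_variables)
print([g is None for g in grads_old])
print([g is None for g in grads_new])

The relevant part of my training step, where I compute both the old and the new loss so I can compare them, is: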
with tf.GradientTape() as tape:
    # ... rest of the code ...
    loss += custom_loss(model, modelTarget, Current_state, Next_state, actions, Expert_action, Reward, gamma, OutsideConditions, Setpoint, Inputdf, HeatingInput, OtherInput)
    lossNew += custom_lossNew(model, modelTarget, Current_state, Next_state, actions, Expert_action, Reward, gamma, OutsideConditions, Setpoint, Inputdf, HeatingInput, OtherInput)
    # loss = tf.convert_to_tensor(loss)
    loss = loss / Batch_size
    lossNew = lossNew / Batch_size
gradients = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
The custom loss is defined as:
def custom_loss(model, modelTarget, state, new_state, action, Expert_action, reward, gamma, OutsideConditions, setpoint, Inputdf, HeatingInput, OtherInput):
    JDQ = (reward + gamma * QmaxT1(modelTarget, new_state, OutsideConditions, setpoint) - Qvalue(state, action, model, OutsideConditions, setpoint, HeatingInput, OtherInput, Inputdf))**2
    JE = QmaxExp(state, model, Expert_action, OutsideConditions, setpoint, Inputdf) - QExp(state, model, Expert_action, OutsideConditions, setpoint, HeatingInput, OtherInput, Inputdf)
    JL2 = (reward - Qvalue(state, action, model, OutsideConditions, setpoint, HeatingInput, OtherInput, Inputdf))**2
    lambda1 = 1
    lambda2 = 1
    lambda3 = 1
    Loss = lambda1 * JDQ + lambda2 * JE + lambda3 * JL2
    return Loss
And Lfunction is simply the sum of the squared differences with the expert action:
def Lfunction(action, Expert_action):
    return np.sum(np.subtract(action, Expert_action)**2)  # .sum(axis=1) in the vectorized version, np.sum(...) when using the old function
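For context, QmaxExp and QExp together are meant to implement the large-margin supervised term from the DQfD paper, at least as I understand it, with QmaxExp computing the max term (margin scaled by 0.01) and QExp computing Q(s, a_E):

J_E(Q) = \max_{a \in A} \left[ Q(s, a) + l(a_E, a) \right] - Q(s, a_E)

where l(a_E, a) is 0 when a = a_E and positive otherwise (in my case 0.01 times the squared difference).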
Does anyone know what could go wrong in the vectorized QmaxExp?
The model architecture is just an input layer, one hidden layer of 25 nodes, and an output layer of 625 nodes (the same shape as the dummy model in the comparison snippet above). I haven't built a proper network yet, since speed was the main problem to solve first.
Thanks in advance :)