mat1 and mat2 must have the same dtype, but got Byte and Float


I am trying to implement a Deep Q-Network (DQN) reinforcement learning agent for the game 2048. The issue I am running into is a datatype mismatch during matrix multiplication: one matrix contains data of type Byte and the other of type Float.

I am using this gym environment - https://github.com/Quentin18/gymnasium-2048

I have followed the PyTorch DQN tutorial (https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html) to set up a DQN agent.

The two lines of code causing the issue are:

next_state_values[non_final_mask] = target_net(non_final_next_states).max(1).values

x = F.relu(self.layer1(x))

I assume target_values is mat1 and x is mat2 but I don't know this for certain.

I have tried casting the result of both lines to float by adding .float() at the end, but I get the same error.

I am having issues with both debugging and printing. I have set training to 1 episode to debug, however due to the number of transitions I can't step through to the problem in the debugger. I have also tried to print the data type of the variables, however because they are inside the method I haven't been able to work out how to get this to output (see the sketch after the code below for the kind of check I was attempting). I use PyCharm.

I am unsure if the code will be helpful as there are multiple values involved in the matrix multiplication, however I will include the two methods I'm referring to.

Specifically, any guidance on the correct way to approach debugging this would be fantastic. Thank you.

import torch
import torch.nn as nn
import torch.nn.functional as F


class DQN(nn.Module):  # declare a deep Q-network class
    def __init__(self, n_observations,
                 n_actions):  # constructor to initialise DQN taking the state space and actions as parameters
        super(DQN, self).__init__()  # calls constructor to nn.Module to properly initialise the DQN
        # define three fully connected layers of an NN
        self.layer1 = nn.Linear(16, 256)  # 16 flattened state values as input, outputs 256 features
        self.layer2 = nn.Linear(256, 256)  # input = 256 features, output = 256 features
        self.layer3 = nn.Linear(256, n_actions)  # 256 features as input, outputs n_actions (4) Q-values - the actions for the env

    # defines how data flows through the network layer
    def forward(self, x):  # x = input state/batch of states
        x = F.relu(self.layer1(x)).float()  # apply ReLU activation function to layer 1 output
        x = F.relu(self.layer2(x))  # apply ReLU activation function to layer 2 output
        return self.layer3(x)  # returns Q-values for each action in given state

def optimize_model():
    if len(memory) < BATCH_SIZE:  # checks if enough transitions are stored to form a batch
        return
    transitions = memory.sample(BATCH_SIZE)  # samples a batch of transitions from replay memory
    batch = Transition(*zip(*transitions))  # converts batch-array of transitions to transition of batch-arrays
    # organises the batch so each component (state, action, reward, n_state) is separated for easy access

    # prepare batch components to feed into an NN
    non_final_mask = torch.tensor(tuple(map(lambda s: s is not None, batch.n_state)),
                                  dtype=torch.bool)  # boolean mask to indicate which n_states are not final states
    non_final_next_states = torch.tensor([s for s in batch.n_state if s is not None])  # concatenates non-final n_states
    state_batch = torch.stack(
        [torch.tensor(s) for s in batch.state])  # convert states to tensors & stack into a batch
    action_batch = torch.stack([torch.tensor(s) for s in batch.action]).unsqueeze(
        1)  # convert actions to tensors, stack & add a column dimension for gather
    reward_batch = torch.stack([torch.tensor(s) for s in batch.reward])  # convert rewards to tensors & stack

    state_action_values = policy_net(state_batch).gather(1,
                                                         action_batch)  # computes Q-values for the state-action pairs
                                                                        # in the batch using the policy network

    # computes expected Q-values for next states using target network, maximum Q-value for each non-final n_state
    next_state_values = torch.zeros(BATCH_SIZE)
    with torch.no_grad():
        next_state_values[non_final_mask] = target_net(non_final_next_states).max(1).values.float()
    # Compute the expected Q values using Bellman equation
    expected_state_action_values = (next_state_values * DISCOUNT_FACTOR) + reward_batch

    # Create huber loss function
    criterion = nn.SmoothL1Loss()
    # predicted Q-values by model, expected Q-values for state-action pairs, .unsqueeze to add extra dimension
    loss = criterion(state_action_values,
                     expected_state_action_values.unsqueeze(1))  # loss computed by predicted vs expected Q-values
    optimizer.zero_grad()  # clears previously accumulated gradients
    loss.backward()  # computes gradients of the loss with respect to the model parameters
    torch.nn.utils.clip_grad_value_(policy_net.parameters(),
                                    100)  # clips gradients to prevent them growing too large during training
    optimizer.step()  # update model parameters
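
For reference, the kind of temporary dtype check I have been trying to add inside optimize_model looks roughly like this (just a sketch - these prints are not in my code yet):

    # temporary debug prints (sketch) - placed just after the batch tensors are built
    print("state_batch dtype:", state_batch.dtype)
    print("non_final_next_states dtype:", non_final_next_states.dtype)
    print("policy_net weight dtype:", next(policy_net.parameters()).dtype)

If one of these prints torch.uint8 (i.e. Byte) while the weights print torch.float32, that would line up with the error message.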

1 Answer

Answered by Victor Björkgren:

What Karl said should do it.

Or change this row:

 non_final_next_states = torch.tensor([s for s in batch.n_state if s is not None])

to

 non_final_next_states = torch.tensor([s for s in batch.n_state if s is not None], dtype=torch.float)

The root cause, though, is most likely that your transitions are stored in byte dtype, and converting back and forth on every batch will take a toll on performance. So make sure you save your transitions with dtype float; calling .float() on the network's output can't help, because the Byte/Float multiplication already fails inside self.layer1(x) before that cast ever runs. Both my suggestion and Karl's could be left in as fail-safes in case torch decides to mess with your dtype.
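
For example, if your training loop follows the tutorial's memory.push(state, action, next_state, reward) pattern, the cast can happen once at the point where the transition is stored, so every batch sampled later is already float. A rough sketch (the names state, next_state, terminated and the memory.push call are assumed from the tutorial's training loop - adjust them to whatever your loop uses):

 import numpy as np

 # cast the observation to float32 once, where the transition is stored, so the
 # torch.tensor / torch.stack calls in optimize_model already produce float batches
 state_f = np.asarray(state, dtype=np.float32)
 next_state_f = None if terminated else np.asarray(next_state, dtype=np.float32)
 memory.push(state_f, action, next_state_f, reward)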