Understanding the implementation of T-Net in the PointNet model


I am reading the PointNet paper and trying to understand how to implement its T-Net block (the idea is the same for both the input transform and the feature transform). All the PyTorch implementations I have looked at do the same thing; this snippet is taken from here:

import torch
import torch.nn as nn
import torch.nn.functional as F

class STNkd(nn.Module):
    def __init__(self, k=64):
        super(STNkd, self).__init__()
        # Shared per-point MLP (64, 128, 1024), implemented as 1x1 convolutions
        self.conv1 = torch.nn.Conv1d(k, 64, 1)
        self.conv2 = torch.nn.Conv1d(64, 128, 1)
        self.conv3 = torch.nn.Conv1d(128, 1024, 1)
        # Fully connected layers that regress the k*k matrix entries
        self.fc1 = nn.Linear(1024, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, k*k)
        self.relu = nn.ReLU()

        self.bn1 = nn.BatchNorm1d(64)
        self.bn2 = nn.BatchNorm1d(128)
        self.bn3 = nn.BatchNorm1d(1024)
        self.bn4 = nn.BatchNorm1d(512)
        self.bn5 = nn.BatchNorm1d(256)

        self.k = k

    def forward(self, x):
        # x: (batch, k, num_points)
        batchsize = x.size()[0]
        x = F.relu(self.bn1(self.conv1(x)))
        x = F.relu(self.bn2(self.conv2(x)))
        x = F.relu(self.bn3(self.conv3(x)))
        # Max pooling over the points (symmetric function) -> global feature
        x = torch.max(x, 2, keepdim=True)[0]
        x = x.view(-1, 1024)

        x = F.relu(self.bn4(self.fc1(x)))
        x = F.relu(self.bn5(self.fc2(x)))
        x = self.fc3(x)

        # Add a flattened k x k identity matrix to every sample in the batch
        iden = torch.eye(self.k, dtype=x.dtype, device=x.device).view(1, self.k*self.k).repeat(batchsize, 1)
        x = x + iden
        x = x.view(-1, self.k, self.k)
        return x
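For context, here is a minimal sketch of how the k x k matrix returned by a T-Net is typically applied to its input. The tensor `trans` below is a stand-in for the output of `STNkd` (an assumption so the example runs on its own), and `apply_transform` is a hypothetical helper name; the shapes follow the `(batch, k, num_points)` layout used by the `Conv1d` layers above.

```python
import torch

def apply_transform(points: torch.Tensor, trans: torch.Tensor) -> torch.Tensor:
    """Apply a per-sample k x k transform to a (B, k, N) point/feature tensor.

    Each point p (a row vector of length k) is mapped to p @ trans.
    """
    # (B, k, N) -> (B, N, k) so that each row is one point
    x = points.transpose(2, 1)
    # Batched matrix multiply: (B, N, k) @ (B, k, k) -> (B, N, k)
    x = torch.bmm(x, trans)
    # Back to the channels-first layout expected by Conv1d: (B, k, N)
    return x.transpose(2, 1)

B, k, N = 4, 3, 1024
points = torch.randn(B, k, N)
# Stand-in for the T-Net output: the identity leaves the points unchanged
trans = torch.eye(k).unsqueeze(0).repeat(B, 1, 1)
out = apply_transform(points, trans)
print(out.shape)                     # torch.Size([4, 3, 1024])
print(torch.allclose(out, points))   # True
```

This is why an identity output is a natural starting point: it makes the transform a no-op, so the rest of the network initially sees the untransformed points.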

I can follow all the steps up to the iden part. If I understand correctly, they add an identity matrix to the regressed one. Does this make sense?
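One way to read the `+ iden` step is that the network only regresses an offset from the identity, so its output is `I + f(x)`. The sketch below assumes the last layer (`fc3` in the snippet) is zero-initialized, which the snippet itself does not do; under that assumption the T-Net outputs exactly the identity at the start of training. It also shows that adding `iden` is equivalent to shifting that layer's bias by the flattened identity. The names `head_residual` and `head_bias` are hypothetical.

```python
import torch
import torch.nn as nn

k, hidden, B = 3, 256, 8

# Variant A (as in the snippet): linear head, then add a flattened identity.
# Hypothetical choice: zero-initialize the head so its output starts at 0.
head_residual = nn.Linear(hidden, k * k)
nn.init.zeros_(head_residual.weight)
nn.init.zeros_(head_residual.bias)

# Variant B: no explicit "+ iden"; the identity is folded into the bias instead.
head_bias = nn.Linear(hidden, k * k)
nn.init.zeros_(head_bias.weight)
with torch.no_grad():
    head_bias.bias.copy_(torch.eye(k).flatten())

h = torch.randn(B, hidden)                   # stand-in for the 256-d output of fc2
iden = torch.eye(k).flatten().unsqueeze(0)   # (1, k*k), broadcasts over the batch

A = (head_residual(h) + iden).view(-1, k, k)
A_bias = head_bias(h).view(-1, k, k)

# At initialization both variants return the identity for every sample
eye_batch = torch.eye(k).expand(B, k, k)
print(torch.allclose(A, eye_batch))       # True
print(torch.allclose(A_bias, eye_batch))  # True

# For any weights W and bias b, "+ iden" is the same as using bias b + vec(I)
W, b = torch.randn(k * k, hidden), torch.randn(k * k)
lhs = h @ W.T + b + iden
rhs = h @ W.T + (b + iden.squeeze(0))
print(torch.allclose(lhs, rhs))  # True
```

Since the two variants describe the same family of functions, "initialized as an identity" can be read as "the learnable offset starts at zero (or the bias starts at vec(I))". With PyTorch's default random initialization of `fc3`, the snippet's output at the start of training is the identity plus whatever that randomly initialized layer produces.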

According to the paper:

  1. The output matrix is initialized as an identity matrix.
  2. The feature transformation matrix is constrained to be close to an orthogonal matrix.

The second point is clear: just add the regularization term ||I - A A^T||_F^2 to the final loss, where A is the predicted feature transformation matrix. The first point is less clear to me. "Initialized" seems like the wrong word, since this matrix is predicted by the network; adding an identity matrix at every forward pass (even at inference) does not sound like an initialization.
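The orthogonality regularizer from the paper, ||I - A A^T||_F^2 with A the predicted matrix, can be sketched as follows, assuming `trans` is the `(B, k, k)` batch returned by the feature T-Net. The function name `orthogonality_loss` and the choice to average over the batch are assumptions of this sketch.

```python
import torch

def orthogonality_loss(trans: torch.Tensor) -> torch.Tensor:
    """Batch mean of ||I - A A^T||_F^2 for A of shape (B, k, k)."""
    k = trans.size(1)
    I = torch.eye(k, dtype=trans.dtype, device=trans.device).unsqueeze(0)  # (1, k, k)
    diff = I - torch.bmm(trans, trans.transpose(1, 2))                   # (B, k, k)
    # Squared Frobenius norm per sample, then average over the batch
    return (diff ** 2).sum(dim=(1, 2)).mean()

B, k = 4, 64
identity = torch.eye(k).unsqueeze(0).repeat(B, 1, 1)
print(orthogonality_loss(identity).item())  # 0.0

# A random orthogonal matrix (from a QR decomposition) also gives ~0
q, _ = torch.linalg.qr(torch.randn(k, k, dtype=torch.float64))
print(orthogonality_loss(q.unsqueeze(0)).item() < 1e-10)  # True

# Scaling the identity by 2 gives ||I - 4I||_F^2 = 9k = 576 per sample
print(orthogonality_loss(2 * identity).item())  # 576.0
```

In training, this term would be added to the task loss with a small weight, treated here as a hyperparameter. Note that it only asks A to be orthogonal, not to equal the identity; the `+ iden` offset is what sets the starting point.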

Can someone bridge these gaps between the paper and the implementation?

Some of the PyTorch implementations I have checked:

  1. Implementation No 1
  2. Implementation No 2