Simple ResNet model cannot tell whether two monochrome images are the same color


I ran into a problem while training an image-comparison model and reduced it to the following minimal case.

I feed pairs of images (3x128x128) to a model; each image is either completely black or completely white. The model passes the two images through separate ResNet encoders and concatenates their outputs, which then go through a fully connected layer. It should return 1.0 if both images are the same color (both black or both white) and 0.0 otherwise. However, the model converges to always predicting ~0.5, even though this task should be trivial.

The model:

class TemplateEvaluator(nn.Module):
    def __init__(self, q_encoder=resnet18(), t_encoder=resnet18()):
        super(TemplateEvaluator, self).__init__()
        self.q_encoder = q_encoder
        self.t_encoder = t_encoder
        
        # Ensure the ResNet weights are trainable (this is already True by default)
        for param in self.q_encoder.parameters():
            param.requires_grad = True
        for param in self.t_encoder.parameters():
            param.requires_grad = True
        
        self.fc = nn.Sequential(
            nn.Linear(2000, 1),
            nn.Sigmoid()
        )
    
    def forward(self, data):
        q = data[0]
        t = data[1]
        
        # If single (unbatched) images, add a batch dimension:
        if q.ndim == 3: q = q.unsqueeze(0)
        if t.ndim == 3: t = t.unsqueeze(0)
        
        q = self.q_encoder(q)
        t = self.t_encoder(t)
        res = self.fc(torch.cat([q,t],-1)).flatten()
        return res
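
For reference, here is a quick shape check of the forward pass (a sketch; it assumes the usual imports and torchvision's resnet18, whose default head outputs 1000 features, which is why fc expects 2000 inputs):

import torch
import torch.nn as nn
from torchvision.models import resnet18

# Smoke test: a batch of two random image pairs.
t_eval = TemplateEvaluator()
pair = torch.stack([torch.rand(2, 3, 128, 128),   # first image of each pair
                    torch.rand(2, 3, 128, 128)])  # second image of each pair
print(t_eval(pair).shape)  # torch.Size([2]) -- one sigmoid score per pair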

The dataloader:

class BlackOrWhiteDataset(Dataset):
    def __init__(self):
        self.tf = transforms.ToTensor()

    def __getitem__(self, i):
        black = (0,0,0)
        white = (255,255,255)
        
        x1_col = black if (np.random.random() > 0.5) else white
        x2_col = black if (np.random.random() > 0.5) else white
        y = torch.tensor(x1_col == x2_col, dtype=torch.float)

        x1 = Image.new('RGB', (img_width,img_width), x1_col)
        x2 = Image.new('RGB', (img_width,img_width), x2_col)
        return self.tf(x1), self.tf(x2), y
    
    def __len__(self):
        return 100

def create_data_loader(dataset, batch_size, verbose=True):
    dl = torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True,
                                     collate_fn=lambda x: tuple(x_.to(device) for x_ in default_collate(x)))
    return dl
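
For completeness, the snippets above rely on a few imports and globals that are not shown; a plausible setup (the names img_width, device, and dl come from the code above, but the exact values and batch size are my assumptions) is:

import sys

import matplotlib.pyplot as plt
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from PIL import Image
from torch.utils.data import Dataset
from torch.utils.data.dataloader import default_collate
from torchvision import transforms
from torchvision.models import resnet18
from tqdm import tqdm

img_width = 128  # side length used by BlackOrWhiteDataset (assumed)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# The loader consumed by the training loop below (batch size assumed).
dl = create_data_loader(BlackOrWhiteDataset(), batch_size=10)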

The training:

t_eval = TemplateEvaluator().to(device)
opt = optim.SGD(t_eval.parameters(), lr=0.001, momentum=0.01)

epochs = 10
losses = []

for epoch in tqdm(range(epochs)):
    t_eval.train()
    
    for X1, X2, Y in dl:
        Y_pred = t_eval(torch.stack([X1,X2]))
        loss = F.mse_loss(Y_pred,Y)
        
        opt.zero_grad()
        loss.backward()
        opt.step()
        
        sys.stdout.write('\r')
        sys.stdout.write("loss: %f" % loss.item())
        sys.stdout.flush()
        
        losses.append(loss.item())

plt.plot(losses)
plt.ylim(0,1)

And the results:

  0%|          | 0/10 [00:00<?, ?it/s]
loss: 0.259106
 10%|█         | 1/10 [00:01<00:13,  1.54s/it]
loss: 0.241787
 20%|██        | 2/10 [00:02<00:11,  1.40s/it]
loss: 0.258519
 30%|███       | 3/10 [00:04<00:09,  1.36s/it]
loss: 0.250100
 40%|████      | 4/10 [00:05<00:08,  1.35s/it]
loss: 0.257565
 50%|█████     | 5/10 [00:06<00:06,  1.35s/it]
loss: 0.264662
 60%|██████    | 6/10 [00:08<00:05,  1.35s/it]
loss: 0.246792
 70%|███████   | 7/10 [00:09<00:04,  1.34s/it]
loss: 0.260988
 80%|████████  | 8/10 [00:10<00:02,  1.34s/it]
loss: 0.241590
 90%|█████████ | 9/10 [00:12<00:01,  1.34s/it]
loss: 0.250159
100%|██████████| 10/10 [00:13<00:00,  1.35s/it]

[loss plot: the loss hovers around 0.25 for all 10 epochs]

Example case:

t_eval.eval()

for X1, X2, Y in dl:
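    # view(...) is the author's own image-display helper (not shown)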
    view([X1[0],X2[0]])
    print(Y[0].item())
    print(t_eval(torch.stack([X1[0],X2[0]])).item())
    break

gives:

[image: a pair of images, the true label, and a prediction of ~0.5]

or:

[image: another pair, again with a prediction of ~0.5]

When I set 'Y' to all zeros, the model converges and 'Y_pred' approaches zero, so the optimizer works. When I set 'Y' to indicate whether the first image is black, the model also converges as expected, and likewise for the second image. So the model can interpret each image individually.

Thus, the model seems to fail at combining the information from the two inputs, and I do not see why.

Update

Thanks to user23818208, I found a solution.

A single-layer perceptron cannot compute equality; this is the classic XOR/XNOR problem. Instead of combining the image features by concatenation, I now combine them by element-wise multiplication, like so:

class TemplateEvaluator(nn.Module):
    def __init__(self, q_encoder=resnet18(), t_encoder=resnet18()):
        super(TemplateEvaluator, self).__init__()
        self.q_encoder = q_encoder
        self.t_encoder = t_encoder
        
        self.fc = nn.Sequential(
            nn.Linear(1000, 1),
            nn.Sigmoid()
        )
    
    def forward(self, data):
        q = data[0]
        t = data[1]
        
        if q.ndim == 3: q = q.unsqueeze(0)
        if t.ndim == 3: t = t.unsqueeze(0)
        
        q_features = self.q_encoder(q)
        t_features = self.t_encoder(t)
        
        combined_features = q_features * t_features
        
        res = self.fc(combined_features).flatten()
        return res
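
One way to see why the product works where concatenation fails: if the encoders learn features whose sign tracks the color, the element-wise product is positive exactly when the two images match, and a single linear layer can threshold that. A toy sketch with hypothetical one-dimensional ±1 features:

import torch

# Hypothetical 1-D encodings: +1 for a white image, -1 for a black one.
q = torch.tensor([ 1., -1.,  1., -1.])  # first image of each pair
t = torch.tensor([ 1., -1., -1.,  1.])  # second image of each pair

# Positive iff the colors match -- linearly separable (weight 1, bias 0).
print(q * t)  # tensor([ 1.,  1., -1., -1.])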

The model now converges:

  0%|          | 0/10 [00:00<?, ?it/s]
loss: 0.065883
 10%|█         | 1/10 [00:01<00:16,  1.89s/it]
loss: 0.002977
 20%|██        | 2/10 [00:03<00:14,  1.76s/it]
loss: 0.000158
 30%|███       | 3/10 [00:05<00:12,  1.74s/it]
loss: 0.000015
 40%|████      | 4/10 [00:06<00:10,  1.71s/it]
loss: 0.000003
 50%|█████     | 5/10 [00:08<00:08,  1.71s/it]
loss: 0.000002
 60%|██████    | 6/10 [00:10<00:06,  1.70s/it]
loss: 0.000001
 70%|███████   | 7/10 [00:12<00:05,  1.70s/it]
loss: 0.000001
 80%|████████  | 8/10 [00:13<00:03,  1.69s/it]
loss: 0.000000
 90%|█████████ | 9/10 [00:15<00:01,  1.70s/it]
loss: 0.000000
100%|██████████| 10/10 [00:17<00:00,  1.71s/it]

[loss plot: the loss drops to near zero within the first epochs]

1 Answer

Answer by user23818208:

Your model checks for equality with a single dense layer on top of the concatenated features. A single-layer perceptron, however, cannot learn the XOR function, and by extension cannot learn XNOR (which is equality); this is a famous result from early machine learning history.
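
To make this concrete, here is a minimal sketch (assuming only PyTorch, independent of the question's setup) that trains a single Linear + Sigmoid layer directly on XNOR; the loss plateaus near 0.25, the same value seen in the question's training log:

import torch
import torch.nn as nn
import torch.nn.functional as F

# The four possible input pairs and their XNOR (equality) targets.
X = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = torch.tensor([1., 0., 0., 1.])

model = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid())
opt = torch.optim.SGD(model.parameters(), lr=0.5)

for _ in range(5000):
    loss = F.mse_loss(model(X).flatten(), Y)
    opt.zero_grad()
    loss.backward()
    opt.step()

# No linear boundary separates the four points, so at least one pair stays
# misclassified and the MSE cannot drop much below 0.25.
print(loss.item())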