How to normalize a batch of tensors in PyTorch in a speed-optimized manner


I have an input batch which is a list (size 8) of images of shape (480, 640, 3). I would like to convert them to PyTorch tensors, normalize them with a mean and std, and pass them to a model as a tensor of shape (8, 3, 480, 640). Presently I'm doing the following, which works.

import torch as T
from torchvision import transforms

batch_size=8
height = 480
width = 640
input_shape = (batch_size, 3, height, width)

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225])
])


input_batch = [...]  # e.g. a list of 8 arrays, each np.ones((480, 640, 3))

pre_items = [transform(item) for item in input_batch]
pre_items = T.stack(pre_items).to("cuda")

This is obviously not optimal, because the preprocessing happens on the CPU before the data is moved to CUDA.

What's the correct way to perform this on the GPU, operating on the batch as a whole?

My attempt at a solution was:

import torch as T

batch_size = 8
height = 480
width = 640

mean = T.ones((batch_size, height, width, 3)).to("cuda") * T.tensor([0.485, 0.456, 0.406]).to("cuda")
std = T.ones((batch_size, height, width, 3)).to("cuda") * T.tensor([0.229, 0.224, 0.225]).to("cuda")

input_batch = T.stack([T.tensor(item).to("cuda").float() for item in input_batch])
pre_items = (input_batch - mean)/std
pre_items = T.permute(pre_items, (0,3,1,2))

The output of this script does not match the tensor produced by the CPU-bound solution above.


3 Answers

Klops (accepted answer):

Following the OP's clarification, this is a fast way to perform the normalization on the GPU.

  • Note: I reshape the mean and std variables so that they broadcast against input_batch, instead of stacking the same values multiple times (this is called broadcasting); see the code below.

import torch as T
import numpy as np

batch_size = 8
height = 480
width = 640
channels = 3

# CPU
input_batch = np.ones((batch_size, height, width, channels))
mean = np.array([0.485, 0.456, 0.406]).reshape((1, 1, 1, -1))  # match input_batch dimension
std = np.array([0.229, 0.224, 0.225]).reshape((1, 1, 1, -1))  # match input_batch dimension

# GPU
input_batch = T.from_numpy(input_batch).to("cuda")
mean = T.from_numpy(mean).to("cuda")
std = T.from_numpy(std).to("cuda")

pre_items = (input_batch - mean) / std  # broadcasting: (N, H, W, C) against (1, 1, 1, C)
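
If the downstream model expects the usual (N, C, H, W) float32 layout, the result above still needs a permute and a dtype cast. A minimal sketch of that last step (the layout and dtype requirements are assumptions taken from the question, not part of the code above):

# assumption: the model wants float32 input in (N, C, H, W) order
pre_items = pre_items.permute(0, 3, 1, 2).contiguous().float()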

Klops:

I think your approach is very good, and performing pre-processing on the CPU is actually something you want! While the GPU is busy doing heavy calculations, you can use multiple workers on the CPU to prepare the next batch(es) of data.

Of course you could pre-process your whole dataset at once and save it to disk, but looking at the kinds of pre-processing involved I would not recommend that. Many transformations are better applied online (augmentations), or cannot be avoided at all (ToTensor()).

If you are not happy with the speed (e.g. your GPU is not utilized at 100% because data preparation is the bottleneck), you should look into parallelizing the CPU data preprocessing (docs) using the num_workers parameter:

torch.utils.data.DataLoader(
    imagenet_data,    # any torch.utils.data.Dataset
    batch_size=4,
    shuffle=True,
    num_workers=8     # number of CPU worker processes preparing batches
)

You can also look into the pin_memory=True parameter. It does not move the data to the GPU by itself; instead it places the loaded (and pre-processed) batches in page-locked host memory, which makes the later host-to-GPU copy faster and lets it run asynchronously with non_blocking=True, so less time is wasted on the transfer when the data is needed.
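
Roughly how these pieces fit together (imagenet_data stands for any torch.utils.data.Dataset; the non_blocking transfer shown is one common pattern, not part of the code above):

loader = torch.utils.data.DataLoader(
    imagenet_data,      # placeholder: any torch.utils.data.Dataset
    batch_size=8,
    shuffle=True,
    num_workers=8,      # CPU worker processes prepare batches in parallel
    pin_memory=True     # batches land in page-locked host memory
)

for batch, target in loader:
    # pinned host memory allows the host-to-GPU copy to run asynchronously
    batch = batch.to("cuda", non_blocking=True)
    target = target.to("cuda", non_blocking=True)
    # ... forward / backward pass ...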

noobvorld (OP):

With @Klops' help, here's the complete solution:

import torch as T
import numpy as np

batch_size = 8
height = 480
width = 640
channels = 3

# broadcastable against (N, H, W, C)
mean = np.array([0.485, 0.456, 0.406]).reshape((1, 1, 1, -1)).astype("float32")
std = np.array([0.229, 0.224, 0.225]).reshape((1, 1, 1, -1)).astype("float32")

mean = T.from_numpy(mean).to("cuda")
std = T.from_numpy(std).to("cuda")

# list of N images, each (H, W, C)
input_batch = [np.ones((height, width, channels))] * batch_size

# (N, H, W, C), scaled to [0, 1] like ToTensor()
input_batch = T.stack([T.from_numpy(item).to("cuda").float() / 255.0 for item in input_batch])
input_batch = (input_batch - mean) / std

# (N, C, H, W)
input_batch = T.permute(input_batch, (0, 3, 1, 2)).contiguous()

When compared to the output of torchvision.transforms.ToTensor(), there are slight variations in each item after the following line:

input_batch = T.stack([T.from_numpy(item).to("cuda").float() / 255.0 for item in input_batch])

This variation was found to be small when tested with T.allclose().
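
For reference, a check along these lines can confirm the two pipelines agree (raw_batch stands for the original list of uint8 H x W x C images, and the tolerance value is an assumption, not from the post):

from torchvision import transforms

transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize([0.485, 0.456, 0.406],
                         [0.229, 0.224, 0.225])
])

# reference result from the CPU pipeline in the question, shape (N, C, H, W)
reference = T.stack([transform(item) for item in raw_batch]).to("cuda")

# input_batch is the (N, C, H, W) tensor computed above
print(T.allclose(reference, input_batch, atol=1e-6))  # atol is an assumed tolerance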