This is my dataset code:
import json
import os
import time

import cv2
import numpy as np
import torch as th
from torch.utils.data import Dataset, DataLoader

# Load the image, resize it to (h, w), convert to a tensor, and read the annotation coords.
def load_coord_data(img_path, anno_path, h, w):
    img = cv2.imread(img_path, cv2.IMREAD_COLOR)
    img = cv2.resize(img, (w, h))
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    img = img2tensor(img)  # my helper: HWC image -> CHW float tensor
    with open(anno_path, 'r') as f:
        anno = json.load(f)
    coords = np.full((50, 2), -1.0, dtype=np.float32)  # 50 classes' (x, y); -1 marks a missing class
    for idx in range(50):
        key = str(idx)  # json.load produces string keys
        if key in anno:
            coords[idx, 0], coords[idx, 1] = anno[key]['x'], anno[key]['y']
    return img, coords
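For reference, load_coord_data assumes each annotation file maps a class index to its x/y position, something like this (hypothetical values):

    {"0": {"x": 12.5, "y": 40.0}, "7": {"x": 100.0, "y": 3.5}}

Classes absent from the file are left at (-1, -1).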
class KeyPointsDataset(Dataset):
    def __init__(self, h, w, input_dir="xxx"):
        self.h, self.w = h, w
        files = os.listdir(input_dir)
        # Sorting both lists keeps each image aligned with its annotation file.
        self.img_files = sorted(os.path.join(input_dir, fn) for fn in files if fn.endswith(".jpg"))
        self.anno_files = sorted(os.path.join(input_dir, fn) for fn in files if fn.endswith(".json"))

    def __getitem__(self, idx):
        img, labels = load_coord_data(self.img_files[idx], self.anno_files[idx], self.h, self.w)
        return img, labels

    def __len__(self):
        return len(self.img_files)
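A quick sanity check of the dataset (hypothetical sizes; "xxx" must contain matched .jpg/.json pairs):

    ds = KeyPointsDataset(h=256, w=256)
    img, coords = ds[0]
    # Expect something like torch.Size([3, 256, 256]) and (50, 2),
    # depending on what img2tensor returns.
    print(img.shape, coords.shape)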
This is my training code:
dataset = KeyPointsDataset(h, w)
dataloader = DataLoader(
    dataset,
    batch_size=1,
    shuffle=True,
    num_workers=1,
    drop_last=False,
    pin_memory=True,
)
loss_fn = th.nn.MSELoss()
device = th.device('mps')
self.model = self.model.to(device)
self.model.train()
for epoch in range(10):
    for step, (img, labels) in enumerate(dataloader):
        img, labels = img.to(device, non_blocking=True), labels.to(device, non_blocking=True)
        # Time the dataloader: everything since the end of the previous step.
        if step > 0:
            dataloader_time = round(time.monotonic() - toc, 2)
        else:
            dataloader_time = -1
        tic = time.monotonic()
        pred = self.model(img)
        _loss = loss_fn(pred, labels)
        _loss.backward()
        self.opt.step()
        self.opt.zero_grad()
        toc = time.monotonic()
        if step % self.config.LOG_STEPS == 0:
            print('Epoch {:03d} | Step {:05d} | Step Loss {:.6} | Train time {} | Dataloader time {}'.format(
                epoch, step, float(_loss.cpu().detach().numpy()), round(toc - tic, 2), dataloader_time))
When I train my model on MPS, the dataloader is fast for the first few steps but slows down afterwards:
Epoch 000 | Step 00000 | Step Loss 0.828719 | Train time 3.6 | Dataloader time -1.0
Epoch 000 | Step 00001 | Step Loss 0.708257 | Train time 1.09 | Dataloader time 0.13
Epoch 000 | Step 00002 | Step Loss 0.658343 | Train time 1.15 | Dataloader time 1.27
Epoch 000 | Step 00003 | Step Loss 0.493049 | Train time 1.8 | Dataloader time 2.02
Epoch 000 | Step 00004 | Step Loss 2.28905 | Train time 1.36 | Dataloader time 4.77
Epoch 000 | Step 00005 | Step Loss 0.322044 | Train time 2.05 | Dataloader time 3.58
Epoch 000 | Step 00006 | Step Loss 0.535195 | Train time 1.61 | Dataloader time 5.31
Epoch 000 | Step 00007 | Step Loss 0.647095 | Train time 1.93 | Dataloader time 4.69
Epoch 000 | Step 00008 | Step Loss 0.572585 | Train time 2.03 | Dataloader time 4.75
Epoch 000 | Step 00009 | Step Loss 0.533676 | Train time 5.66 | Dataloader time 6.76
Epoch 000 | Step 00010 | Step Loss 0.569616 | Train time 2.16 | Dataloader time 8.46
Epoch 000 | Step 00011 | Step Loss 0.527826 | Train time 1.95 | Dataloader time 6.09
Epoch 000 | Step 00012 | Step Loss 0.429697 | Train time 2.89 | Dataloader time 5.06
Epoch 000 | Step 00013 | Step Loss 0.463338 | Train time 3.53 | Dataloader time 7.06
Epoch 000 | Step 00014 | Step Loss 0.573107 | Train time 3.31 | Dataloader time 7.57
Epoch 000 | Step 00015 | Step Loss 0.664436 | Train time 2.01 | Dataloader time 6.17
Epoch 000 | Step 00016 | Step Loss 0.420959 | Train time 1.76 | Dataloader time 5.49
Epoch 000 | Step 00017 | Step Loss 0.366839 | Train time 5.88 | Dataloader time 5.55
However, if I switch to CPU (changing the device to th.device('cpu')), the dataloader is fast and consistent:
Epoch 000 | Step 00000 | Step Loss 0.768135 | Train time 6.0 | Dataloader time -1
Epoch 000 | Step 00001 | Step Loss 0.912373 | Train time 4.49 | Dataloader time 0.0
Epoch 000 | Step 00002 | Step Loss 0.678868 | Train time 5.33 | Dataloader time 0.0
Epoch 000 | Step 00003 | Step Loss 0.518494 | Train time 5.62 | Dataloader time 0.0
Epoch 000 | Step 00004 | Step Loss 0.647296 | Train time 5.34 | Dataloader time 0.0
Epoch 000 | Step 00005 | Step Loss 0.621026 | Train time 4.64 | Dataloader time 0.0
Epoch 000 | Step 00006 | Step Loss 0.611825 | Train time 5.24 | Dataloader time 0.0
Epoch 000 | Step 00007 | Step Loss 0.557198 | Train time 4.31 | Dataloader time 0.0
Epoch 000 | Step 00008 | Step Loss 0.341876 | Train time 5.15 | Dataloader time 0.0
Epoch 000 | Step 00009 | Step Loss 0.425114 | Train time 5.41 | Dataloader time 0.0
Epoch 000 | Step 00010 | Step Loss 0.526096 | Train time 5.83 | Dataloader time 0.0
Epoch 000 | Step 00011 | Step Loss 0.541208 | Train time 4.14 | Dataloader time 0.0
I've checked the time taken by img.to(device, non_blocking=True) and labels.to(device, non_blocking=True) on MPS; it's tiny compared to the dataloader time:
...
t1 = time.time()
img, labels = img.to(device, non_blocking=True), labels.to(device, non_blocking=True)
print("to mps time: ", time.time() - t1)
...
It gives:
to mps time: 0.0009737014770507812
Epoch 000 | Step 00000 | Step Loss 0.779288 | Train time 3.54 | Dataloader time -1
to mps time: 0.0029366016387939453
Epoch 000 | Step 00001 | Step Loss 0.81939 | Train time 1.12 | Dataloader time 0.39
to mps time: 0.0012209415435791016
Epoch 000 | Step 00002 | Step Loss 0.755161 | Train time 1.22 | Dataloader time 1.94
to mps time: 0.0012700557708740234
Epoch 000 | Step 00003 | Step Loss 0.443458 | Train time 3.04 | Dataloader time 2.13
to mps time: 0.0002002716064453125
Epoch 000 | Step 00004 | Step Loss 7.6309 | Train time 3.05 | Dataloader time 6.62
to mps time: 0.0003991127014160156
Epoch 000 | Step 00005 | Step Loss 0.802644 | Train time 1.99 | Dataloader time 6.63
to mps time: 0.0002162456512451172
Epoch 000 | Step 00006 | Step Loss 0.690297 | Train time 1.95 | Dataloader time 5.55
to mps time: 0.00043082237243652344
Epoch 000 | Step 00007 | Step Loss 0.805149 | Train time 3.16 | Dataloader time 6.13
to mps time: 0.00084686279296875
Epoch 000 | Step 00008 | Step Loss 0.729108 | Train time 1.84 | Dataloader time 6.87
to mps time: 0.00031113624572753906
Epoch 000 | Step 00009 | Step Loss 0.575548 | Train time 1.73 | Dataloader time 4.74
I don't know what slows down the dataloader when using MPS, or why it is so much faster on CPU. It seems that the to(device) transfer is not the problem.
It turns out that most of the time comes from _loss.cpu().detach().numpy(), which synchronizes the GPU: MPS executes asynchronously, so this call blocks until the queued forward/backward work finishes, and since toc is recorded before the print, that wait gets counted as the next step's dataloader time. In my case, making the batch size smaller relieves the problem. See: https://discuss.pytorch.org/t/calling-loss-item-is-very-slow/99774
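For what it's worth, here is a sketch that should make the timings attribute correctly (assuming torch.mps.synchronize() is available in your PyTorch version; it reuses the names from the loop above): synchronize before taking the toc timestamp so the GPU wait lands in train time, and only copy the loss back to the CPU on log steps:

    # Inside the inner training loop, replacing the tail of each step:
    pred = self.model(img)
    _loss = loss_fn(pred, labels)
    _loss.backward()
    self.opt.step()
    self.opt.zero_grad()
    th.mps.synchronize()    # block until the queued MPS kernels finish...
    toc = time.monotonic()  # ...so their cost is counted as train time
    if step % self.config.LOG_STEPS == 0:
        # The CPU copy now happens only every LOG_STEPS steps, and the
        # GPU work it used to wait on has already completed above.
        print('Epoch {:03d} | Step {:05d} | Step Loss {:.6} | Train time {} | Dataloader time {}'.format(
            epoch, step, _loss.item(), round(toc - tic, 2), dataloader_time))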