What should collator do exactly?

27 Views Asked by Kuraga At 12 March 2024 at 13:56

Suppose we have an audio classification task (AudioMNIST).

My pipeline and other pipelines I’ve seen consist of the next steps:

Read the dataset (the data samples).
Do the base transforms (merge the audio channels, change the bitrate, etc).
Split the dataset into the train one, the test one, etc.
Do the main transforms (different for the train and the test) such as the augmentation.
Batch (along with the sampling).
Pad/Truncate the batch samples.
Do the forward pass with the batch.
<…>

I saw the scheme:

Dataset or a subclass - pp. 1., 2., 3., 4.
Collator - p. 6.

Either:

Dataset or a subclass - p. 1.
somebody else - pp. 2., 3., 4.
Collator - p. 6.

Or:

Dataset or a subclass - p. 1.
somebody else - p. 3.
Collator - pp. 2., 4., 6.

What should the collator do and what shouldn’t? (The main question.) What is the correct scheme?

Original Q&A

There are 1 best solutions below

Karl On 13 March 2024 at 16:59

You've tagged this with pytorch, so I'll give the pytorch answer.

Pytorch data utils has a Dataset and a DataLoader. tl;dr, the Dataset handles loading a single example, while the DataLoader handles batching and any bulk processing.

The Dataset has two methods, __len__ for determining the number of items in the dataset and __getitem__ for loading a single item.

class MyDataset(Dataset):
    def __init__(self):
        ...

    def __len__(self):
        ...

    def __getitem__(self, index):
        ...

The DataLoader is passed a list of outputs from the Dataset (ie batch_input = [dataset.__getitem__(i) for i in idxs]). The batch input is sent to the collate_fn of the DataLoader.

def my_collate_fn(batch):
    ...

dataloader = DataLoader(my_dataset, batch_size, collate_fn=my_collate_fn)

In terms of thinking about what to do where, the Dataset should handle loading single examples. The Dataset will be called in parallel, so tasks that are CPU-bound should go in the Dataset. Loading from disk (if applicable) is also typically done in the Dataset.

The collate_fn handles converting a list of outputs from your Dataset into whatever format your model wants. Since the DataLoader deals with a batch of data, it can be more efficient to apply batch processing steps. Stacking tensors, padding to length, generating masks or other bulk tensor ops work well in the collate_fn.

In general, think of the Dataset as running multi-process on single examples, while the DataLoader running a single-process on a batch of examples.

What should collator do exactly?

There are 1 best solutions below

Related Questions in DEEP-LEARNING

Related Questions in PYTORCH

Related Questions in NEURAL-NETWORK

Related Questions in COLLATION

Trending Questions

Popular # Hahtags

Popular Questions