Suppose we have an audio classification task (AudioMNIST).
My pipeline, and other pipelines I've seen, consist of the following steps:
1. Read the dataset (the data samples).
2. Do the base transforms (merge the audio channels, change the bitrate, etc.).
3. Split the dataset into train, test, etc. subsets.
4. Do the main transforms (different for train and test), such as augmentation.
5. Batch (along with the sampling).
6. Pad/truncate the batch samples.
7. Do the forward pass with the batch.
8. <…>
I have seen this scheme:
- Dataset or a subclass - steps 1, 2, 3, 4.
- Collator - step 6.
Or:
- Dataset or a subclass - step 1.
- some other component - steps 2, 3, 4.
- Collator - step 6.
Or:
- Dataset or a subclass - step 1.
- some other component - step 3.
- Collator - steps 2, 4, 6.
What should the collator do, and what shouldn't it do? (This is the main question.) What is the correct scheme?
You've tagged this with pytorch, so I'll give the pytorch answer.
PyTorch's data utilities (`torch.utils.data`) provide a `Dataset` and a `DataLoader`. tl;dr: the `Dataset` handles loading a single example, while the `DataLoader` handles batching and any bulk processing.

The `Dataset` has two methods: `__len__`, which returns the number of items in the dataset, and `__getitem__`, which loads a single item.

The `DataLoader` builds a list of outputs from the `Dataset` (i.e. `batch_input = [dataset.__getitem__(i) for i in idxs]`). The batch input is sent to the `collate_fn` of the `DataLoader`.

In terms of thinking about what to do where, the `Dataset` should handle loading single examples. With `num_workers > 0`, the `Dataset` is called in parallel across worker processes, so per-example tasks that are CPU-bound should go in the `Dataset`. Loading from disk (if applicable) is also typically done in the `Dataset`.

The `collate_fn` handles converting a list of outputs from your `Dataset` into whatever format your model wants. Since it deals with a whole batch of data, it can be more efficient to apply batch processing steps there. Stacking tensors, padding to length, generating masks, and other bulk tensor ops work well in the `collate_fn`.

In general, think of the `Dataset` as working on single examples and the `collate_fn` as working on a batch of examples. (With `num_workers > 0`, the `collate_fn` also runs inside the worker processes, so the split is about per-example versus per-batch work rather than about which process does it.)
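To make the split concrete, here is a minimal sketch of how the steps from the question could be divided up. It is only an illustration under stated assumptions: `ToyAudioDataset` and `pad_collate` are hypothetical names, the waveforms are synthetic noise rather than real AudioMNIST files, and the "augmentation" is just a placeholder random gain. Per-example work (load, merge channels, augment) sits in `__getitem__`; per-batch work (pad, build a mask, stack labels) sits in the `collate_fn`.

```python
import torch
from torch.utils.data import DataLoader, Dataset


class ToyAudioDataset(Dataset):
    """Hypothetical stand-in for AudioMNIST: synthetic waveforms of varying length."""

    def __init__(self, lengths, num_channels=2, train=True):
        self.lengths = list(lengths)
        self.num_channels = num_channels
        self.train = train

    def __len__(self):
        return len(self.lengths)

    def __getitem__(self, idx):
        # "Load" one example (random noise here instead of reading a file).
        gen = torch.Generator().manual_seed(idx)
        waveform = torch.randn(self.num_channels, self.lengths[idx], generator=gen)
        # Per-example base transform: merge channels to mono.
        mono = waveform.mean(dim=0)
        # Per-example augmentation, train split only (placeholder random gain).
        if self.train:
            mono = mono * (0.9 + 0.2 * torch.rand(1, generator=gen))
        label = idx % 10  # fake digit label
        return mono, label


def pad_collate(batch):
    """Pad a list of (waveform, label) pairs to the longest waveform in the batch."""
    waveforms, labels = zip(*batch)
    lengths = torch.tensor([w.shape[0] for w in waveforms])
    # Zero-pad on the right so every row has the same length.
    padded = torch.nn.utils.rnn.pad_sequence(list(waveforms), batch_first=True)
    # Boolean mask: True where a position holds real audio, False where it is padding.
    mask = torch.arange(padded.shape[1])[None, :] < lengths[:, None]
    return padded, mask, torch.tensor(labels)


dataset = ToyAudioDataset(lengths=[8000, 6000, 7500, 5000])
loader = DataLoader(dataset, batch_size=2, shuffle=False, collate_fn=pad_collate)
padded, mask, labels = next(iter(loader))
```

In this sketch the train/test split (step 3) would happen outside both pieces, for example by constructing one dataset object per split with a different `train` flag. Truncating to a maximum length could go in either place; padding has to wait for the `collate_fn` because the target length depends on the other items in the batch.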