How to make the Trainer (transformers) load the data batch by batch during training?


I'm trying to train an LLM (mt5-XL) using the transformers library, but I keep getting the error:

torch.cuda.OutOfMemoryError: CUDA out of memory

Even though I have 80 GB of RAM and this model should only need about 48 GB according to https://huggingface.co/spaces/hf-accelerate/model-memory-usage.

So I figured this must be due to the space taken up by the data (100k query–document pairs), and I thought that loading the data batch by batch, instead of loading the whole dataset up front, would solve the problem.

This is how the code looks right now:

# the full JSON files are handed to the dataset constructors here
train_dataset = IndexingTrainDataset(path_to_data="path_to_train_dataset.json",
                                     max_length=256,
                                     cache_dir='cache',
                                     tokenizer=tokenizer)

valid_dataset = IndexingTrainDataset(path_to_data="path_to_dev_dataset.json",
                                     max_length=256,
                                     cache_dir='cache',
                                     remove_prompt=True,
                                     tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    data_collator=IndexingCollator(
        tokenizer,
        padding='longest',
    ),
    compute_metrics=make_compute_metrics(tokenizer, train_dataset.valid_ids),
    restrict_decode_vocab=restrict_decode_vocab,
    id_max_length=256
)

I don't know how to give the Trainer the path to the dataset and have it load the data batch by batch (if that's even possible).
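For what it's worth, this is roughly what I have in mind (a minimal sketch, not my actual code: `LazyIndexingDataset` and the `"query"`/`"document"` field names are made up, and I'm assuming the data is a JSON-lines file with one example per line):

```python
import json
from torch.utils.data import Dataset


class LazyIndexingDataset(Dataset):
    """Hypothetical dataset that only keeps byte offsets in memory
    and reads + tokenizes a single example in __getitem__."""

    def __init__(self, path_to_data, tokenizer, max_length=256):
        self.path = path_to_data
        self.tokenizer = tokenizer
        self.max_length = max_length
        # index the file once: remember the byte offset of every line
        self.offsets = []
        with open(self.path, "rb") as f:
            while True:
                offset = f.tell()
                line = f.readline()
                if not line:
                    break
                if line.strip():
                    self.offsets.append(offset)

    def __len__(self):
        return len(self.offsets)

    def __getitem__(self, idx):
        # read just one line from disk and tokenize it on the fly
        with open(self.path, "rb") as f:
            f.seek(self.offsets[idx])
            example = json.loads(f.readline())
        return self.tokenizer(
            example["query"],                 # assumed field name
            text_target=example["document"],  # assumed field name
            truncation=True,
            max_length=self.max_length,
        )
```

The return value would of course have to match whatever IndexingCollator expects. My understanding is that the Trainer just indexes whatever Dataset it is given, so if __getitem__ only touches one example at a time, only the current batch should end up tokenized in memory. I've also seen that datasets.load_dataset("json", data_files=..., streaming=True) returns an iterable dataset. Is something like this the right approach, or is there a built-in way to point the Trainer at a file and let it do the batching?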
