Model not being executed on multiple GPUs when using Hugging Face Seq2SeqTrainer with accelerate


I have a setup with a single node having 8 A100 GPUs.

  • python = 3.8.10
  • torch==2.0.1+cu117
  • accelerate==0.26.1
  • CUDA Version: 11.7

I am using AutoModelForSeq2SeqLM to load a model for fine-tuning and training it with Seq2SeqTrainer. The dataset is copied to multiple GPUs, but the model is not (as seen from the memory usage reported by nvidia-smi). Could someone please explain what I am missing for DDP? The logs show 8 different losses being calculated during validation, yet I cannot figure out why the computation is not being shared: the GPU utilization of 7 of the 8 GPUs stays at zero.
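As I understand it, DDP gives every rank a full copy of the model, which is what I expected to see in nvidia-smi. This is a minimal sketch of the plain DDP-style load I had in mind, in contrast to my actual code below (train.py is a placeholder script name):

import torch
from transformers import AutoModelForSeq2SeqLM

# No device_map: each process loads a full copy of the weights, and Trainer
# wraps the model in DistributedDataParallel when the script is started with
# a distributed launcher, e.g. torchrun --nproc_per_node=8 train.py
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_checkpoint,              # same checkpoint variable as in the code below
    torch_dtype=torch.bfloat16,
)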

Code:

import torch
from transformers import (
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Load the checkpoint in bfloat16 and let accelerate decide device placement.
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_checkpoint,
    cache_dir=cache_dir,
    torch_dtype=torch.bfloat16,
    device_map='auto',
)

num_gpus = torch.cuda.device_count()

args = Seq2SeqTrainingArguments(
    output_dir="/outdir",
    evaluation_strategy="epoch",
    learning_rate=learning_rate,
    # split the global batch size evenly across the GPUs
    per_device_train_batch_size=batch_size // num_gpus,
    per_device_eval_batch_size=batch_size // num_gpus,
    weight_decay=0.01,
    save_total_limit=1,
    num_train_epochs=num_train_epochs,
    predict_with_generate=True,
    logging_steps=logging_steps,
    push_to_hub=False,
)

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"].select(range(32)),
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
trainer.train()
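For reference, this is a quick check I can add to the script to see how much memory the current process has allocated on each device (a sketch; note that torch.cuda.memory_allocated only counts the calling process, so under multi-process DDP each rank would only see its own GPU):

for i in range(torch.cuda.device_count()):
    # memory allocated by this process's PyTorch caching allocator, in MiB
    allocated_mib = torch.cuda.memory_allocated(i) / 1024**2
    print(f"cuda:{i}: {allocated_mib:.0f} MiB allocated by this process")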

nvidia-smi output (screenshot not reproduced here)
