I have a single node with 8 A100 GPUs and the following environment:
- Python 3.8.10
- torch==2.0.1+cu117
- accelerate==0.26.1
- CUDA Version: 11.7
I am using AutoModelForSeq2SeqLM to load a model for fine-tuning with Seq2SeqTrainer. The dataset is copied to multiple GPUs, but the model does not appear to be copied (judging by the memory usage shown in nvidia-smi). Could someone please explain what I am missing for DDP? The logs show 8 different losses being calculated during validation, yet GPU utilization stays at zero on 7 of the GPUs, and I cannot work out why the computation is not being shared.
Code:
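import torch
from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainer, Seq2SeqTrainingArguments

model = AutoModelForSeq2SeqLM.from_pretrained(
    model_checkpoint,
    cache_dir=cache_dir,
    torch_dtype=torch.bfloat16,  # dtype object rather than the string 'torch.bfloat16'
    device_map='auto',
)
num_gpus = torch.cuda.device_count()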
args = Seq2SeqTrainingArguments(
    output_dir="/outdir",
    evaluation_strategy="epoch",
    learning_rate=learning_rate,
    per_device_train_batch_size=batch_size // num_gpus,
    per_device_eval_batch_size=batch_size // num_gpus,
    weight_decay=0.01,
    save_total_limit=1,
    num_train_epochs=num_train_epochs,
    predict_with_generate=True,
    logging_steps=logging_steps,
    push_to_hub=False,
)
trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"].select(range(32)),
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)
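
One thing that can be checked from inside the script is whether it was started as a single process or as one process per GPU, since DDP needs the latter. The sketch below is only a diagnostic under an assumption: that the script is started with a launcher such as torchrun or accelerate launch, which export WORLD_SIZE, RANK and LOCAL_RANK to each process. The helper names `launch_info` and `effective_batch_size` and the per-device value of 4 are hypothetical and not part of my original code. It also shows how the per-device batch size above relates to the effective batch size.

```python
import os
from typing import Mapping, Optional


def launch_info(env: Optional[Mapping[str, str]] = None) -> dict:
    """Summarize the distributed-launch variables visible to this process.

    Assumes a torchrun-style launcher that sets WORLD_SIZE, RANK and
    LOCAL_RANK; when they are absent, a single process is assumed.
    """
    env = os.environ if env is None else env
    world_size = int(env.get("WORLD_SIZE", "1"))
    return {
        "world_size": world_size,
        "rank": int(env.get("RANK", "0")),
        "local_rank": int(env.get("LOCAL_RANK", "0")),
        "distributed": world_size > 1,
    }


def effective_batch_size(per_device: int, world_size: int, grad_accum: int = 1) -> int:
    """Samples consumed per optimizer step across all data-parallel processes."""
    return per_device * world_size * grad_accum


if __name__ == "__main__":
    info = launch_info()
    print(info)
    # 4 is a hypothetical per-device batch size, used only for illustration.
    print("effective train batch:", effective_batch_size(4, info["world_size"]))
```

If the training script (say, a hypothetical train.py) is started with plain `python train.py`, this would report a world size of 1. Started with `torchrun --nproc_per_node=8 train.py`, each of the 8 processes would report a world size of 8 and its own rank.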
