I am using an HPC cluster with the LSF resource manager. The task is to train a TensorFlow model. The graphics queue gq (max jobs 96) has two hosts: host1 with 4 GPUs and host2 with 3 GPUs. I want to use all 7 GPUs together.
Below are the bsub command combinations I tried and their results:
1.
bsub -q gq -n 96 -gpu "num=7:mode=shared:j_exclusive=no" -W 6:00 -o output2.log -e error2.log "module load cuda; ./file.sh"
PENDING REASONS: There are no suitable hosts for the job
2.
bsub -q gq -n 96 -gpu "num=4:mode=shared:j_exclusive=no" -W 6:00 -o output2.log -e error2.log "module load cuda; ./file.sh"
PENDING REASONS: Not enough hosts to meet the job's spanning requirement;
3.
bsub -q gq -n 96 -gpu "num=3:mode=shared:j_exclusive=no" -W 6:00 -o output2.log -e error2.log "module load cuda; ./file.sh"
Running. 96 tasks started on 2 hosts, but only 3 GPUs from host1 are used.
4.
bsub -q gq -gpu "num=4:mode=shared:j_exclusive=no" -W 6:00 -o output2.log -e error2.log "module load cuda; ./file.sh"
Running. 1 task started on host1. All 4 GPUs from host1 are used.
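To see what LSF actually hands to the job in cases 3 and 4, a small diagnostic at the top of file.sh may help. This is only a sketch: it assumes LSF exports LSB_MCPU_HOSTS in the form "hostA slotsA hostB slotsB ..." and sets CUDA_VISIBLE_DEVICES for jobs submitted with -gpu. The helper name print_allocation is made up for illustration and is not an LSF command.

```shell
#!/bin/bash
# Diagnostic sketch: print the hosts, slots and GPUs that LSF gave this job.
# Assumption: LSB_MCPU_HOSTS looks like "hostA slotsA hostB slotsB ...".

# print_allocation is a hypothetical helper, not part of LSF.
print_allocation() {
    # Word-split the "host slots host slots ..." string into positional parameters.
    set -- $1
    while [ "$#" -ge 2 ]; do
        echo "$1: $2 slots"
        shift 2
    done
}

# Prints nothing when the variable is unset (e.g. outside an LSF job).
print_allocation "${LSB_MCPU_HOSTS:-}"

# Assumption: LSF sets CUDA_VISIBLE_DEVICES on the execution host for -gpu jobs.
echo "running on: $(hostname)"
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<unset>}"
```

As far as I understand, the job command itself is started only on the first execution host, so the last two lines describe that host alone. That may be related to case 3, where both hosts receive tasks but only host1's GPUs show activity.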
What should I do so that I can use all 7 GPUs and 96 tasks across both hosts?