Huggingface accelerate test error when num_processes>2


I am using a server with 8 GPUs. I tried to run accelerate test in my terminal, but I get the following error. Below is the full output.

[2024-02-23 13:23:19,151] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)

Running:  accelerate-launch /home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py
stdout: [2024-02-23 13:23:23,448] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
stdout: [2024-02-23 13:23:27,230] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
stdout: [2024-02-23 13:23:27,307] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
stdout: [2024-02-23 13:23:27,317] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
stdout: [2024-02-23 13:23:27,388] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
stdout: Distributed environment: MULTI_GPU  Backend: nccl
stdout: Num processes: 4
stdout: Process index: 3
stdout: Local process index: 3
stdout: Device: cuda:3
stdout: 
stdout: Mixed precision type: fp16
stdout: 
stdout: Distributed environment: MULTI_GPU  Backend: nccl
stdout: Num processes: 4
stdout: Process index: 2
stdout: Local process index: 2
stdout: Device: cuda:2
stdout: 
stdout: Mixed precision type: fp16
stdout: 
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: MULTI_GPU  Backend: nccl
stdout: Num processes: 4
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: cuda:0
stdout: 
stdout: Mixed precision type: fp16
stdout: 
stdout: 
stdout: **Test process execution**
stdout: Distributed environment: MULTI_GPU  Backend: nccl
stdout: Num processes: 4
stdout: Process index: 1
stdout: Local process index: 1
stdout: Device: cuda:1
stdout: 
stdout: Mixed precision type: fp16
stdout: 
stdout: 
stdout: **Test split between processes as a list**
stdout: 
stdout: **Test split between processes as a dict**
stderr: Traceback (most recent call last):
stderr:   File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 553, in <module>
stderr:     main()
stderr:   File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 523, in main
stderr:     test_split_between_processes_list()
stderr:   File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 459, in test_split_between_processes_list
stderr:     with state.split_between_processes(data, apply_padding=True) as results:
stderr:   File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/contextlib.py", line 119, in __enter__
stderr:     return next(self.gen)
stderr:   File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/state.py", line 854, in split_between_processes
stderr:     with PartialState().split_between_processes(inputs, apply_padding=apply_padding) as inputs:
stderr:   File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/contextlib.py", line 119, in __enter__
stderr:     return next(self.gen)
stderr:   File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/state.py", line 418, in split_between_processes
stderr:     yield _split_values(inputs, start_index, end_index)
stderr:   File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/state.py", line 409, in _split_values
stderr:     result += [result[-1]] * (num_samples_per_process - len(result))
stderr: IndexError: list index out of range
stdout: 
stdout: **Test split between processes as a tensor**
stderr: Traceback (most recent call last):
stderr:   File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 553, in <module>
stderr:     main()
stderr:   File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 527, in main
stderr:     test_split_between_processes_nested_dict()
stderr:   File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 479, in test_split_between_processes_nested_dict
stderr:     assert results["a"] == data_copy["a"][-1]
stderr: AssertionError
stdout: 
stdout: **Test random number generator synchronization**
stderr: Traceback (most recent call last):
stderr:   File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 553, in <module>
stderr:     main()
stderr:   File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 527, in main
stderr:     test_split_between_processes_nested_dict()
stderr:   File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 479, in test_split_between_processes_nested_dict
stderr:     assert results["a"] == data_copy["a"][-1]
stderr: AssertionError
stderr: WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 514870 closing signal SIGTERM
stderr: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 514871) of binary: /home/Althea/miniconda3/envs/timellm/bin/python
stderr: Traceback (most recent call last):
stderr:   File "/home/Althea/miniconda3/envs/timellm/bin/accelerate-launch", line 8, in <module>
stderr:     sys.exit(main())
stderr:   File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/commands/launch.py", line 947, in main
stderr:     launch_command(args)
stderr:   File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/commands/launch.py", line 932, in launch_command
stderr:     multi_gpu_launcher(args)
stderr:   File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/commands/launch.py", line 627, in multi_gpu_launcher
stderr:     distrib_run.run(args)
stderr:   File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
stderr:     elastic_launch(
stderr:   File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
stderr:     return launch_agent(self._config, self._entrypoint, list(args))
stderr:   File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
stderr:     raise ChildFailedError(
stderr: torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
stderr: ============================================================
stderr: /home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py FAILED
stderr: ------------------------------------------------------------
stderr: Failures:
stderr: [1]:
stderr:   time      : 2024-02-23_13:23:35
stderr:   host      : Server
stderr:   rank      : 2 (local_rank: 2)
stderr:   exitcode  : 1 (pid: 514872)
stderr:   error_file: <N/A>
stderr:   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: [2]:
stderr:   time      : 2024-02-23_13:23:35
stderr:   host      : Server
stderr:   rank      : 3 (local_rank: 3)
stderr:   exitcode  : 1 (pid: 514873)
stderr:   error_file: <N/A>
stderr:   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: ------------------------------------------------------------
stderr: Root Cause (first observed failure):
stderr: [0]:
stderr:   time      : 2024-02-23_13:23:35
stderr:   host      : Server
stderr:   rank      : 1 (local_rank: 1)
stderr:   exitcode  : 1 (pid: 514871)
stderr:   error_file: <N/A>
stderr:   traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: ============================================================
Traceback (most recent call last):
  File "/home/Althea/miniconda3/envs/timellm/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/commands/test.py", line 54, in test_command
    result = execute_subprocess_async(cmd, env=os.environ.copy())
  File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/testing.py", line 383, in execute_subprocess_async
    raise RuntimeError(
RuntimeError: 'accelerate-launch /home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py' failed with returncode 1

The combined stderr from workers follows:
Traceback (most recent call last):
  File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 553, in <module>
    main()
  File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 523, in main
    test_split_between_processes_list()
  File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 459, in test_split_between_processes_list
    with state.split_between_processes(data, apply_padding=True) as results:
  File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/contextlib.py", line 119, in __enter__
    return next(self.gen)
  File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/state.py", line 854, in split_between_processes
    with PartialState().split_between_processes(inputs, apply_padding=apply_padding) as inputs:
  File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/contextlib.py", line 119, in __enter__
    return next(self.gen)
  File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/state.py", line 418, in split_between_processes
    yield _split_values(inputs, start_index, end_index)
  File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/state.py", line 409, in _split_values
    result += [result[-1]] * (num_samples_per_process - len(result))
IndexError: list index out of range
Traceback (most recent call last):
  File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 553, in <module>
    main()
  File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 527, in main
    test_split_between_processes_nested_dict()
  File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 479, in test_split_between_processes_nested_dict
    assert results["a"] == data_copy["a"][-1]
AssertionError
Traceback (most recent call last):
  File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 553, in <module>
    main()
  File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 527, in main
    test_split_between_processes_nested_dict()
  File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 479, in test_split_between_processes_nested_dict
    assert results["a"] == data_copy["a"][-1]
AssertionError
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 514870 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 514871) of binary: /home/Althea/miniconda3/envs/timellm/bin/python
Traceback (most recent call last):
  File "/home/Althea/miniconda3/envs/timellm/bin/accelerate-launch", line 8, in <module>
    sys.exit(main())
  File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/commands/launch.py", line 947, in main
    launch_command(args)
  File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/commands/launch.py", line 932, in launch_command
    multi_gpu_launcher(args)
  File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/commands/launch.py", line 627, in multi_gpu_launcher
    distrib_run.run(args)
  File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-02-23_13:23:35
  host      : Server
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 514872)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time      : 2024-02-23_13:23:35
  host      : Server
  rank      : 3 (local_rank: 3)
  exitcode  : 1 (pid: 514873)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-02-23_13:23:35
  host      : Server
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 514871)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

This is my configuration file (default_config.yaml):

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 0,1,2,3,4,5,6,7
machine_rank: 0
main_training_function: main
main_process_port: 33827
mixed_precision: fp16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

Package versions: transformers 4.31.0, accelerate 0.20.3.

Setting num_processes in default_config.yaml to 1 or 2 works fine, but any value greater than 2 produces the error above.
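For reference, the IndexError in the traceback comes from the padding step result += [result[-1]] * (num_samples_per_process - len(result)), which can only fail when a process is handed an empty slice. Below is a minimal standalone sketch of that failure mode; it is an illustration under the assumption of ceil-based slicing, not accelerate's actual code, and the 5-item list is made up for the example:

import math

# Hypothetical re-implementation of the padding step shown in the traceback,
# for illustration only (not accelerate's actual code).
def split_with_padding(data, num_processes, process_index):
    num_samples_per_process = math.ceil(len(data) / num_processes)
    start_index = process_index * num_samples_per_process
    end_index = start_index + num_samples_per_process
    result = data[start_index:end_index]
    # Padding: repeat the last element so every rank gets the same length.
    # With an empty slice, result[-1] raises IndexError, as in the logs above.
    result += [result[-1]] * (num_samples_per_process - len(result))
    return result

# 5 items over 4 processes: rank 3 gets an empty slice and fails,
# while the same list over 1 or 2 processes splits without error.
for rank in range(4):
    try:
        print(rank, split_with_padding(list(range(5)), 4, rank))
    except IndexError:
        print(rank, "IndexError: list index out of range")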

1 Answer

Answer by Yaoming Xuan:

num_processes should equal the number of GPUs you have. Try setting everything back to the defaults first and see if it works. If it still fails, recheck your CUDA version, accelerate version, and hardware setup.
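As a quick sanity check (assuming PyTorch is installed in the same environment), you can confirm how many GPUs the process actually sees and set num_processes accordingly:

import torch

# Confirm the GPUs visible from this environment; num_processes in
# default_config.yaml should not exceed this count.
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())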