I am using a server with 8 GPUs. When I run accelerate test in my terminal, I get the error below. Here is the full output:
[2024-02-23 13:23:19,151] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Running: accelerate-launch /home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py
stdout: [2024-02-23 13:23:23,448] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
stdout: [2024-02-23 13:23:27,230] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
stdout: [2024-02-23 13:23:27,307] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
stdout: [2024-02-23 13:23:27,317] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
stdout: [2024-02-23 13:23:27,388] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
stdout: Distributed environment: MULTI_GPU Backend: nccl
stdout: Num processes: 4
stdout: Process index: 3
stdout: Local process index: 3
stdout: Device: cuda:3
stdout:
stdout: Mixed precision type: fp16
stdout:
stdout: Distributed environment: MULTI_GPU Backend: nccl
stdout: Num processes: 4
stdout: Process index: 2
stdout: Local process index: 2
stdout: Device: cuda:2
stdout:
stdout: Mixed precision type: fp16
stdout:
stdout: **Initialization**
stdout: Testing, testing. 1, 2, 3.
stdout: Distributed environment: MULTI_GPU Backend: nccl
stdout: Num processes: 4
stdout: Process index: 0
stdout: Local process index: 0
stdout: Device: cuda:0
stdout:
stdout: Mixed precision type: fp16
stdout:
stdout:
stdout: **Test process execution**
stdout: Distributed environment: MULTI_GPU Backend: nccl
stdout: Num processes: 4
stdout: Process index: 1
stdout: Local process index: 1
stdout: Device: cuda:1
stdout:
stdout: Mixed precision type: fp16
stdout:
stdout:
stdout: **Test split between processes as a list**
stdout:
stdout: **Test split between processes as a dict**
stderr: Traceback (most recent call last):
stderr: File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 553, in <module>
stderr: main()
stderr: File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 523, in main
stderr: test_split_between_processes_list()
stderr: File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 459, in test_split_between_processes_list
stderr: with state.split_between_processes(data, apply_padding=True) as results:
stderr: File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/contextlib.py", line 119, in __enter__
stderr: return next(self.gen)
stderr: File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/state.py", line 854, in split_between_processes
stderr: with PartialState().split_between_processes(inputs, apply_padding=apply_padding) as inputs:
stderr: File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/contextlib.py", line 119, in __enter__
stderr: return next(self.gen)
stderr: File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/state.py", line 418, in split_between_processes
stderr: yield _split_values(inputs, start_index, end_index)
stderr: File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/state.py", line 409, in _split_values
stderr: result += [result[-1]] * (num_samples_per_process - len(result))
stderr: IndexError: list index out of range
stdout:
stdout: **Test split between processes as a tensor**
stderr: Traceback (most recent call last):
stderr: File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 553, in <module>
stderr: main()
stderr: File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 527, in main
stderr: test_split_between_processes_nested_dict()
stderr: File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 479, in test_split_between_processes_nested_dict
stderr: assert results["a"] == data_copy["a"][-1]
stderr: AssertionError
stdout:
stdout: **Test random number generator synchronization**
stderr: Traceback (most recent call last):
stderr: File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 553, in <module>
stderr: main()
stderr: File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 527, in main
stderr: test_split_between_processes_nested_dict()
stderr: File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 479, in test_split_between_processes_nested_dict
stderr: assert results["a"] == data_copy["a"][-1]
stderr: AssertionError
stderr: WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 514870 closing signal SIGTERM
stderr: ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 514871) of binary: /home/Althea/miniconda3/envs/timellm/bin/python
stderr: Traceback (most recent call last):
stderr: File "/home/Althea/miniconda3/envs/timellm/bin/accelerate-launch", line 8, in <module>
stderr: sys.exit(main())
stderr: File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/commands/launch.py", line 947, in main
stderr: launch_command(args)
stderr: File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/commands/launch.py", line 932, in launch_command
stderr: multi_gpu_launcher(args)
stderr: File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/commands/launch.py", line 627, in multi_gpu_launcher
stderr: distrib_run.run(args)
stderr: File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
stderr: elastic_launch(
stderr: File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
stderr: return launch_agent(self._config, self._entrypoint, list(args))
stderr: File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
stderr: raise ChildFailedError(
stderr: torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
stderr: ============================================================
stderr: /home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py FAILED
stderr: ------------------------------------------------------------
stderr: Failures:
stderr: [1]:
stderr: time : 2024-02-23_13:23:35
stderr: host : Server
stderr: rank : 2 (local_rank: 2)
stderr: exitcode : 1 (pid: 514872)
stderr: error_file: <N/A>
stderr: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: [2]:
stderr: time : 2024-02-23_13:23:35
stderr: host : Server
stderr: rank : 3 (local_rank: 3)
stderr: exitcode : 1 (pid: 514873)
stderr: error_file: <N/A>
stderr: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: ------------------------------------------------------------
stderr: Root Cause (first observed failure):
stderr: [0]:
stderr: time : 2024-02-23_13:23:35
stderr: host : Server
stderr: rank : 1 (local_rank: 1)
stderr: exitcode : 1 (pid: 514871)
stderr: error_file: <N/A>
stderr: traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
stderr: ============================================================
Traceback (most recent call last):
File "/home/Althea/miniconda3/envs/timellm/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/commands/test.py", line 54, in test_command
result = execute_subprocess_async(cmd, env=os.environ.copy())
File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/testing.py", line 383, in execute_subprocess_async
raise RuntimeError(
RuntimeError: 'accelerate-launch /home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py' failed with returncode 1
The combined stderr from workers follows:
Traceback (most recent call last):
File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 553, in <module>
main()
File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 523, in main
test_split_between_processes_list()
File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 459, in test_split_between_processes_list
with state.split_between_processes(data, apply_padding=True) as results:
File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/contextlib.py", line 119, in __enter__
return next(self.gen)
File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/state.py", line 854, in split_between_processes
with PartialState().split_between_processes(inputs, apply_padding=apply_padding) as inputs:
File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/contextlib.py", line 119, in __enter__
return next(self.gen)
File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/state.py", line 418, in split_between_processes
yield _split_values(inputs, start_index, end_index)
File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/state.py", line 409, in _split_values
result += [result[-1]] * (num_samples_per_process - len(result))
IndexError: list index out of range
Traceback (most recent call last):
File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 553, in <module>
main()
File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 527, in main
test_split_between_processes_nested_dict()
File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 479, in test_split_between_processes_nested_dict
assert results["a"] == data_copy["a"][-1]
AssertionError
Traceback (most recent call last):
File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 553, in <module>
main()
File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 527, in main
test_split_between_processes_nested_dict()
File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py", line 479, in test_split_between_processes_nested_dict
assert results["a"] == data_copy["a"][-1]
AssertionError
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 514870 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 514871) of binary: /home/Althea/miniconda3/envs/timellm/bin/python
Traceback (most recent call last):
File "/home/Althea/miniconda3/envs/timellm/bin/accelerate-launch", line 8, in <module>
sys.exit(main())
File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/commands/launch.py", line 947, in main
launch_command(args)
File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/commands/launch.py", line 932, in launch_command
multi_gpu_launcher(args)
File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/commands/launch.py", line 627, in multi_gpu_launcher
distrib_run.run(args)
File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/Althea/miniconda3/envs/timellm/lib/python3.9/site-packages/accelerate/test_utils/scripts/test_script.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2024-02-23_13:23:35
host : Server
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 514872)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2024-02-23_13:23:35
host : Server
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 514873)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-02-23_13:23:35
host : Server
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 514871)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
This is my configuration file (default_config.yaml):
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 0,1,2,3,4,5,6,7
machine_rank: 0
main_training_function: main
main_process_port: 33827
mixed_precision: fp16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Package versions: transformers 4.31.0, accelerate 0.20.3
I tried setting num_processes in default_config.yaml to 1 or 2 and it works fine, but with any value greater than 2 the above error occurs.
num_processes should be equal to the number of GPUs you want to use. Try resetting everything to the defaults first and see whether the test passes. If not, recheck your CUDA version, your accelerate version, and your hardware setup.
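For example, here is a minimal sketch of an adjusted default_config.yaml, assuming you want one process per GPU across all 8 GPUs (all other values are copied from your posted file):

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 0,1,2,3,4,5,6,7
machine_rank: 0
main_training_function: main
main_process_port: 33827
mixed_precision: fp16
num_machines: 1
num_processes: 8  # assumption: one process per GPU listed in gpu_ids
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

If you would rather keep num_processes: 4, set gpu_ids to only the four GPUs you intend to use (for example 0,1,2,3) so the two settings stay consistent. You can also regenerate a clean default configuration with accelerate config default (or interactively with accelerate config) and then rerun accelerate test; accelerate env is handy for double-checking which versions and config file the command actually picks up.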