Getting "INVALID_ARGUMENT: Could not find SliceBuilder port 8477 in any of the 4 ports provided..." on a fresh TPU-v2-8 VM

I am working on distributed inference on TPUs. I created a fresh TPU v2-8 VM and tried to run my PyTorch/XLA code on it.

VM creation command:

gcloud compute tpus tpu-vm create <vm-name> \
    --project <project-id> \
    --zone=us-central1-f \
    --accelerator-type=v2-8 \
    --version=tpu-ubuntu2204-base \
    --data-disk source=<disk-config>

Running my code aborts with "Could not set metric server port: INVALID_ARGUMENT: Could not find SliceBuilder port 8477 in any of the 4 ports provided in..."

The full stack trace is given below:

WARNING:root:PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
WARNING:root:libtpu.so and TPU device found. Setting PJRT_DEVICE=TPU.
https://symbolize.stripped_domain/r/?trace=7fb561a969fc,7fb561a4251f&map= 
*** SIGABRT received by PID 293850 (TID 293850) on cpu 35 from PID 293850; stack trace: ***
PC: @     0x7fb561a969fc  (unknown)  pthread_kill
    @     0x7fb4045aa53a       1152  (unknown)
    @     0x7fb561a42520  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=7fb561a969fc,7fb4045aa539,7fb561a4251f&map=abbd016d9542b8098892badc0b19ea68:7fb3f7400000-7fb4047becf0 
E0119 08:55:08.340662  293850 coredump_hook.cc:447] RAW: Remote crash data gathering hook invoked.
E0119 08:55:08.340686  293850 coredump_hook.cc:486] RAW: Skipping coredump since rlimit was 0 at process start.
E0119 08:55:08.340695  293850 client.cc:272] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0119 08:55:08.340702  293850 coredump_hook.cc:542] RAW: Sending fingerprint to remote end.
E0119 08:55:08.340735  293850 coredump_hook.cc:551] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0119 08:55:08.340758  293850 coredump_hook.cc:603] RAW: Dumping core locally.
WARNING: Logging before InitGoogle() is written to STDERR
E0000 00:00:1705654508.342848  293845 common_lib.cc:822] Could not set metric server port: INVALID_ARGUMENT: Could not find SliceBuilder port 8477 in any of the 4 ports provided in `tpu_process_addresses`=localhost:8476,localhost:8477,localhost:8478,localhost:8479.
=== Source Location Trace: === 
learning/45eac/tfrc/runtime/common_lib.cc:539
WARNING: Logging before InitGoogle() is written to STDERR
E0000 00:00:1705654508.361102  293849 common_lib.cc:822] Could not set metric server port: INVALID_ARGUMENT: Could not find SliceBuilder port 8478 in any of the 4 ports provided in `tpu_process_addresses`=localhost:8476,localhost:8477,localhost:8478,localhost:8479.
=== Source Location Trace: === 
learning/45eac/tfrc/runtime/common_lib.cc:539
WARNING: Logging before InitGoogle() is written to STDERR
E0000 00:00:1705654508.408523  293848 common_lib.cc:822] Could not set metric server port: INVALID_ARGUMENT: Could not find SliceBuilder port 8479 in any of the 4 ports provided in `tpu_process_addresses`=localhost:8476,localhost:8477,localhost:8478,localhost:8479.
=== Source Location Trace: === 
learning/45eac/tfrc/runtime/common_lib.cc:539
https://symbolize.stripped_domain/r/?trace=7f31902969fc,7f319024251f&map= 
*** SIGABRT received by PID 293845 (TID 293845) on cpu 56 from PID 293845; stack trace: ***
PC: @     0x7f31902969fc  (unknown)  pthread_kill
    @     0x7f3032faa53a       1152  (unknown)
    @     0x7f3190242520  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=7f31902969fc,7f3032faa539,7f319024251f&map=abbd016d9542b8098892badc0b19ea68:7f3025e00000-7f30331becf0 
E0119 08:55:08.431729  293845 coredump_hook.cc:447] RAW: Remote crash data gathering hook invoked.
E0119 08:55:08.431761  293845 coredump_hook.cc:486] RAW: Skipping coredump since rlimit was 0 at process start.
E0119 08:55:08.431793  293845 client.cc:272] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0119 08:55:08.431808  293845 coredump_hook.cc:542] RAW: Sending fingerprint to remote end.
E0119 08:55:08.431843  293845 coredump_hook.cc:551] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0119 08:55:08.431855  293845 coredump_hook.cc:603] RAW: Dumping core locally.
https://symbolize.stripped_domain/r/?trace=7f611b8969fc,7f611b84251f&map= 
*** SIGABRT received by PID 293849 (TID 293849) on cpu 21 from PID 293849; stack trace: ***
PC: @     0x7f611b8969fc  (unknown)  pthread_kill
    @     0x7f5fbe3aa53a       1152  (unknown)
    @     0x7f611b842520  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=7f611b8969fc,7f5fbe3aa539,7f611b84251f&map=abbd016d9542b8098892badc0b19ea68:7f5fb1200000-7f5fbe5becf0 
E0119 08:55:08.449987  293849 coredump_hook.cc:447] RAW: Remote crash data gathering hook invoked.
E0119 08:55:08.450014  293849 coredump_hook.cc:486] RAW: Skipping coredump since rlimit was 0 at process start.
E0119 08:55:08.450029  293849 client.cc:272] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0119 08:55:08.450041  293849 coredump_hook.cc:542] RAW: Sending fingerprint to remote end.
E0119 08:55:08.450085  293849 coredump_hook.cc:551] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0119 08:55:08.450100  293849 coredump_hook.cc:603] RAW: Dumping core locally.
https://symbolize.stripped_domain/r/?trace=7faeabc969fc,7faeabc4251f&map= 
*** SIGABRT received by PID 293848 (TID 293848) on cpu 21 from PID 293848; stack trace: ***
PC: @     0x7faeabc969fc  (unknown)  pthread_kill
    @     0x7fad4e7aa53a       1152  (unknown)
    @     0x7faeabc42520  (unknown)  (unknown)
https://symbolize.stripped_domain/r/?trace=7faeabc969fc,7fad4e7aa539,7faeabc4251f&map=abbd016d9542b8098892badc0b19ea68:7fad41600000-7fad4e9becf0 
E0119 08:55:08.498424  293848 coredump_hook.cc:447] RAW: Remote crash data gathering hook invoked.
E0119 08:55:08.498443  293848 coredump_hook.cc:486] RAW: Skipping coredump since rlimit was 0 at process start.
E0119 08:55:08.498452  293848 client.cc:272] RAW: Coroner client retries enabled (b/136286901), will retry for up to 30 sec.
E0119 08:55:08.498460  293848 coredump_hook.cc:542] RAW: Sending fingerprint to remote end.
E0119 08:55:08.498490  293848 coredump_hook.cc:551] RAW: Cannot send fingerprint to Coroner: [NOT_FOUND] stat failed on crash reporting socket /var/google/services/logmanagerd/remote_coredump.socket (Is the listener running?): No such file or directory
E0119 08:55:08.498510  293848 coredump_hook.cc:603] RAW: Dumping core locally.
E0119 08:55:08.597408  293850 process_state.cc:783] RAW: Raising signal 6 with default behavior
E0119 08:55:08.693804  293845 process_state.cc:783] RAW: Raising signal 6 with default behavior
E0119 08:55:08.711156  293849 process_state.cc:783] RAW: Raising signal 6 with default behavior
E0119 08:55:08.749858  293848 process_state.cc:783] RAW: Raising signal 6 with default behavior
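
One thing that stands out in the trace: the port named in each error (8477, 8478, 8479) does appear in the `tpu_process_addresses` list printed in the same line. Below is a minimal sketch that checks this mechanically for the first error line. It assumes the log line has exactly the format shown above, and `parse_port_error` is a hypothetical helper written for this post, not part of torch_xla or libtpu.

```python
import re

# Error line copied from the trace above (first failing process).
LOG_LINE = (
    "Could not set metric server port: INVALID_ARGUMENT: Could not find "
    "SliceBuilder port 8477 in any of the 4 ports provided in "
    "`tpu_process_addresses`=localhost:8476,localhost:8477,"
    "localhost:8478,localhost:8479."
)


def parse_port_error(line):
    """Hypothetical helper: return (requested_port, listed_ports) from the line."""
    # Port the runtime says it could not find.
    requested = int(re.search(r"SliceBuilder port (\d+)", line).group(1))
    # Comma-separated host:port list that follows `tpu_process_addresses`=.
    addresses = re.search(r"`tpu_process_addresses`=(\S+)", line).group(1)
    listed = [int(port) for port in re.findall(r":(\d+)", addresses)]
    return requested, listed


requested, listed = parse_port_error(LOG_LINE)
print("requested port:", requested)
print("listed ports:", listed)
print("requested port is in the list:", requested in listed)
```

Since the requested port is present in the list, the message does not look like a plain missing-port mismatch. This is only an observation about the log text, not a diagnosis of the underlying cause.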

Even a minimal script like the one below does not work:

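import torch_xla.distributed.xla_multiprocessing as xmp


def _mp_fn(index):
    # Intentionally empty worker; index is the process ordinal passed in by xmp.spawn.
    pass


if __name__ == "__main__":
    # Launch the worker processes using the 'spawn' start method.
    xmp.spawn(_mp_fn, start_method='spawn')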

How should I fix this?
