MPI hello_world to test InfiniBand


I have a virtual machine with a passthrough InfiniBand NIC. I am testing InfiniBand functionality using a hello world program. I am new to this world, so I may need help understanding the following errors.

I installed Open MPI on Ubuntu using apt-get.
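From memory, the install command was roughly the following (the exact package names may differ):

sudo apt-get update
sudo apt-get install openmpi-bin libopenmpi-dev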

spatel@ib-1:~$ mpirun -V
mpirun (Open MPI) 4.0.3

InfiniBand NIC

spatel@ib-1:~$ lspci -nn | grep -i mell
00:05.0 Infiniband controller [0207]: Mellanox Technologies MT28908 Family [ConnectX-6 Virtual Function] [15b3:101c]

My hello world program

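It is the standard MPI hello world; I am reconstructing it here from memory based on the output format, so details may differ slightly from what is on disk:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    // Initialize the MPI environment
    MPI_Init(&argc, &argv);

    // Number of processes and this process's rank
    int world_size;
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);
    int world_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // Name of the node this rank runs on
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    int name_len;
    MPI_Get_processor_name(processor_name, &name_len);

    printf("Hello world from processor %s, rank %d out of %d processors\n",
           processor_name, world_rank, world_size);

    MPI_Finalize();
    return 0;
}

Compiled with:

mpicc -o mpi_hello_world mpi_hello_world.c
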
spatel@ib-1:~$ mpirun -np 2 ./mpi_hello_world
--------------------------------------------------------------------------
WARNING: No preset parameters were found for the device that Open MPI
detected:

  Local host:            ib-1
  Device name:           mlx5_0
  Device vendor ID:      0x02c9
  Device vendor part ID: 4124

Default device parameters will be used, which may result in lower
performance.  You can edit any of the files specified by the
btl_openib_device_param_files MCA parameter to set values for your
device.

NOTE: You can turn off this warning by setting the MCA parameter
      btl_openib_warn_no_device_params_found to 0.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
By default, for Open MPI 4.0 and later, infiniband ports on a device
are not used by default.  The intent is to use UCX for these devices.
You can override this policy by setting the btl_openib_allow_ib MCA parameter
to true.

  Local host:              ib-1
  Local adapter:           mlx5_0
  Local port:              1

--------------------------------------------------------------------------
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   ib-1
  Local device: mlx5_0
--------------------------------------------------------------------------
--------------------------------------------------------------------------
Open MPI failed an OFI Libfabric library call (fi_endpoint).  This is highly
unusual; your job may behave unpredictably (and/or abort) after this.

  Local host: ib-1
  Location: mtl_ofi_component.c:629
  Error: Unspecified error (256)
--------------------------------------------------------------------------
Hello world from processor ib-1, rank 0 out of 2 processors
Hello world from processor ib-1, rank 1 out of 2 processors
[ib-1:65704] 1 more process has sent help message help-mpi-btl-openib.txt / no device params found
[ib-1:65704] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[ib-1:65704] 1 more process has sent help message help-mpi-btl-openib.txt / ib port not selected
[ib-1:65704] 1 more process has sent help message help-mpi-btl-openib.txt / error in device init
[ib-1:65704] 1 more process has sent help message help-mtl-ofi.txt / OFI call fail

It throws a bunch of warnings and errors, so I am not sure what to make of them. Does it actually use the IB interface to run this job?
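Based on those messages, my understanding is that I could either re-enable the old openib BTL or tell Open MPI to use UCX, for example something like this (I have not verified these are the right knobs for my build):

mpirun --mca btl_openib_allow_ib true -np 2 ./mpi_hello_world
# or, if UCX support is compiled in:
mpirun --mca pml ucx -np 2 ./mpi_hello_world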

UPDATE

As suggested by @Gilles Gouaillardet in the comments, I compiled Open MPI with UCX, and now I see the following output when running the hello_world program:
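The UCX-enabled build was configured roughly like this (flags from memory, so this is approximate):

./configure --prefix=/home/spatel/ompi --with-ucx=/usr
make -j $(nproc) && make install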

spatel@ib-1:~$ /home/spatel/ompi/bin/mpirun -np 2 ./hello_world_ucx --mca opal_common_ucx_opal_mem_hooks 1
--------------------------------------------------------------------------
PMIx was unable to find a usable compression library
on the system. We will therefore be unable to compress
large data streams. This may result in longer-than-normal
startup times and larger memory footprints. We will
continue, but strongly recommend installing zlib or
a comparable compression library for better user experience.

You can suppress this warning by adding "pcompress_base_silence_warning=1"
to your PMIx MCA default parameter file, or by adding
"PMIX_MCA_pcompress_base_silence_warning=1" to your environment.
--------------------------------------------------------------------------

Hello world from processor ib-1, rank 0 out of 2 processors
Hello world from processor ib-1, rank 1 out of 2 processors

Now, to test my InfiniBand network, I created a similar second VM, ib-2, with an InfiniBand NIC, to see hello_world use RDMA for communication:

/home/spatel/ompi/bin/mpirun --host ib-1,ib-2 -np 2 ./hello_world_ucx --mca opal_common_ucx_opal_mem_hooks 1

At the same time I ran tcpdump on the ibs5 interface, which is my InfiniBand NIC, but I see no activity there and notice the MPI messages still go over the traditional NIC eth0. How do I make sure it uses only InfiniBand for MPI? (I don't have any IP configured on the IB NIC.)
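From the UCX documentation it looks like the network device can be pinned with an environment variable, so I was planning to try something along these lines (untested; the device name mlx5_0:1 is taken from the warning output above):

/home/spatel/ompi/bin/mpirun --host ib-1,ib-2 -np 2 \
    --mca pml ucx -x UCX_NET_DEVICES=mlx5_0:1 ./hello_world_ucx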
