I tested the execution of a simple inter-node job between two nodes over our InfiniBand network with updates 5, 6 and 7 of Intel MPI v2019, and the results differ significantly between releases. All tests were carried out with iccifort/2020.1.217 as the base of the toolchain.
Characteristics of the testing system
- CPU: 2x Intel(R) Xeon(R) Gold 6126
- Adapter: Mellanox Technologies MT27700 Family [ConnectX-4]
- Operating System: CentOS 7.7
- Related system libraries: UCX v1.5.1, OFED v4.7-3.2.9
- ICC: v2020.1 (from EasyBuild)
- Resource manager: Torque
Steps to reproduce:
- Start a job on two nodes
- Load impi
- mpicc ${EBROOTIMPI}/test/test.c -o test
- mpirun ./test
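Since our resource manager is Torque, a minimal job script covering these steps could look as follows (a sketch: the resource request and walltime are assumptions; we ran one process on each of two nodes):

#!/bin/bash
#PBS -l nodes=2:ppn=1
#PBS -l walltime=00:10:00
# Run from the directory the job was submitted from
cd $PBS_O_WORKDIR
# Load one of the impi modules under test, e.g. update 5
module load impi/2019.5.281-iccifort-2020.1.217
# Compile the hello-world test shipped with Intel MPI and run it across both nodes
mpicc ${EBROOTIMPI}/test/test.c -o test
mpirun ./test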
Intel MPI v2019 update 5: works out of the box
$ module load impi/2019.5.281-iccifort-2020.1.217
$ fi_info --version
fi_info: 1.7.2a
libfabric: 1.7.2a
libfabric api: 1.7
$ fi_info | grep provider
provider: verbs;ofi_rxm
provider: verbs;ofi_rxd
provider: verbs
provider: verbs
provider: verbs
$ mpirun ./test
Hello world: rank 0 of 2 running on node357.hydra.os
Hello world: rank 1 of 2 running on node356.hydra.os
Intel MPI v2019 update 6: does NOT work out of the box, but can be fixed
$ module load impi/2019.6.166-iccifort-2020.1.217
$ fi_info --version
fi_info: 1.9.0a1
libfabric: 1.9.0a1-impi
libfabric api: 1.8
$ fi_info | grep provider
provider: mlx
provider: mlx;ofi_rxm
$ mpirun ./test
[1585832682.960816] [node357:302190:0] select.c:406 UCX ERROR no active messages transport to <no debug data>: self/self - Destination is unreachable, rdmacm/sockaddr - no am bcopy, mm/sysv - Destination is unreachable, mm/posix - Destination is unreachable, cma/cma - no am bcopy
Abort(1091471) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(703)........:
MPID_Init(958)...............:
MPIDI_OFI_mpi_init_hook(1382): OFI get address vector map failed
- Solution 1: use the verbs or tcp libfabric providers instead of mlx
$ module load impi/2019.6.166-iccifort-2020.1.217
$ FI_PROVIDER=verbs,tcp mpirun ./test
Hello world: rank 0 of 2 running on node357.hydra.os
Hello world: rank 1 of 2 running on node356.hydra.os
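The FI_PROVIDER override can also be exported once in the job script so that every subsequent mpirun picks it up (a sketch of the same workaround; Hydra propagates the environment to the ranks):

export FI_PROVIDER=verbs,tcp
mpirun ./test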
- Solution 2: keep the mlx provider, but for us it only works with UCX v1.7 (available in EasyBuild)
$ module load impi/2019.6.166-iccifort-2020.1.217
$ module load UCX/1.7.0-GCCcore-9.3.0
$ ucx_info -v
# UCT version=1.7.0 revision
# configured with: --prefix=/user/brussel/101/vsc10122/.local/easybuild/software/UCX/1.7.0-GCCcore-9.3.0 --build=x86_64-pc-linux-gnu --host=x86_64-pc-linux-gnu --enable-optimizations --enable-cma --enable-mt --with-verbs --without-java --disable-doxygen-doc
$ FI_PROVIDER=mlx mpirun ./test
Hello world: rank 0 of 2 running on node357.hydra.os
Hello world: rank 1 of 2 running on node356.hydra.os
- Solution 3: use an external libfabric v1.9.1 (upstream libfabric dropped the mlx provider with version 1.9.0). Clearing FI_PROVIDER_PATH below stops libfabric from loading the provider plugins bundled with Intel MPI.
$ module load impi/2019.6.166-iccifort-2020.1.217
$ module load libfabric/1.9.1-GCCcore-9.3.0
$ export FI_PROVIDER_PATH=
$ fi_info --version
fi_info: 1.9.1
libfabric: 1.9.1
libfabric api: 1.9
$ mpirun ./test
Hello world: rank 0 of 2 running on node357.hydra.os
Hello world: rank 1 of 2 running on node356.hydra.os
Intel MPI v2019 update 7: does NOT work at all
$ module load impi/2019.7.217-iccifort-2020.1.217
$ fi_info --version
fi_info: 1.10.0a1
libfabric: 1.10.0a1-impi
libfabric api: 1.9
$ fi_info | grep provider
provider: verbs;ofi_rxm
[...]
provider: tcp;ofi_rxm
[...]
provider: verbs
[...]
provider: tcp
[...]
provider: sockets
[...]
$ I_MPI_DEBUG=4 I_MPI_HYDRA_DEBUG=on FI_LOG_LEVEL=debug mpirun ./test
[mpiexec@node357.hydra.os] Launch arguments: /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin//hydra_bstrap_proxy --upstream-host node357.hydra.brussel.vsc --upstream-port 40969 --pgid 0 --launcher ssh --launcher-number 0 --base-path /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 --debug --proxy-id 0 --node-id 0 --subtree-size 1 --upstream-fd 7 /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
[mpiexec@node357.hydra.os] Launch arguments: /usr/bin/ssh -q -x node356.hydra.brussel.vsc /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin//hydra_bstrap_proxy --upstream-host node357.hydra.brussel.vsc --upstream-port 40969 --pgid 0 --launcher ssh --launcher-number 0 --base-path /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin/ --tree-width 16 --tree-level 1 --time-left -1 --collective-launch 1 --debug --proxy-id 1 --node-id 1 --subtree-size 1 /user/brussel/101/vsc10122/.local/easybuild/software/impi/2019.7.217-iccifort-2020.1.217/intel64/bin//hydra_pmi_proxy --usize -1 --auto-cleanup 1 --abort-signal 9
[proxy:0:0@node357.hydra.os] Warning - oversubscription detected: 1 processes will be placed on 0 cores
[proxy:0:1@node356.hydra.os] pmi cmd from fd 4: cmd=init pmi_version=1 pmi_subversion=1
[proxy:0:1@node356.hydra.os] PMI response: cmd=response_to_init pmi_version=1 pmi_subversion=1 rc=0
[proxy:0:1@node356.hydra.os] pmi cmd from fd 4: cmd=get_maxes
[proxy:0:1@node356.hydra.os] PMI response: cmd=maxes kvsname_max=256 keylen_max=64 vallen_max=4096
[proxy:0:1@node356.hydra.os] pmi cmd from fd 4: cmd=get_appnum
[proxy:0:1@node356.hydra.os] PMI response: cmd=appnum appnum=0
[proxy:0:1@node356.hydra.os] pmi cmd from fd 4: cmd=get_my_kvsname
[proxy:0:1@node356.hydra.os] PMI response: cmd=my_kvsname kvsname=kvs_309778_0
[proxy:0:1@node356.hydra.os] pmi cmd from fd 4: cmd=get kvsname=kvs_309778_0 key=PMI_process_mapping
[proxy:0:1@node356.hydra.os] PMI response: cmd=get_result rc=0 msg=success value=(vector,(0,2,1))
[proxy:0:1@node356.hydra.os] pmi cmd from fd 4: cmd=barrier_in
(the execution does not abort; it just hangs at this point)
The system log of the node shows the following entry:
traps: hydra_pmi_proxy[549] trap divide error ip:4436ed sp:7ffed012ef50 error:0 in hydra_pmi_proxy[400000+ab000]
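One way to look for this kernel trap message on the affected node (a sketch; on CentOS 7 the message may also end up in /var/log/messages):

dmesg -T | grep hydra_pmi_proxy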
This error with Intel MPI v2019.7 occurs before libfabric is even initialized, so it does not depend on the libfabric provider or the UCX version, and it happens on every run. The oversubscription warning above ("1 processes will be placed on 0 cores") suggests that the divide error in hydra_pmi_proxy may stem from its core-count detection.
Update