# EFA Cheatsheet

## 1. Settings via environment variables

For optimized performance, you may need to set additional environment variables, depending on your libfabric version.

| Setting | Description |
|---|---|
| `FI_EFA_USE_HUGE_PAGE=0` | Set to 0 when you see `os.fork()` fail with `OSError: Cannot allocate memory`, typically caused by multi-process PyTorch data loaders. Disabling huge pages incurs a minor performance hit, but it prevents fork failures when the operating system runs out of huge pages. |
| `FI_EFA_FORK_SAFE=1` | Not needed anymore. It used to be needed for kernel<5.15 (see this). However, all reasonably recent versions of the plugin (since at least v1.2, and probably even older) always set this flag for supported versions of Libfabric, regardless of kernel version. |
| `FI_EFA_USE_DEVICE_RDMA=1` | Do not set for libfabric>=1.18.0 and aws-ofi-nccl>=1.7.0. Setting it on p4/p5 instances with the newer software is not harmful, but it is unnecessary. |
| `FI_EFA_ENABLE_SHM_TRANSFER=1` | Not needed. Libfabric disables SHM transfer when the application sets `FI_OPT_CUDA_API_PERMITTED` to false, which this plugin does (see the discussion here). |
| `FI_PROVIDER=efa` | Use for aws-ofi-nccl<=1.5.0 AND EFA-enabled GPU instances. |
| `NCCL_PROTO=simple` | Use for aws-ofi-nccl<=1.5.0 AND EFA-enabled GPU instances. |
| `NCCL_SOCKET_NTHREADS` | Not applicable for EFA. |
| `NCCL_NSOCKS_PERTHREAD` | Not applicable for EFA. |
| `NCCL_MIN_CHANNELS=xxx` | Recommended to leave unset and use the default. For example, on p4d/p4de the number of channels should be 8, which is the minimum for a 4-NIC platform. The reduction message is split by the number of GPUs in the job, then by the number of channels, so more channels than necessary produce smaller messages, which starves EFA of data. |
| `NCCL_BUFFSIZE=xxx` | Recommended to leave unset and use the default. |
| `RDMAV_FORK_SAFE=1` | Do not use. This is an RDMA-core environment variable. Prefer `FI_EFA_FORK_SAFE` (if it still makes sense for your Linux kernel version). The two look similar but behave very differently, especially on newer kernels, where `RDMAV_FORK_SAFE=1` can break things. |
| `RDMAV_*` | Do not use. |
| NCCL version | Recommended to use one of the stable releases. |
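Because several of these variables affect process-fork behavior, they must be in the environment before the training process (and any forked data-loader workers) starts. A minimal sketch, assuming a PyTorch-style script; setting the variable at the very top of the entry script is one option, though exporting it in the launch shell is equally valid:

```python
import os

# Disable libfabric huge pages before any worker processes are forked.
# This mitigates "OSError: Cannot allocate memory" from os.fork() with
# multi-process data loaders, at a minor performance cost.
os.environ["FI_EFA_USE_HUGE_PAGE"] = "0"

# From here on, imports and DataLoader worker processes inherit the setting.
```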

## 2. Sample Presets

The following table shows the environment variables you need to set for common library versions.

As of this writing, only p4 and p5 instances support both EFA and NVIDIA GPUDirect Remote Direct Memory Access (RDMA).

| Situation | Actions |
|---|---|
| p5.48xlarge | • cuda>=12.0<br>• nccl>=2.18.0 (recommend at least 2.18.5)<br>• aws-ofi-nccl>=1.7.2 (recommend at least 1.7.3)<br>• efa-installer>=1.29.0 (to avoid nccl>=2.19.0 raising libfabric errors) |
| Memory allocation errors | `export FI_EFA_USE_HUGE_PAGE=0` |
| • libfabric>=1.18.0<br>• aws-ofi-nccl>=1.7.0 | N/A |
| • aws-ofi-nccl>=1.6.0,<1.7.0<br>• p4/p5 instances | `export FI_EFA_USE_DEVICE_RDMA=1` |
| • aws-ofi-nccl>=1.6.0,<1.7.0<br>• EFA instances without RDMA | N/A |
| • aws-ofi-nccl<=1.5.0<br>• p4/p5 instances | `export FI_EFA_USE_DEVICE_RDMA=1`<br>`export FI_PROVIDER=efa`<br>`export NCCL_PROTO=simple` |
| • aws-ofi-nccl<=1.5.0<br>• EFA instances without RDMA | `export FI_PROVIDER=efa`<br>`export NCCL_PROTO=simple` |
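The version-dependent rows above can be encoded as a small lookup. A sketch with a hypothetical helper name (`efa_nccl_env` is illustrative, not part of any library), assuming the aws-ofi-nccl version is known as a tuple and GPUDirect RDMA availability (p4/p5) as a boolean:

```python
def efa_nccl_env(ofi_nccl_version, gpudirect_rdma):
    """Return the env vars suggested by the preset table above.

    ofi_nccl_version: aws-ofi-nccl version, e.g. (1, 7, 2).
    gpudirect_rdma:   True on p4/p5 instances (EFA + GPUDirect RDMA).

    Hypothetical helper: name and structure are illustrative only.
    """
    env = {}
    if ofi_nccl_version <= (1, 5, 0):
        # Old plugin: pin the provider and protocol explicitly.
        env["FI_PROVIDER"] = "efa"
        env["NCCL_PROTO"] = "simple"
        if gpudirect_rdma:
            env["FI_EFA_USE_DEVICE_RDMA"] = "1"
    elif ofi_nccl_version < (1, 7, 0):
        # 1.6.x: only the RDMA toggle, and only on p4/p5.
        if gpudirect_rdma:
            env["FI_EFA_USE_DEVICE_RDMA"] = "1"
    # aws-ofi-nccl>=1.7.0 with libfabric>=1.18.0: nothing to set.
    return env
```

Note that `FI_EFA_USE_HUGE_PAGE=0` is orthogonal to these presets: it is added only when memory-allocation errors appear, regardless of plugin version.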