# EFA Cheatsheet

## 1. Settings via environment variables

For optimized performance, you may need to set additional environment variables, depending on your libfabric version.

| Setting | Description |
|---|---|
| `FI_EFA_USE_HUGE_PAGE=0` | Set to 0 when you see `os.fork()` fail with `OSError: Cannot allocate memory`, typically caused by multi-process PyTorch data loaders. Disabling huge pages incurs a minor performance hit, but it prevents fork failures when the operating system runs out of huge pages. |
| `FI_EFA_FORK_SAFE=1` | Not needed anymore. It used to be needed for kernel<5.15 (see this). However, all reasonably recent versions of the plugin (since at least v1.2, and probably even older) always set this flag for supported versions of Libfabric, regardless of kernel version. |
| `FI_EFA_USE_DEVICE_RDMA=1` | Do not set for libfabric>=1.18.0 and aws-ofi-nccl>=1.7.0. Setting it on p4/p5 instances with the newer software is not harmful, but it is unnecessary. |
| `FI_EFA_ENABLE_SHM_TRANSFER=1` | Not needed. Libfabric disables SHM transfer when the application sets `FI_OPT_CUDA_API_PERMITTED` to false, which this plugin does (see the discussion here). |
| `FI_PROVIDER=efa` | Use for aws-ofi-nccl<=1.5.0 AND EFA-enabled GPU instances. |
| `NCCL_PROTO=simple` | Use for aws-ofi-nccl<=1.5.0 AND EFA-enabled GPU instances. |
| `NCCL_SOCKET_NTHREADS` | Not applicable for EFA. |
| `NCCL_NSOCKS_PERTHREAD` | Not applicable for EFA. |
| `NCCL_MIN_CHANNELS=xxx` | Recommended to leave unset and use the default. For example, on p4d/p4de the number of channels should be 8, which is the minimum for a 4-NIC platform. The reduction message is split by the number of GPUs in the job, then by the number of channels, so more channels than necessary produce smaller messages, which starves EFA of data. |
| `NCCL_BUFFSIZE=xxx` | Recommended to leave unset and use the default. |
| `RDMAV_FORK_SAFE=1` | Do not use. This is an RDMA-core environment variable. Prefer `FI_EFA_FORK_SAFE` (if it still makes sense for your Linux kernel version). The two look similar but behave very differently, especially on newer kernels, where `RDMAV_FORK_SAFE=1` can break things. |
| `RDMAV_*` | Do not use. |
| NCCL version | Recommended to use one of the stable releases. |
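Because several of these variables affect process-fork behavior, they must be in the environment before the training process (and any forked data-loader workers) starts. A minimal sketch, assuming a PyTorch-style script; setting the variable at the very top of the entry script is one option, though exporting it in the launch shell is equally valid:

```python
import os

# Disable libfabric huge pages before any worker processes are forked.
# This mitigates "OSError: Cannot allocate memory" from os.fork() with
# multi-process data loaders, at a minor performance cost.
os.environ["FI_EFA_USE_HUGE_PAGE"] = "0"

# From here on, imports and DataLoader worker processes inherit the setting.
```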

## 2. Sample Presets

The following table shows the environment variables you need to set for common library versions.

As of this writing, only p4 and p5 instances support both EFA and NVIDIA GPUDirect Remote Direct Memory Access (RDMA).

| Situation | Actions |
|---|---|
| p5.48xlarge | • cuda>=12.0<br>• nccl>=2.18.0 (recommend at least 2.18.5)<br>• aws-ofi-nccl>=1.7.2 (recommend at least 1.7.3)<br>• efa-installer>=1.29.0 (to avoid nccl>=2.19.0 raising libfabric errors) |
| Memory allocation errors | `export FI_EFA_USE_HUGE_PAGE=0` |
| • libfabric>=1.18.0<br>• aws-ofi-nccl>=1.7.0 | N/A |
| • aws-ofi-nccl>=1.6.0,<1.7.0<br>• p4/p5 instances | `export FI_EFA_USE_DEVICE_RDMA=1` |
| • aws-ofi-nccl>=1.6.0,<1.7.0<br>• EFA instances without RDMA | N/A |
| • aws-ofi-nccl<=1.5.0<br>• p4/p5 instances | `export FI_EFA_USE_DEVICE_RDMA=1`<br>`export FI_PROVIDER=efa`<br>`export NCCL_PROTO=simple` |
| • aws-ofi-nccl<=1.5.0<br>• EFA instances without RDMA | `export FI_PROVIDER=efa`<br>`export NCCL_PROTO=simple` |
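The version-dependent rows above can be encoded as a small lookup. A sketch with a hypothetical helper name (`efa_nccl_env` is illustrative, not part of any library), assuming the aws-ofi-nccl version is known as a tuple and GPUDirect RDMA availability (p4/p5) as a boolean:

```python
def efa_nccl_env(ofi_nccl_version, gpudirect_rdma):
    """Return the env vars suggested by the preset table above.

    ofi_nccl_version: aws-ofi-nccl version, e.g. (1, 7, 2).
    gpudirect_rdma:   True on p4/p5 instances (EFA + GPUDirect RDMA).

    Hypothetical helper: name and structure are illustrative only.
    """
    env = {}
    if ofi_nccl_version <= (1, 5, 0):
        # Old plugin: pin the provider and protocol explicitly.
        env["FI_PROVIDER"] = "efa"
        env["NCCL_PROTO"] = "simple"
        if gpudirect_rdma:
            env["FI_EFA_USE_DEVICE_RDMA"] = "1"
    elif ofi_nccl_version < (1, 7, 0):
        # 1.6.x: only the RDMA toggle, and only on p4/p5.
        if gpudirect_rdma:
            env["FI_EFA_USE_DEVICE_RDMA"] = "1"
    # aws-ofi-nccl>=1.7.0 with libfabric>=1.18.0: nothing to set.
    return env
```

Note that `FI_EFA_USE_HUGE_PAGE=0` is orthogonal to these presets: it is added only when memory-allocation errors appear, regardless of plugin version.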