nccom-test is a benchmarking tool for quickly evaluating the performance of Collective Communication operations on one or more Neuron instances. It is compatible with both trn1 and inf2 instance types. It can also be used as a fast sanity check of the environment before attempting to run a more complex workload.
The command outputs a table whose columns contain performance metrics, with one row for every requested data size (by default the data size is 32MB, as seen in the previous example).
| Column name | Description |
|---|---|
| size(B) | Size in bytes for the data involved in this operation |
| count(elems) | Number of elements in the data involved in this operation. For example, if size(B) is 4 and type is fp32, then count will be 1 since one single fp32 element has been processed. |
| type | Data type for the processed data. Can be: uint8, int8, uint16, int16, fp16, bf16, int32, uint32, fp32 |
| time(us) | Time in microseconds representing the P50 of all durations for the Collective Communication operations executed during the benchmark. |
| algbw(GB/s) | Algorithm bandwidth in gibibytes (1GiB = 1,073,741,824 bytes) per second, calculated as size(B)/time(us) |
| busbw(GB/s) | Bus bandwidth: the bandwidth per data line in gibibytes per second. It provides a bandwidth number that is independent of the number of ranks (unlike algbw). For a more in-depth explanation of bus bandwidth, please refer to NVIDIA’s nccl-tests documentation. |
| Avg bus bandwidth | Average of the values in the busbw column |
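
The relationships between the columns above can be sketched in a few lines of Python. This is an illustrative sketch, not nccom-test's own code: the function names are hypothetical, the conversion to gibibytes per second follows the column descriptions, and the All-Reduce factor of 2(n-1)/n is the one given in NVIDIA's nccl-tests documentation. That nccom-test applies exactly the same factor is an assumption.

```python
# Illustrative helpers only; names are hypothetical and not part of nccom-test.

GIB = 1_073_741_824  # bytes in one gibibyte, as stated in the algbw description

# Element width in bytes for each data type listed in the "type" column.
DTYPE_BYTES = {
    "uint8": 1, "int8": 1,
    "uint16": 2, "int16": 2, "fp16": 2, "bf16": 2,
    "int32": 4, "uint32": 4, "fp32": 4,
}


def element_count(size_bytes: int, dtype: str) -> int:
    """count(elems): number of elements of `dtype` that fit in size(B)."""
    return size_bytes // DTYPE_BYTES[dtype]


def algbw_gib_per_s(size_bytes: int, time_us: float) -> float:
    """algbw: size(B)/time(us), converted from bytes per microsecond to GiB/s."""
    return size_bytes * 1e6 / time_us / GIB


def allreduce_busbw(algbw: float, nranks: int) -> float:
    """busbw for All-Reduce using the nccl-tests factor 2*(n-1)/n.

    Assumption: nccom-test uses the same correction factor as nccl-tests.
    """
    return algbw * 2 * (nranks - 1) / nranks
```

For instance, `element_count(4, "fp32")` returns 1, which matches the example in the count(elems) row.
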
| Argument | Default value | Description |
|---|---|---|
| `<cc operation>` | N/A, required argument | The type of Collective Communication operation to execute for this benchmark. Supported types: all_reduce / allr (All-Reduce), all_gather / allg (All-Gather), reduce_scatter / redsct (Reduce-Scatter), sendrecv (Send-Receive), alltoall (All-to-All) |
| -r, --nworkers | N/A, required argument | Total number of workers (ranks) to use |
| -N, --nnodes | 1 | Total number of nodes (instances) to use. The number of workers will be divided equally across all nodes. If this argument is greater than 1, the NEURON_RT_ROOT_COMM_ID environment variable needs to be set to the host address of the instance nccom-test is run on, and a free port number (for example: NEURON_RT_ROOT_COMM_ID=10.0.0.1:44444). Additionally, either -s, --hosts needs to be provided or a ~/hosts file needs to exist. For more details, refer to the -s, --hosts description below. |
| -b, --minbytes | 32M | The starting size for the benchmark |
| -e, --maxbytes | 32M | The end size for the benchmark. nccom-test will run benchmarks for all sizes between -b, --minbytes and -e, --maxbytes, increasing the size by either -i, --stepbytes or -f, --stepfactor with every run. |
| -i, --stepbytes | (--maxbytes - --minbytes) / 10 | Number of bytes by which to increase the benchmark's size on every subsequent run. For example, for this combination of arguments: -b 8 -e 16 -i 4, the benchmark will be run for the following sizes: 8 bytes, 12 bytes, 16 bytes. |
| -f, --stepfactor | N/A | Factor by which to increase the benchmark's size on every subsequent run. For example, for this combination of argument values: -b 8 -e 32 -f 2, the benchmark will be run for the following sizes: 8 bytes, 16 bytes, 32 bytes. |
| -n, --iters | 20 | Number of Collective Communication operations to execute during the benchmark. |
| -w, --warmup_iters | 5 | Number of Collective Communication operations to execute as warmup during the benchmark (which won't be counted towards the result). |
| -d, --datatype | uint8 | Data type for the data used by the benchmark. Supported types: uint8, int8, uint16, int16, fp16, bf16, uint32, int32, fp32. Input data will be zero filled, unless --check is provided (currently, only available for --datatype fp32) in which case it will be filled by a repeated value of the requested type. |
| -c, --check | false | If provided, the correctness of the operations will be checked. This will not impact results (time, algbw and busbw) but will slightly increase the overall execution time. |
| -s, --hosts | N/A | Hosts on which to run execution. Checks ~/hosts if not specified. |
| --non-interactive | false | Do not display any animation or progress indicator |
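
The size sweep described for -b, -e, -i and -f can be sketched as follows. This is a minimal sketch that reproduces the two examples in the table, not the tool's actual implementation. The function name `benchmark_sizes` is hypothetical, and the handling of edge cases (a default step that rounds down to zero, or both step options given at once) is an assumption.

```python
from typing import List, Optional


def benchmark_sizes(
    minbytes: int,
    maxbytes: int,
    stepbytes: Optional[int] = None,
    stepfactor: Optional[int] = None,
) -> List[int]:
    """Return the data sizes a sweep from minbytes to maxbytes would visit.

    Hypothetical helper mirroring the -b/-e/-i/-f descriptions; edge-case
    behavior is assumed, not taken from nccom-test.
    """
    if minbytes > maxbytes:
        raise ValueError("minbytes must not exceed maxbytes")
    if stepbytes is not None and stepfactor is not None:
        raise ValueError("use either stepbytes or stepfactor, not both")

    sizes = []
    size = minbytes

    if stepfactor is not None:
        # Multiplicative sweep: the size is multiplied by the factor each run.
        if stepfactor <= 1 or minbytes <= 0:
            raise ValueError("stepfactor must be > 1 and minbytes must be > 0")
        while size <= maxbytes:
            sizes.append(size)
            size *= stepfactor
        return sizes

    # Additive sweep: fall back to the documented default step when none is given.
    if stepbytes is None:
        stepbytes = (maxbytes - minbytes) // 10  # integer division is an assumption
    if stepbytes <= 0:
        # Assumed behavior: with no usable step, only the starting size is run.
        return [minbytes]
    while size <= maxbytes:
        sizes.append(size)
        size += stepbytes
    return sizes
```

Calling `benchmark_sizes(8, 16, stepbytes=4)` yields `[8, 12, 16]` and `benchmark_sizes(8, 32, stepfactor=2)` yields `[8, 16, 32]`, matching the examples in the -i and -f rows. When the minimum and maximum sizes are equal, as with the defaults, the sketch returns a single size.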