NCCOM Tests

nccom-test is a benchmarking tool for evaluating the performance of Collective Communication operations on one or more Neuron instances. It is compatible with both trn1 and inf2 instance types, and it also serves as a fast sanity check of the environment before running a more complex workload.

Understanding nccom-test output

The command outputs a table of performance metrics, with one row for every requested data size (by default, a single size of 32MB).

Column name Description
size(B) Size in bytes of the data involved in this operation
count(elems) Number of elements in the data involved in this operation. For example, if size(B) is 4 and type is fp32, count is 1, because a single fp32 element was processed.
type Data type for the processed data. Can be: uint8, int8, uint16, int16, fp16, bf16, int32, uint32, fp32
time(us) P50 (median) duration, in microseconds, of the Collective Communication operations executed during the benchmark.
algbw(GB/s) Algorithm bandwidth in gibibytes per second (1GiB = 1,073,741,824 bytes), calculated as size(B)/time(us) and converted to GiB/s.
busbw(GB/s) Bus bandwidth: the bandwidth per data line, in gibibytes per second. Unlike algbw, it is independent of the number of ranks. For a more in-depth explanation of bus bandwidth, refer to NVIDIA’s nccl-tests documentation.
Avg bus bandwidth Average of the values in the busbw column
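
The relationship between the columns above can be sketched in a few lines of Python. This is an illustration, not nccom-test's own code: the helper names are hypothetical, the element sizes are the usual widths of the listed data types, and the busbw correction factors are the ones published in NVIDIA's nccl-tests documentation, which nccom-test is assumed (not confirmed) to follow.

```python
# Element widths in bytes for the data types listed in the table above.
DTYPE_BYTES = {
    "uint8": 1, "int8": 1,
    "uint16": 2, "int16": 2, "fp16": 2, "bf16": 2,
    "uint32": 4, "int32": 4, "fp32": 4,
}

GIB = 1_073_741_824  # bytes per GiB, as stated in the algbw description

# ASSUMPTION: busbw = algbw * factor(nranks), using the factors from
# NVIDIA's nccl-tests documentation. Not confirmed for nccom-test.
BUSBW_FACTOR = {
    "all_reduce": lambda n: 2 * (n - 1) / n,
    "all_gather": lambda n: (n - 1) / n,
    "reduce_scatter": lambda n: (n - 1) / n,
}


def count_elems(size_bytes: int, dtype: str) -> int:
    """Number of elements of `dtype` that fit in `size_bytes`."""
    return size_bytes // DTYPE_BYTES[dtype]


def algbw_gib_per_s(size_bytes: int, time_us: float) -> float:
    """size(B)/time(us), converted from bytes per microsecond to GiB/s."""
    return size_bytes / (time_us * 1e-6) / GIB


def busbw_gib_per_s(algbw: float, op: str, nranks: int) -> float:
    """Scale algbw by the (assumed) per-operation correction factor."""
    return algbw * BUSBW_FACTOR[op](nranks)


if __name__ == "__main__":
    # Made-up measurement, for illustration only: 32 MiB in 1000 us on 32 ranks.
    bw = algbw_gib_per_s(32 * 1024 * 1024, 1000.0)
    print(count_elems(4, "fp32"), bw, busbw_gib_per_s(bw, "all_reduce", 32))
```

The count example from the table (4 bytes of fp32 is one element) falls out of `count_elems(4, "fp32")`. The factor table deliberately omits sendrecv and alltoall, since their conventions are not stated in this document.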

CLI arguments

Argument Default value Description
<cc operation> N/A, required argument The type of Collective Communication operation to execute for this benchmark. Supported types: all_reduce / allr (All-Reduce), all_gather / allg (All-Gather), reduce_scatter / redsct (Reduce-Scatter), sendrecv (Send-Receive), alltoall (All-to-All).
-r, --nworkers N/A, required argument Total number of workers (ranks) to use
-N, --nnodes 1 Total number of nodes (instances) to use. The number of workers will be divided equally across all nodes. If this argument is greater than 1, the NEURON_RT_ROOT_COMM_ID environment variable must be set to the host address of the instance nccom-test is run on, followed by a free port number (for example: NEURON_RT_ROOT_COMM_ID=10.0.0.1:44444). Additionally, either -s, --hosts must be provided or a ~/hosts file must exist; for more details, refer to the -s, --hosts description below.
-b, --minbytes 32M The starting size for the benchmark
-e, --maxbytes 32M The end size for the benchmark. nccom-test will run benchmarks for all sizes between -b, --minbytes and -e, --maxbytes, increasing the size by either -i, --stepbytes or -f, --stepfactor with every run.
-i, --stepbytes (--maxbytes - --minbytes) / 10 Number of bytes by which to increase the benchmark's size on every subsequent run. For example, for this combination of arguments: -b 8 -e 16 -i 4, the benchmark will be run for the following sizes: 8 bytes, 12 bytes, 16 bytes.
-f, --stepfactor N/A Factor by which to multiply the benchmark's size on every subsequent run. For example, for this combination of argument values: -b 8 -e 32 -f 2, the benchmark will be run for the following sizes: 8 bytes, 16 bytes, 32 bytes.
-n, --iters 20 Number of Collective Communication operations to execute during the benchmark.
-w, --warmup_iters 5 Number of Collective Communication operations to execute as warmup during the benchmark (which won't be counted towards the result).
-d, --datatype uint8 Data type for the data used by the benchmark. Supported types: uint8, int8, uint16, int16, fp16, bf16, uint32, int32, fp32. Input data is zero-filled unless --check is provided (currently available only for --datatype fp32), in which case it is filled with a repeated value of the requested type.
-c, --check false If provided, the correctness of the operations will be checked. This will not impact results (time, algbw and busbw) but will slightly increase the overall execution time.
-s, --hosts N/A Hosts on which to run the benchmark. If not specified, nccom-test checks ~/hosts.
--non-interactive false Do not display any animation or progress indicator
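
To make the size sweep and the multi-node requirements concrete, here is a minimal Python sketch. It is not part of nccom-test: `sweep_sizes` and `build_invocation` are hypothetical helpers, the integer rounding of the default step and the single-size fallback when the step is zero are assumptions, and so is the order in which the flags and the operation appear on the command line. Only flags documented in the table above are used.

```python
def sweep_sizes(minbytes, maxbytes, stepbytes=None, stepfactor=None):
    """Sizes benchmarked between -b/--minbytes and -e/--maxbytes."""
    if stepbytes is not None and stepfactor is not None:
        raise ValueError("use either stepbytes or stepfactor, not both")
    if stepfactor is not None:
        if stepfactor <= 1:
            raise ValueError("stepfactor must be greater than 1")
        sizes, size = [], minbytes
        while size <= maxbytes:
            sizes.append(size)
            size *= stepfactor
        return sizes
    if stepbytes is None:
        # Documented default: (--maxbytes - --minbytes) / 10.
        # ASSUMPTION: rounded down to a whole number of bytes.
        stepbytes = (maxbytes - minbytes) // 10
    if stepbytes <= 0:
        # ASSUMPTION: a zero step (e.g. min == max) means a single run.
        return [minbytes]
    return list(range(minbytes, maxbytes + 1, stepbytes))


def build_invocation(op, nworkers, nnodes=1, root_addr=None, port=None):
    """Return (extra_env, argv) for a run; argument order is an assumption."""
    env = {}
    if nnodes > 1:
        if root_addr is None or port is None:
            raise ValueError("multi-node runs need a root address and a free port")
        # Required when -N/--nnodes is greater than 1.
        env["NEURON_RT_ROOT_COMM_ID"] = f"{root_addr}:{port}"
    argv = ["nccom-test", "-r", str(nworkers), "-N", str(nnodes), op]
    return env, argv


if __name__ == "__main__":
    print(sweep_sizes(8, 16, stepbytes=4))    # the -b 8 -e 16 -i 4 example
    print(sweep_sizes(8, 32, stepfactor=2))   # the -b 8 -e 32 -f 2 example
    print(build_invocation("allr", 64, nnodes=2, root_addr="10.0.0.1", port=44444))
```

The two `sweep_sizes` calls reproduce the sequences given in the -i and -f descriptions (8, 12, 16 and 8, 16, 32). A multi-node run additionally needs either -s, --hosts or a ~/hosts file, which this sketch leaves to the caller because the expected format is not described here.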