ClusterKit - a multifaceted node assessment tool for high-performance clusters. ClusterKit is currently
capable of testing latency, bandwidth, effective bandwidth, memory bandwidth,
as well as bandwidth and latency between GPUs and local/remote memory.
ClusterKit employs well-known techniques and tests to arrive at these numbers and is intended
to give the user a general look at the health and performance of a cluster.


Output
------

Output for all tests will be placed in a separate directory for each run, which is named with the date and time
of the run.  Each test will output either a txt or a csv file, plus a json file for further data analysis.


# If CK_OUTPUT_SUBDIR is unset, the default subdirectory name format, '%Y%m%d_%H%M%S', is used.
# If it is set to "", the output directory specified on the command line is used as is;
# output files then all go to the same directory, each tagged with the date and time.
export CK_OUTPUT_SUBDIR=""

TS=$(date +"%Y%m%d-%H%M%S")
export CK_FILE_SUFFIX="_${TS}"
OUTDIR=$PROJ_DIR/clusterkit/sites/$CLUSTER/output/${TS}


Command Line Options
--------------------

Command line flags are parsed by the parse_options function, which fills an Options struct with the appropriate values.
Different tests have different options associated with them.

Further information on the options can be found in each test's description.

Options denoted (PAIRWISE) only function or are relevant in pairwise tests.
Pairwise tests must be run with one process per node and require an even number of processes (at least two).

The default MODE is a full check of all pairs of nodes for pairwise tests. Only one mode may be active at a time.

  -q, --quick                           (PAIRWISE) MODE - checks only [number of nodes / 2] links, between the upper
                                                          and lower half of the allocated nodes.
  -f, --fromfile=<path>                 (PAIRWISE) MODE - specify which links to test in which rounds from a file
                                                          -- file format must be "[round] [node 1] [node 2]\n"
                                                          for every link to be tested

                                                          Example:  1 machine02 machine10
                                                                    1 machine03 machine07
                                                                    2 machine02 machine03

This will test (machine02, machine10) and (machine03, machine07) in round 1 and (machine02, machine03) in round 2.
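For illustration, the per-round grouping implied by this file format can be sketched as follows (a hypothetical `parse_fromfile` helper; the actual ClusterKit parser is internal):

```python
from collections import defaultdict

def parse_fromfile(text):
    """Group "[round] [node 1] [node 2]" lines into links per round.
    Illustrative sketch only, not ClusterKit's own parser."""
    rounds = defaultdict(list)
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        rnd, a, b = parts
        rounds[int(rnd)].append((a, b))
    return dict(rounds)

links = parse_fromfile("1 machine02 machine10\n"
                       "1 machine03 machine07\n"
                       "2 machine02 machine03\n")
```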

  -d, --testcase=<lat | bw | mb | beff_o | beff_or | gpumb | gpu_gpu_lat | gpu_gpu_bw | gpu_host_lat |
                  gpu_host_bw | gpu_neighbor_lat | gpu_neighbor_bw | noise | bisection_bw | nccl_bw | nccl_lat |
                  nccl_bcast | nccl_allreduce | nccl_reduce | nccl_allgather | nccl_reducescatter>

                                        specify test to be run.  You can include multiple tests by repeating
                                        -d or --testcase= multiple times on the command line
                                        -- default: the basic bw/lat/mb/effective bandwidth tests are run

  -D, --outputdir=<dir>                 select name of output directory -- default: date and time of run

  -a, --gpubwi=<iters>                  specify number of iterations for GPU bandwidth (GPU-GPU and GPU-host)
                                        -- default: 64 iterations

  -A, --gpubws=<size>                   specify size of GPU bandwidth (GPU-GPU and GPU-host) message in bytes
                                        -- default: 1 MB

  -b, --biters=<iters>                  specify number of iterations for bandwidth
                                        ("auto" for auto-determined iterations)
                                        -- default: 16 iterations

  -B, --bsize=<size>                    specify size of bandwidth message in bytes
                                        -- default: 32 MB

  -c, --recreatelog                     (PAIRWISE) creates a text file with all bad links to be run by a program to
                                        verify results (e.g. with ib_write_bw and ib_write_lat)

  -e, --beffi=<iters>                   specify number of iterations for effective bandwidth
                                        -- default: 512 iterations

  -E, --beffs=<size>                    specify size of effective bandwidth message in bytes
                                        -- default: 32 MB

  -h, --help                            help

  -i, --mbiters=<iters>                 specify number of iterations for memory bandwidth -- default: 16 iterations

  -I, --mbsize=<size>                   specify size of the memory bandwidth arrays in bytes
                                        -- default: 4 times the L3 cache size on the respective machine

  -j, --gpumbi=<iters>                  specify number of iterations for GPU to host memory bandwidth
                                        -- default: 16 iterations

  -J, --gpumbs=<size>                   specify size of GPU to host memory bandwidth array in bytes
                                        -- default: 0.01 times the available global GPU memory

  -k, --gpulati=<iters>                 specify number of iterations for GPU latency (GPU-GPU and GPU-host)
                                        -- default: 4096 iterations

  -K, --gpulats=<size>                  specify size of GPU latency (GPU-GPU and GPU-host) message in bytes
                                        -- default: 0 B

  -l, --liters=<iters>                  specify number of iterations for latency
                                        -- default: 4096 iterations

  -L, --lsize=<size>                    specify size of latency message in bytes
                                        -- default: 0 bytes

  -m, --memtest=<add/copy/scale/triad>  specify which memory bandwidth test you would like to run
                                        -- default: Triad

  -o, --output=<txt/csv>                txt or csv file output

  -p, --printmatrix                     print a colorful matrix of the results to the terminal

  -r, --recover=<path>                  (PAIRWISE) recover state and finish tests from a previous interrupted run,
                                        argument is directory of the previous run (will automatically search
                                        for recovery files)

  -s, --saveinterval=<interval>         (PAIRWISE) time interval (in seconds) at which the program will save state
                                        for later recovery -- default: 600 seconds

  -S, --scope_info=<path>               Specify location of scope information file.
                                        Format:  host,scope_name,scope_num

  -T, --tag=<tag>                       tags output json and text files with the given tag -- default: ""

  -t, --ltol=<tolerance>                specify tolerance multiplier (multiplied by the minimum) above which
                                        latency will be considered "bad" (must be at least 1)
                                        -- default: 1.6

  -u, --btol=<tolerance>                specify tolerance multiplier (multiplied by the maximum) below which bandwidth
                                        will be considered "bad" (greater than 0, less than or equal to 1)
                                        -- default: 0.95

  -v, --verbose                         verbose

  -x, --disable-retest                  (PAIRWISE) tests will only run the initial phase, no retesting.

  -y, --bycore                          (PAIRWISE) test will be run on all cores rather than by one process per node.

  -z, --bygpu                           GPU to GPU test will be run on each GPU separately.
                                        E.g., GPU0-GPU0 on all nodes, then GPU1-GPU1, etc.

  -N, --noise_iter=<iters>              specify number of iterations for system noise
                                        -- default: 10 iterations

  -M, --noise_histo=<size>              specify the size of the histogram
                                        -- default: 10000000

  -g, --noise_samples=<size>            specify the number of calibration samples
                                        -- default: 100000000

      --randomize_ranks                 randomize node relative ranks for bisectional_bw test

      --scope_order=<path>              for bisectional bandwidth test specify which scope pairs
                                        to test in which round -- file format must be
                                        "[round],[scope_name 1],[scope_name 2]"
                                        for every scope-to-scope link you wish to test.

      --with-stress[=types]             do hardware stress testing while running tests. types is
                                        a comma-separated list of cpu,gpu,all (all is the default).
                                        Will generate file stress-data.csv with time series with
                                        GPU power and temperature and HCA temperature data.

      --topo=<file>                     specify topology data file to use in assessing link "badness"
                                        file format must be
                                        "[scope_name],[scope_num],[oversubscription_factor]"
                                        [scope_name] and [scope_num] are the same as in scope_info file,
                                        [oversubscription_factor] is a ratio (float) of total uplink
                                        bandwidth to total downlink bandwidth for a TOR switch (a single
                                        scope is a single TOR switch).

General Options and Features
----------------------------

A default run without any options will run a basic set of tests in FULL mode, with default message sizes and iteration counts.

Use the -h or --help flags for information about options.

Timing: MPI_Wtime and time.h are not precise enough for measuring short intervals, so a high resolution clock
is used instead. Considerable effort has been made to choose buffer size and iteration count defaults that
give reliable results, although if there is any doubt, one may always increase buffer sizes and number of iterations.

Rank 0 orchestrates the entire testing process, for instance delegating which pairs of nodes will be testing
and/or collecting results in a given round of testing.

Mode: either FULL, QUICK, or CUSTOM. These apply only to pairwise tests and refer to the testing scheme.

A FULL test will test every possible pair of nodes.  For a system with n nodes, this is n * (n - 1) / 2 links.
FULL mode is the default; it begins by testing each link, repeating for (n - 1) rounds.
When this is done, it assesses the links for poor performance, which may occur for a variety of reasons,
including a congested link.  In that case two node pairs may be communicating over paths that share a link on the
fabric, which typically halves the bandwidth.  Any context switch from the scheduler can also adversely affect
performance.  In latency measurements, which are very short-lived, this can have a measurable effect on the result.
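The FULL-mode schedule described above can be sketched with the classic round-robin "circle method", which covers all n * (n - 1) / 2 links in n - 1 rounds of disjoint pairs (an illustrative sketch, not ClusterKit's actual scheduler):

```python
def full_schedule(n):
    """Circle-method round robin: n nodes, n - 1 rounds,
    n // 2 disjoint links per round, n * (n - 1) / 2 links total.
    Sketch of the scheme only."""
    assert n % 2 == 0
    nodes = list(range(n))
    rounds = []
    for _ in range(n - 1):
        rounds.append([(nodes[i], nodes[n - 1 - i]) for i in range(n // 2)])
        # rotate every node except the first
        nodes = [nodes[0]] + [nodes[-1]] + nodes[1:-1]
    return rounds

sched = full_schedule(8)
total = sum(len(r) for r in sched)  # 8 * 7 / 2 = 28 links
```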

Thus, to ensure that the link is actually deficient rather than congested or otherwise operating normally, it must
be retested. In the second phase, poorly performing links are split into batches that are tested one after another,
so that only non-conflicting links are tested at a time.
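The batching idea can be sketched as a greedy grouping in which no node appears twice within a batch (a hypothetical helper; the real batching logic may differ):

```python
def batch_nonconflicting(links):
    """Greedily split links into batches where no node appears twice,
    so each batch can be retested concurrently without shared endpoints.
    Sketch of the idea only."""
    batches = []  # list of (batch, set-of-used-nodes) pairs
    for a, b in links:
        for batch, used in batches:
            if a not in used and b not in used:
                batch.append((a, b))
                used.update((a, b))
                break
        else:
            batches.append(([(a, b)], {a, b}))
    return [batch for batch, _ in batches]

batches = batch_nonconflicting([("n0", "n1"), ("n1", "n2"),
                                ("n2", "n3"), ("n0", "n2")])
```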

If topology data is provided (--topo=<file> option), then oversubscription (use of a shared link by multiple
communication paths) of TOR switches is taken into account when assessing links for poor performance.

The --topo=<file> option specifies the topology data file to use in assessing link "badness". The file format must be
              "[scope_name],[scope_num],[oversubscription_factor]"
where [scope_name] and [scope_num] are the same as in the scope_info file, and [oversubscription_factor] is the ratio (float)
of total uplink bandwidth to total downlink bandwidth for a TOR switch (a single scope is assumed to be a single TOR switch).

A QUICK test (-q, --quick) is a short, single-round test in which the lower half of the nodes is tested against
the upper half, for a total of exactly n/2 links. E.g., in a system of 32 nodes, the links would be
0-16, 1-17, 2-18, etc.  Poorly performing links are then retested individually.
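The QUICK pairing can be sketched as follows (illustrative only):

```python
def quick_links(n):
    """QUICK mode: pair node i in the lower half with node i + n//2,
    giving exactly n//2 links (sketch of the pairing described above)."""
    return [(i, i + n // 2) for i in range(n // 2)]

links = quick_links(32)  # (0, 16), (1, 17), ..., (15, 31)
```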

CUSTOM mode (-f<path>, --fromfile=<path>) allows the user to specify exactly which links should be tested in each round,
taken from a text file. The program will test until it reaches the end of the file and then report the results.
No retesting is done unless specified in the file.
The file format is "[round number] [node 1] [node 2]".

Example: a file with the format below will test machine07-machine03 and machine09-machine02 in round 1, and machine02-machine03 in round 2.

1 machine07 machine03
1 machine09 machine02
2 machine02 machine03

Output: -o<txt/csv>, --output=<txt/csv>. Results will be printed either in plaintext or CSV format to the output text file.

Output directory: -D<dir>, --outputdir=<dir>. Output directory will be given argument. The default is the timestamp.

Tagging: -T<tag>, --tag=<tag>. Output files will be tagged with the given tag. Default will be tagged with "".

Data: -d<testcase>, --testcase=<testcase>, where <testcase> is one of lat, bw, mb, beff_o, beff_or, gpumb,
gpu_gpu_lat, gpu_gpu_bw, gpu_host_lat, gpu_host_bw, gpu_neighbor_lat, gpu_neighbor_bw, noise, bisection_bw,
nccl_bw, nccl_lat, nccl_bcast, nccl_allreduce, nccl_reduce, nccl_allgather, nccl_reducescatter.
This option selects the tests to be run. It can be repeated as many times as desired to run any combination of the tests in one run.

	lat = latency
	bw = bandwidth
	mb = memory bandwidth
	beff_o = ordered ring bandwidth
	beff_or = random ring bandwidth (actually the average of the ordered and random rings)
	gpumb = GPU memory bandwidth
	gpu_gpu_lat = GPU - GPU latency
	gpu_gpu_bw = GPU - GPU bandwidth
	gpu_host_lat = GPU - Host latency
	gpu_host_bw = GPU - Host bandwidth
	gpu_neighbor_lat = GPU neighbor latency
	gpu_neighbor_bw = GPU neighbor bandwidth
	noise = system noise
	bisection_bw = bisectional bandwidth
	nccl_bw = NCCL bandwidth
	nccl_lat = NCCL latency
	nccl_bcast = NCCL broadcast
	nccl_allreduce = NCCL AllReduce
	nccl_reduce = NCCL Reduce
	nccl_allgather = NCCL AllGather
	nccl_reducescatter = NCCL ReduceScatter

Recovery: -r<dir>, --recover=<dir>. Only available in pairwise testing.
This continues from where a previous run was terminated. Throughout testing, the program saves interim results into recovery files
in a recovery directory within the run directory. Two recovery files (cur/prev) are maintained; a successfully written file
ends with a validation integer. On recovery, the program first attempts the latest file, then the previous one. A recovery can
therefore only be performed on a FULL test terminated after the first phase has finished, although it is recommended to wait until
the second phase is over and the individual testing phase has begun, because recovery skips immediately to individual testing and
the goal is to minimize individual testing as much as possible. This feature exists to help with overly long testing times: a run
may be started, terminated, and continued later. If recovery files are not found, the program proceeds with a normal FULL test.
Recovery files are cleaned up at the end of a successful run.

Save Interval: -s<interval>, --saveinterval=<interval>. The interval in seconds at which the program saves state during the
individual testing phase for later recovery. The default is 600 seconds.

Verbose: -v, --verbose. Prints progress through the program to terminal.

Print matrix: -p, --printmatrix. Prints a color-coded matrix of the results to the terminal.
Not recommended for clusters above 128 nodes, as the matrix becomes unreadable.

Disable retesting: -x, --disable-retest. This will prevent FULL and QUICK mode pairwise tests from retesting.

By core: -y, --bycore. Tests that typically run with one rank per node will instead run on every core. Not recommended; if you do
use this option, be sure to adjust the tolerance accordingly, since intranode communication is likely to have very different
characteristics from internode communication. This option does not apply to GPU tests, as multiple cores attempting to access the
same GPU may cause errors.

Recreate log: -c, --recreatelog. For pairwise tests. Generates one text file per pairwise test in a recreate directory within the
run directory, containing a list of the bad links found. This may be used to verify the program's results with another tool.
An associated program, recreate.py, is included specifically to recreate the latency and bandwidth results using ib_write_bw and
ib_write_lat. Keep in mind that ib_write_bw is unidirectional while this test is bidirectional, so the value from ClusterKit
should be approximately twice the value of ib_write_bw.

Tolerance: a multiplier on the data extremum that every datum is compared to. In a pairwise test, whether data will be
marked "bad" is based on this, which will also impact whether data is retested.  For example, a tolerance of 1.6 with
higher_is_better marked as FALSE means that data greater than 1.6 times the minimum datum will be marked bad. Similarly,
higher_is_better marked as TRUE will mark data less than the tolerance times the maximum. Different tests have different
options to change their tolerance.
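The tolerance rule can be sketched as follows (a hypothetical `mark_bad` helper mirroring the documented rule, not ClusterKit's internal code):

```python
def mark_bad(values, tolerance, higher_is_better):
    """Flag values outside tolerance * extremum, per the rule above.
    Sketch only."""
    if higher_is_better:
        threshold = tolerance * max(values)   # e.g. 0.95 * max for bandwidth
        return [v < threshold for v in values]
    threshold = tolerance * min(values)       # e.g. 1.6 * min for latency
    return [v > threshold for v in values]

# latency example: tolerance 1.6, lower is better
flags = mark_bad([1.0, 1.5, 2.0], 1.6, higher_is_better=False)
```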

Stress testing: --with-stress[=types]. Do hardware stress testing while running tests. types is a comma-separated list of cpu,gpu,all
(all is the default). Generates the file stress-data.csv with time series of GPU power and temperature and HCA temperature data.
Generating GPU data requires the 'nvidia-smi' command to be available. Generating HCA data requires running as root or being able
to run 'sudo mget_temp' without a password or other prompts.


Latency
-------

Pairwise.
Communicator: only one rank per node performs this test in a separate "headrank" communicator.
All other ranks wait in Barrier, unless overridden with the --bycore command line arg.

Options:
Iterations: -l<iters>, --liters=<iters>. This will specify the number of iterations that the test will perform. The default is 4096.
Message Size: -L<size>, --lsize=<size>. This will specify the message size for the test. The default is 0 Bytes.
Tolerance: -t<tolerance>, --ltol=<tolerance>. This will specify the tolerance multiplier. The default is 1.6.

The latency test is performed with a series of MPI_Sends and MPI_Recvs. One partner sends a message of size <size> to the other,
and then the other partner sends a message back. This is repeated <iters> times.
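The resulting one-way latency follows from the total round-trip time; as a sketch of the arithmetic only (the real test uses MPI calls and a high-resolution clock):

```python
def ping_pong_latency(round_trip_seconds, iters):
    """One-way latency from an iters-iteration ping-pong: each iteration
    is a full round trip, so halve the per-iteration time.
    Sketch of the calculation, not the actual implementation."""
    return round_trip_seconds / iters / 2.0

# e.g. 4096 iterations taking 8.192 ms in total -> 1 microsecond one-way
lat = ping_pong_latency(8.192e-3, 4096)
```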

Headrank communicator example: ranks marked o participate, ranks marked x do not.

Machine01	Machine 02	Machine03	Machine04
0 1 2 3 4	5 6 7 8 9	10 11 12 13 14	15 16 17 18 19
o x x x x	o x x x x	o  x  x  x  x	o  x  x  x  x


Bandwidth
----------

Pairwise.
Communicator: only one rank per node performs this test in a separate "headrank" communicator.
All other ranks wait in Barrier, unless overridden with the --bycore command line arg.

Options:
Iterations: -b<iters>, --biters=<iters>. This will specify the number of iterations that the test will perform. The default is 16. "auto" is also an acceptable option.
In this case, the program will automatically determine how low its iterations can be without falling below 0.995 of the result at 1024 iterations.
Message Size: -B<size>, --bsize=<size>. This will specify the message size for the test. The default is 32 MBytes.
Tolerance: -u<tolerance>, --btol=<tolerance>. This will specify the tolerance multiplier. The default is 0.95.

The bandwidth test is performed utilizing windows of nonblocking Isends and Irecvs. The default window size is 64, but will automatically reduce to the number
of iterations if iterations are lower than the window size. The iterations you input as an option will be divided by the window size to get the number of window iterations
that will be performed. Each partner posts a window of Isends, then posts a window of Irecvs, then Waitalls on all of them, and repeats.
This is the bidirectional bandwidth test.
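The window arithmetic, and the factor of two in the bidirectional accounting, can be sketched as follows (illustrative helpers under the assumptions stated in the comments, not the actual implementation):

```python
def window_plan(iters, window=64):
    """Reduce the window to the iteration count when iters < window, then
    divide the iterations by the window size to get the number of window
    passes, as described above. Sketch only."""
    window = min(window, iters)
    return window, iters // window

def bidirectional_bw(bytes_per_msg, iters, elapsed_seconds):
    """Bidirectional bandwidth: both partners send iters messages, so
    2 * size * iters bytes cross the link (assumed accounting, consistent
    with the note that results are roughly twice ib_write_bw)."""
    return 2.0 * bytes_per_msg * iters / elapsed_seconds

plan = window_plan(1024)  # (64, 16)
```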

Memory bandwidth
-----------------

Communicator: MPI_COMM_WORLD. Every rank performs this test.

Options:
Type of test: -m<add/copy/scale/triad>, --memtest=<add/copy/scale/triad>. Specifies the type of memory bandwidth test to perform. 'add' - adds two arrays into a third array.
'copy' - copies one array into another.  'scale' - multiplies one array by a scalar and stores the result in a second. 'triad' - performs all three.
'copy' and 'scale' use two arrays, while 'add' and 'triad' use three.
Iterations: -i<iters>, --mbiters=<iters>. This will specify the number of iterations that the test will perform. The default is 16.
Array size: -I<size>, --mbsize=<size>. This will specify the size of ONE of the three arrays.
The array is recommended to exceed the L3 cache size by at least a factor of 4. The default size is 4 times the L3 cache size.
Tolerance: uses bandwidth tolerance.

This memory bandwidth test is a variation on the STREAM memory bandwidth benchmark. The specific tests are described above in "Type of test".
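The four kernels can be sketched on plain Python lists; here 'triad' is the single combined kernel of the classic STREAM formulation (the real test operates on large arrays sized against the L3 cache):

```python
def stream_kernels(a, b, scalar=3.0):
    """STREAM-style kernels: copy, scale, add, triad. Sketch only."""
    copy  = [x for x in a]                          # out[i] = a[i]
    scale = [scalar * x for x in b]                 # out[i] = s * b[i]
    add   = [x + y for x, y in zip(a, b)]           # out[i] = a[i] + b[i]
    triad = [x + scalar * y for x, y in zip(a, b)]  # out[i] = a[i] + s * b[i]
    return copy, scale, add, triad

c, s, ad, t = stream_kernels([1.0, 2.0], [4.0, 5.0])
```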

Noise
------

Communicator: MPI_COMM_WORLD. Every rank performs this test.

Options:
Iterations:  -N, --noise_iter=<iters>. This will specify number of iterations for system noise. The default is 10.
Histogram Size:  -M, --noise_histo=<size>. This will specify the size of the histogram. The  default is 10000000.
# of Samples:  -g, --noise_samples=<size>. This will specify the number of calibration samples. The default is 100000000.
Tolerance: uses noise tolerance.

The noise test analyzes the gaps between consecutive samples of the CPU cycle counter, which makes it possible to detect the minimal number of cycles necessary for the sampling process.
Dividing the minimal number of cycles by the sampled number of cycles gives the efficiency of the node, which indicates how 'noisy' the node is.
A full explanation is available at: [mellanox wiki](https://wikinox.mellanox.com/display/SWX/Clusterkit+Noise)

Effective Bandwidth
--------------------

Communicator: only one rank per node performs this test in a separate "headrank" communicator.
All other ranks wait in Barrier, unless overridden with the --bycore command line arg.

Options:
Iterations: -e<iters>, --beffi=<iters>. This will specify the number of iterations that the test will perform. The default is 512.
Message Size: -E<size>, --beffs=<size>. This will specify the message size for the test. The default is 32 MBytes.
Tolerance: uses bandwidth tolerance.

This is a variation on the HLRS effective bandwidth test. The ordered and ordered/random tests are implemented as separate tests and run separately.
Rings of doubling size, starting at 2 and going up to the number of processes, are formed, and messages are passed in one direction
around the rings. The ordered test, as the name implies, creates the rings based on rank ordering. The random test uses only full-sized
rings, but with randomized order to stress the network. The results from all ring sizes are averaged together with a log-weighted
average (geometric mean) in both tests. The two available tests are "ordered" and "ordered and random", the latter of which takes
the log-weighted average of the ordered and random results.
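The log-weighted average mentioned above is simply a geometric mean; as a sketch:

```python
import math

def log_weighted_average(values):
    """Log-weighted average (geometric mean) used to combine the ring
    results, per the description above. Illustrative sketch."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

avg = log_weighted_average([100.0, 400.0])  # geometric mean = 200.0
```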


GPU Memory Bandwidth
---------------------

Communicator: Only n ranks per node perform this test in a separate "GPU" communicator, where n is the number of GPUs each node has.
All other ranks wait in Barrier.

Options:
Iterations: -j<iters>, --gpumbi=<iters>. This will specify the number of iterations that the test will perform. The default is 16.
Message Size: -J<size>, --gpumbs=<size>. This will specify the message size for the test. The default is 0.01 times the GPU global memory.
Tolerance: uses bandwidth tolerance.

This tests the memory bandwidth to copy from a local GPU to the local host. It runs for each of the available GPUs on each node.

GPU communicator example: ranks marked x do not participate.

Each machine has 2 GPUs. Those ranks participating are assigned one of the GPUs. (0 = GPU0, 1 = GPU1)
If the nodes have differing numbers of GPUs, the program will test the minimum number of GPUs found.

Machine01	Machine 02	Machine03	Machine04
0 1 2 3 4	5 6 7 8 9	10 11 12 13 14	15 16 17 18 19
0 1 x x x	0 1 x x x	0  1  x  x  x	0  1  x  x  x



GPU to GPU Latency
--------------------

Pairwise.
Communicator: Only n ranks per node perform this test in a separate "GPU" communicator, where n is the number of GPUs each node has.
All other ranks wait in Barrier. This may be overridden (see below).

Options:
Iterations: -k<iters>, --gpulati=<iters>. This will specify the number of iterations that the test will perform. The default is 4096.
Message Size: -K<size>, --gpulats=<size>. This will specify the message size for the test. The default is 0 bytes.
By GPU: -z, --bygpu. Rather than use the above communicator, multiple tests will be run, one for each GPU, where each test uses a
headrank communicator. E.g., every node will set GPU0 as their CUDA device and run a GPU0 - GPU0 test, then GPU1, etc.
Tolerance: uses latency tolerance.

This test is identical in design to the normal latency test, but utilizes CUDA GPUDirect RDMA to send and receive MPI messages.


GPU to Host Latency
--------------------

Pairwise.
Communicator: only one rank per node performs this test in a separate "headrank" communicator. One test for each GPU is run,
e.g. GPU0 to Host, GPU1 to Host, etc. All other ranks wait in Barrier.

This test is identical in design and options (except the bygpu option is not available) to the GPU-GPU latency test. The only
difference is that one of the partner ranks will allocate a host buffer rather than a device buffer.


GPU Neighbor Latency
--------------------


Options:
Iterations: -k<iters>, --gpulati=<iters>. This will specify the number of iterations that the test will perform. The default is 4096.
Message Size: -K<size>, --gpulats=<size>. This will specify the message size for the test. The default is 0 bytes.
Communicator: only one rank per node performs this test in a separate "headrank" communicator.

This is a more limited scope GPU test for a quick diagnosis of the GPUDirect RDMA capability. Ordered rings of size 2 will be
formed (0-1, 2-3, 4-5) and within each pair, all GPU combinations will be tested. For example, if node 0 and 1 have 4 GPUs each,
there will be 16 test results (4x4 GPUs) for that pair.


GPU to GPU Bandwidth
--------------------

Pairwise.
Communicator: Only n ranks per node perform this test in a separate "GPU" communicator, where n is the number of GPUs each node has.
All other ranks wait in Barrier. This may be overridden (see below).

Options:
Iterations: -a<iters>, --gpubwi=<iters>. This will specify the number of iterations that the test will perform. The default is 64.
Message Size: -A<size>, --gpubws=<size>. This will specify the message size for the test. The default is 1 Mbyte.
By GPU: -z, --bygpu. Rather than use the above communicator, multiple tests will be run, one for each GPU, where each test uses a
headrank communicator. E.g., every node will set GPU0 as their CUDA device and run a GPU0 - GPU0 test, then GPU1, etc.
Tolerance: uses bandwidth tolerance.

This test is identical in design to the normal bandwidth test, but utilizes CUDA GPUDirect RDMA to send and receive MPI messages.


GPU to Host Bandwidth
----------------------

Pairwise.
Communicator: only one rank per node performs this test in a separate "headrank" communicator. One test for each GPU is run,
e.g. GPU0 to Host, GPU1 to Host, etc. All other ranks wait in Barrier.

This test is identical in design and options (except the bygpu option is not available) to the GPU-GPU bandwidth test. The only
difference is that one of the partner ranks will allocate a host buffer rather than a device buffer.


GPU Neighbor Bandwidth
----------------------

Options:
Iterations: -a<iters>, --gpubwi=<iters>. This will specify the number of iterations that the test will perform. The default is 64.
Message Size: -A<size>, --gpubws=<size>. This will specify the message size for the test. The default is 1 Mbyte.
Communicator: only one rank per node performs this test in a separate "headrank" communicator.

This is a more limited scope GPU test for a quick diagnosis of the GPUDirect RDMA capability. Ordered rings of size 2 will be
formed (0-1, 2-3, 4-5) and within each pair, all GPU combinations will be tested. For example, if node 0 and 1 have 4 GPUs each,
there will be 16 test results (4x4 GPUs) for that pair.

The user interface will show only one value of the (n x n) matrix of GPU-to-GPU results.
A set of environment variables determines which data point in the matrix is displayed.
The mode can be worst, best, avg, or select; with select, the user specifies the (x,y) indices into the matrix.
If not otherwise set, the default is to show the 'worst' case, meaning the highest latency or the lowest bandwidth.

export GPU_NEIGHBOR_MODE=[WORST|BEST|AVG|SELECT]
export GPU_NEIGHBOR_DATA_X=3
export GPU_NEIGHBOR_DATA_Y=2

Note that all of the above also applies to the GPU Neighbor latency test.
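The selection logic implied by these variables can be sketched as follows (a hypothetical helper; environment-variable parsing omitted):

```python
def select_neighbor_value(matrix, mode, higher_is_better, x=0, y=0):
    """Pick the displayed value from the n x n GPU-to-GPU result matrix
    per GPU_NEIGHBOR_MODE: 'worst' means highest latency or lowest
    bandwidth, as described above. Sketch only."""
    flat = [v for row in matrix for v in row]
    if mode == "WORST":
        return min(flat) if higher_is_better else max(flat)
    if mode == "BEST":
        return max(flat) if higher_is_better else min(flat)
    if mode == "AVG":
        return sum(flat) / len(flat)
    if mode == "SELECT":
        return matrix[y][x]
    raise ValueError(mode)

m = [[1.0, 2.0], [3.0, 4.0]]  # hypothetical 2x2 latency results
worst_latency = select_neighbor_value(m, "WORST", higher_is_better=False)
```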


Bisectional bandwidth
----------------------

The bisectional bandwidth test measures total bandwidth between pairs of scopes as defined in the scope_info file (--scope_info
switch). All scopes should contain an equal number of nodes. Each node in a scope measures bandwidth to the corresponding node in
the other scope, and the sum of the bandwidths over all node pairs is reported. The test is performed between all pairs of scopes.

The bisectional bandwidth test is performed utilizing windows of nonblocking Isends and Irecvs. The default window size is 64, but will automatically reduce to the number
of iterations if iterations are lower than the window size. The iterations you input as an option will be divided by the window size to get the number of window iterations
that will be performed. Each partner posts a window of Isends, then posts a window of Irecvs, then Waitalls on all of them, and repeats.
This is the bidirectional bandwidth test.

Scripts for preparing scope_info and scope_order files are in scripts/scopes directory, see README.md there.

NCCL bandwidth
----------------------

Pairwise.
Communicator: Only n ranks per node perform this test in a separate "GPU" communicator, where n is the number of GPUs each node has.
All other ranks wait in Barrier.

Options:
Iterations: -a<iters>, --gpubwi=<iters>. This will specify the number of iterations that the test will perform. The default is 64.
Message Size: -A<size>, --gpubws=<size>. This will specify the message size for the test. The default is 1 Mbyte.
Tolerance: uses bandwidth tolerance.

The NCCL bandwidth test is similar to the GPU to GPU bandwidth test except that it uses NCCL (NVIDIA Collective Communications Library) for communication.

To achieve better performance, use the --mapper option of the clusterkit.sh script (see core_to_hca_dgx.sh below).

This test is currently of alpha quality; its results require further validation.

NCCL latency
----------------------

Pairwise.
Communicator: Only n ranks per node perform this test in a separate "GPU" communicator, where n is the number of GPUs each node has.
All other ranks wait in Barrier.

Options:
Iterations: -k<iters>, --gpulati=<iters>. This will specify the number of iterations that the test will perform. The default is 1024.
Message Size: -K<size>, --gpulats=<size>. This will specify the message size for the test. The default is 8 bytes.
Tolerance: uses latency tolerance.

The NCCL latency test is similar to the GPU to GPU latency test except that it uses NCCL (NVIDIA Collective Communications Library) for communication.

To achieve better performance, use the --mapper option of the clusterkit.sh script (see core_to_hca_dgx.sh below).

This test is currently of alpha quality; its results require further validation.


==============================================================================================

Additional Scripts
------------------


clusterkit.sh
-------------

The main run script.

./bin/clusterkit.sh [options] <parameters>

        Parameters:
                -f|--hostfile <hostfile>        File with newline separated hostnames to run tests on.
                -r|--hpcx_dir <path>            Path to HPCX installation root folder (or use env HPCX_DIR)

        Options:
                -p|--ppn <number>               Select number of processes per hostname (default: 1)
                -l|--hca_list "string"          Comma separated list of HCAs to use (default: autoselect)
                -t|--transport_list "string"    List of RDMA transports to use (rc,dc,ud) (default: autoselect best)
                -s|--ssh                        Use ssh for process launching (default: autoselect)
                -h|--help                       Show help message
                -n|--dry-run                    Dry run (do nothing, only print)

        Examples:
                % ./scripts/clusterkit.sh --ssh --hostfile hostfile.txt

                % ./scripts/clusterkit.sh --hca_list "mlx5_0:1,mlx5_2:1" --hostfile hostfile.txt



core_to_hca_dgx.sh
------------------
Testing machines with multiple HCAs

In order to test the performance between all combinations of HCAs in the target nodes,
it is necessary to run one process per HCA and to set each process's affinity to the same NUMA node as its HCA.

Example for DGX machines:
The mapper script is run by the mpirun command line and receives the executable and its parameters:

clusterkit.sh --hostfile dgx.nodes --hpcx_dir $hpcx_dir --ppn 4 --mapper core_to_hca_dgx.sh --transport_list dc --bycore
