"Horovod is a distributed training framework for TensorFlow, Keras, PyTorch, and MXNet. The goal of Horovod is to make distributed Deep Learning fast and easy to use. The primary motivation for this project is to make it easy to take a single-GPU TensorFlow program and successfully train it on many GPUs faster."
Quoted from the Horovod GitHub documentation.
Installation
Please follow the link for general instructions on installing Horovod for use with GPUs. The commands below assume a Bourne-type shell. If you use a C-type shell, the "source activate" command may not work; in that case, load all the modules, define any environment variables, then type "bash" and run the remaining commands from that shell.
Step 1: Install NCCL 2
Please download NCCL 2 from https://developer.nvidia.com/nccl and select the OS-agnostic local installer. The latest test of this recipe used NCCL 2.7.8 for CUDA 10.2 (July 24, 2020).
Add the NCCL library path to the LD_LIBRARY_PATH environment variable
$ export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:Path_to_nccl/nccl-<version>/lib
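The export above can be wrapped in a small guard. The following is a minimal sketch, not part of the original recipe: the function name add_nccl_to_ld_path is hypothetical, and it assumes the NCCL shared library files are named libnccl.so* inside the lib directory. It also avoids a leading ":" when LD_LIBRARY_PATH is empty, since an empty entry may be read by the dynamic loader as the current directory.

```shell
#!/bin/bash
# add_nccl_to_ld_path DIR   (hypothetical helper, for illustration only)
# Append DIR to LD_LIBRARY_PATH, but only if DIR holds a libnccl shared library.
add_nccl_to_ld_path() {
    local dir="$1"
    # compgen -G succeeds only if the glob matches at least one file.
    # Assumption: NCCL libraries are named libnccl.so, libnccl.so.2, and so on.
    if ! compgen -G "${dir}/libnccl.so*" >/dev/null; then
        echo "no libnccl.so* found in ${dir}" >&2
        return 1
    fi
    # ${VAR:+...} adds the ':' separator only when LD_LIBRARY_PATH is non-empty,
    # so an empty variable does not end up with a leading ':' entry.
    export LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+${LD_LIBRARY_PATH}:}${dir}"
}
```

To use it, call add_nccl_to_ld_path with the same lib directory given in the export command above; a non-zero return status suggests the path does not point at an unpacked NCCL installation.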
Step 2: Install the Horovod Python package
Load the Python module
module load python/3.6-conda5.2
Create a local Python environment for a Horovod installation with NCCL and activate it
conda create -n horovod-withnccl python=3.6 anaconda
source activate horovod-withnccl
Install a GPU version of TensorFlow or PyTorch
pip install https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.10.0-cp36-cp36m-linux_x86_64.whl
Load the MVAPICH2 and CUDA modules
module load gnu/7.3.0 mvapich2-gdr/2.3.4
module load cuda/10.2.89
Install the horovod python package
HOROVOD_NCCL_HOME=/path_to_nccl_home/ HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir horovod
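After the install finishes, it can help to confirm that Horovod was actually built with NCCL and MPI support. The sketch below is an illustration rather than part of the tested recipe. It relies on the horovodrun --check-build command, and it assumes that the command marks each built feature with a line such as "[X] NCCL"; that output format may differ between Horovod versions. The helper name has_feature is hypothetical.

```shell
#!/bin/bash
# has_feature NAME   (hypothetical helper, for illustration only)
# Read `horovodrun --check-build` style output on stdin and succeed only if
# NAME is marked as built. Assumption: built features look like "[X] NCCL".
has_feature() {
    grep -Eq "^[[:space:]]*\[X\][[:space:]]+${1}[[:space:]]*$"
}

# Query the real installation only when horovodrun is on the PATH,
# i.e. after the horovod-withnccl environment has been activated.
if command -v horovodrun >/dev/null 2>&1; then
    build_info=$(horovodrun --check-build 2>/dev/null || true)
    for feature in NCCL MPI; do
        if printf '%s\n' "$build_info" | has_feature "$feature"; then
            echo "$feature: built"
        else
            echo "$feature: missing"
        fi
    done
fi
```

If NCCL is reported as missing, the HOROVOD_NCCL_HOME path used in the pip command is a likely place to look first, and reinstalling with --no-cache-dir forces a rebuild.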
Testing
Please get the benchmark script, tf_cnn_benchmarks.py, from here. It can then be run with a batch script such as the following:
#!/bin/bash
#SBATCH --job-name R_ExampleJob
#SBATCH --nodes=2 --ntasks-per-node=48
#SBATCH --time=01:00:00
#SBATCH --account <account>

module load python/3.6-conda5.2
module load cuda/10.2.89
module load gnu/7.3.0
module load mvapich2-gdr/2.3.4
source activate horovod-withnccl
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/path_to_nccl_home/lib
mpiexec -ppn 1 -binding none -env NCCL_DEBUG=INFO python tf_cnn_benchmarks.py
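Because the job sets NCCL_DEBUG=INFO, NCCL writes diagnostic lines tagged "NCCL INFO" into the job output when it is initialized. A quick way to check that the run really went through NCCL is to count those lines once the job has finished. This is a minimal sketch under stated assumptions: the function name nccl_log_summary is hypothetical, and the log file is whatever file receives the job's output (with no --output option, Slurm normally writes slurm-<jobid>.out in the submission directory).

```shell
#!/bin/bash
# nccl_log_summary FILE   (hypothetical helper, for illustration only)
# Print the number of lines in FILE that carry the "NCCL INFO" tag.
# Returns success only if at least one such line was found.
nccl_log_summary() {
    local log="$1" count
    # grep -c prints the match count; it exits non-zero when there are no
    # matches or the file is unreadable, so '|| true' keeps the script going.
    count=$(grep -c "NCCL INFO" "$log" 2>/dev/null || true)
    # An unreadable file yields an empty string; treat that as zero matches.
    count=${count:-0}
    echo "$count"
    [ "$count" -gt 0 ]
}
```

A count of zero does not by itself prove a failure, but it suggests that the allreduce operations did not use NCCL, so the library path and the Horovod build options would be worth rechecking.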
Feel free to contact OSC Help if you have any issues with installation.
Publisher/Vendor/Repository and License Type
https://eng.uber.com/horovod/, Open source