Distributed Training on Multiple Servers with NVIDIA GPUs and Hugging Face Trainer



As a machine learning enthusiast, you’re probably no stranger to the arduous process of training models. The time and computational resources required to achieve the ideal model can be significant obstacles. But what if there were a way to speed up the process without relying on a single powerful GPU server? Enter distributed training with multiple GPUs. In this tutorial, we’ll show you how to use multiple servers with NVIDIA GPUs to train your models faster than ever before using the NVIDIA Collective Communications Library (NCCL) and the Hugging Face Trainer. Whether you’re a seasoned ML expert or just starting out, this post will provide valuable insights and practical tips to help you optimize your training process and take your models to the next level. Get ready to dive into the world of distributed training and unlock the power of multiple GPUs!

Prerequisites:

  * Python
  * NVIDIA GPUs (with CUDA support)
  * PyTorch

The NVIDIA Collective Communications Library (NCCL)

The NVIDIA Collective Communications Library (NCCL) is a backend optimized for NVIDIA GPUs and is commonly used for distributed training with PyTorch.

Please note: the NCCL backend is only applicable when used with CUDA (NVIDIA GPUs).

Other distributed training backends include Gloo and MPI.
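To make the backend choice concrete, here is a minimal sketch of how a PyTorch script would select NCCL when initializing its process group. This is for illustration only; the Hugging Face Trainer used later in this tutorial performs this initialization for you.

import torch.distributed as dist

# "nccl" requires CUDA-capable NVIDIA GPUs; "gloo" also works on CPU.
# With no init_method given, the MASTER_ADDR, MASTER_PORT, RANK and
# WORLD_SIZE environment variables (set by torchrun) are used.
dist.init_process_group(backend="nccl")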

Installing NCCL

  1. Visit the NCCL download page.

  2. Download the .tar.xz (txz) file, which is OS-agnostic.

  3. Extract the file. For NCCL 2.17.1 and CUDA 11.x, the tar file name is nccl_2.17.1-1+cuda11.0_x86_64.txz.

     tar -xvf nccl_2.17.1-1+cuda11.0_x86_64.txz
    
  4. Copy the extracted files to /usr/local/nccl-2.1

    sudo mkdir -p /usr/local/nccl-2.1
    sudo cp -vRf nccl_2.17.1-1+cuda11.0_x86_64/* /usr/local/nccl-2.1
    
  5. Update LD_LIBRARY_PATH to include the NCCL library

     cd ~
     echo 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/nccl-2.1/lib' >> ~/.bashrc
     source .bashrc
    
  6. Create a symbolic link to the header file

     sudo ln -s /usr/local/nccl-2.1/include/nccl.h /usr/include/nccl.h
    
  7. Specify the network interface that should be used by NCCL

    You can use the ‘ifconfig’ command to find the name of your network interface (eth0 in this example).

     echo 'export NCCL_SOCKET_IFNAME=eth0' >> ~/.bashrc
    
  8. Update the /etc/hosts file with the IP addresses of the GPU instances

     sudo vi /etc/hosts
    
     192.1.2.3 gpu-instance-1
     192.1.2.4 gpu-instance-2
    
  9. Set environment variables for each instance

    Instance 1:

     export MASTER_ADDR=gpu-instance-1
     export MASTER_PORT=9494
     export LOCAL_RANK=0
     export HOST_NODE_ADDR=192.1.2.3:29400
     export WORLD_SIZE=2
    

    Instance 2:

     export MASTER_ADDR=gpu-instance-1
     export MASTER_PORT=9494
     export LOCAL_RANK=0
     export HOST_NODE_ADDR=192.1.2.3:29400
     export WORLD_SIZE=2
    
  10. Open the necessary ports so the instances can reach each other. In this setup, that means the ports used above (9494 for MASTER_PORT and 29400 for the rendezvous endpoint) must be open between the two instances.

  11. Start training on each instance using torchrun

    Please note that torch.distributed.launch is deprecated.

    Instance 1:

    torchrun --nnodes=2 --nproc_per_node=1 --node_rank=0 --rdzv_id=456 --rdzv_endpoint=$HOST_NODE_ADDR trainer.py
    

    Instance 2:

    torchrun --nnodes=2 --nproc_per_node=1 --node_rank=1 --rdzv_id=456 --rdzv_endpoint=$HOST_NODE_ADDR trainer.py
    
    

For more information on using torchrun, refer to the PyTorch Elastic Run documentation
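Before launching the full training job, it can be useful to confirm that the two instances can actually communicate over NCCL. The following is an optional sanity-check script (named check_nccl.py here purely for this example); run it on both instances with the same torchrun commands shown above, substituting check_nccl.py for the training script.

import os

import torch
import torch.distributed as dist

# torchrun sets RANK, LOCAL_RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT,
# so init_process_group can read everything it needs from the environment.
dist.init_process_group(backend="nccl")

local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Each rank contributes its global rank; after the all_reduce every rank
# should print the sum 0 + 1 + ... + (world_size - 1).
tensor = torch.tensor([float(dist.get_rank())], device="cuda")
dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce result = {tensor.item()}")

dist.destroy_process_group()

If each instance prints the expected sum (1.0 for two single-GPU nodes), NCCL communication is working and you can move on to the actual training script.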

Now let us look at a sample trainer.py script that uses the Hugging Face Trainer for distributed training. We will use the BERT model for a simple text classification task as an example.

First, install the required packages:

pip install transformers datasets

trainer.py

import os
import numpy as np
from transformers import BertTokenizerFast, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
from datasets import load_dataset

def main():
    # Load the AG News dataset (train and test splits)
    dataset = load_dataset("ag_news")
    labels = dataset["train"].features["label"].names

    # Tokenize the dataset
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    dataset = dataset.map(lambda e: tokenizer(e["text"], truncation=True, padding="max_length", max_length=512), batched=True)
    dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])

    # Load the pre-trained BERT model for sequence classification
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=len(labels))

    # Define training arguments
    training_args = TrainingArguments(
        output_dir="./results",
        num_train_epochs=3,
        per_device_train_batch_size=8,
        evaluation_strategy="epoch",
        save_strategy="epoch",
        logging_dir="./logs",
        logging_strategy="epoch",
        report_to="none",
        fp16=True,
        load_best_model_at_end=True,
        metric_for_best_model="accuracy",
        greater_is_better=True,
        gradient_accumulation_steps=1,
        dataloader_num_workers=4,
        run_name="distributed-training-sample",
        local_rank=int(os.environ.get("LOCAL_RANK", -1)),
    )

    # Define the metric for evaluation (logits and labels arrive as NumPy arrays)
    def compute_metrics(eval_pred):
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        return {"accuracy": (predictions == labels).mean().item()}

    # Create the Trainer
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=dataset["train"],
        eval_dataset=dataset["test"],
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )

    # Train the model
    trainer.train()

if __name__ == "__main__":
    main()

In this trainer.py script, we use the Hugging Face Trainer to train a BERT model on the AG News dataset for text classification. The script first loads and tokenizes the train and test splits, then initializes a pre-trained BERT model for sequence classification. The training arguments are set up, including the local rank for distributed training. The script defines a custom metric (accuracy) for evaluation, creates a Trainer instance with the test split as the evaluation set, and finally trains the model.

The provided trainer.py script can be used with torchrun and NCCL for distributed training. It already accounts for the local rank through the LOCAL_RANK environment variable, so you can start distributed training with the torchrun commands described in the previous steps. That launches the training process on both instances, and the Hugging Face Trainer in the trainer.py script handles the necessary setup and synchronization for distributed training, using NCCL as the backend.
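One practical note: torchrun sets the RANK, LOCAL_RANK, and WORLD_SIZE environment variables for every process it launches. If you add your own print statements or extra logging to trainer.py, you may want to restrict them to a single process so they are not repeated once per GPU; a small optional sketch:

import os

# Rank 0 is conventionally treated as the "main" process across all nodes.
is_main_process = int(os.environ.get("RANK", 0)) == 0

if is_main_process:
    print("This message appears once per training job, not once per process.")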

Once you’ve set up the environment and started the distributed training using the trainer.py script and torchrun, you can expect to see output similar to the following on each instance:

Epoch 1/3: 100%|██████████| 15000/15000 [1:30:45<00:00, 2.75it/s, loss=0.271, accuracy=0.904]
Saving model checkpoint to ./results/checkpoint-15000
Configuration saved in ./results/checkpoint-15000/config.json
Model weights saved in ./results/checkpoint-15000/pytorch_model.bin
Epoch 2/3: 100%|██████████| 15000/15000 [1:30:45<00:00, 2.75it/s, loss=0.137, accuracy=0.951]
Saving model checkpoint to ./results/checkpoint-30000
Configuration saved in ./results/checkpoint-30000/config.json
Model weights saved in ./results/checkpoint-30000/pytorch_model.bin
Epoch 3/3: 100%|██████████| 15000/15000 [1:30:45<00:00, 2.75it/s, loss=0.091, accuracy=0.969]
Saving model checkpoint to ./results/checkpoint-45000
Configuration saved in ./results/checkpoint-45000/config.json
Model weights saved in ./results/checkpoint-45000/pytorch_model.bin

The training progress will be displayed as the training proceeds through the epochs. The loss and accuracy metrics will be logged, and model checkpoints will be saved to the ./results directory after each epoch.

What can you do next:

  1. Evaluate the model: Once training is complete, you can load the best checkpoint (in this case, the one with the highest accuracy) and evaluate the model on the validation or test dataset. You can modify the trainer.py script to add the evaluation code or create a separate script for evaluation (see the sketch after this list).

  2. Fine-tune hyperparameters: You can experiment with different hyperparameters, such as learning rate, batch size, and number of training epochs, to improve the model’s performance.

  3. Try different models: You can experiment with different pre-trained models available in the Hugging Face Transformers library, such as GPT-2, RoBERTa, or DistilBERT, for your specific task.

  4. Scale up distributed training: You can further scale up your distributed training by increasing the number of instances or by using multiple GPUs per instance. Adjust the torchrun command’s --nnodes and --nproc_per_node arguments accordingly.

  5. Monitor training: To monitor the training process more effectively, you can use tools like TensorBoard or Weights & Biases. These tools can help you visualize training metrics and keep track of experiments.
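As an example of the first suggestion above, here is a minimal sketch of a standalone evaluation script. It assumes a saved checkpoint directory such as ./results/checkpoint-45000 (the exact path depends on your run) and mirrors the tokenization used in trainer.py:

import numpy as np
from transformers import BertTokenizerFast, BertForSequenceClassification, Trainer
from datasets import load_dataset

# Path to a checkpoint produced during training; point this at your best checkpoint.
checkpoint = "./results/checkpoint-45000"

# Load and tokenize the AG News test split, mirroring trainer.py.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
test_dataset = load_dataset("ag_news", split="test")
test_dataset = test_dataset.map(
    lambda e: tokenizer(e["text"], truncation=True, padding="max_length", max_length=512),
    batched=True,
)
test_dataset.set_format("torch", columns=["input_ids", "attention_mask", "label"])

# Load the fine-tuned weights from the checkpoint.
model = BertForSequenceClassification.from_pretrained(checkpoint)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": (predictions == labels).mean().item()}

# Creating a Trainer without explicit TrainingArguments falls back to the
# default arguments, which is enough for a quick evaluation pass.
trainer = Trainer(model=model, compute_metrics=compute_metrics)
print(trainer.evaluate(eval_dataset=test_dataset))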

Now let us look at some of the pros and cons of distributed training.

Advantages of distributed training:

  1. Faster training: Distributed training allows you to train your model on multiple GPUs or multiple machines with GPUs, which can significantly reduce the time required to complete the training process, especially for large models and datasets.

  2. Scalability: Distributed training enables you to scale up your training process by adding more resources, such as GPU instances or multiple GPUs per instance, to accommodate larger models and datasets.

  3. Fault tolerance: Some distributed training frameworks provide fault tolerance, which means that if a node fails during training, the training process can continue on the remaining nodes without significant disruption.

  4. Resource utilization: With distributed training, you can better utilize the available computational resources, such as GPUs, across multiple machines. This can lead to more efficient use of resources and reduced idle times.

  5. Collaborative learning: In some cases, distributed training allows multiple parties to collaboratively train a model while keeping their data private, thus enabling collaborative learning without compromising data privacy.

Disadvantages of distributed training:

  1. Complexity: Setting up distributed training can be more complicated than training on a single machine, as it requires additional setup, configuration, and management of multiple instances, networking, and synchronization.

  2. Communication overhead: Distributing the training process across multiple machines introduces communication overhead, as the machines need to exchange gradients, weights, or other information during training. This overhead can negatively impact the overall training speed and efficiency. In addition, in a restricted network with strict security rules and firewalls, it can be difficult to establish the required communication between nodes.

  3. Diminishing returns: Although distributed training can speed up the training process, there might be diminishing returns as you add more GPUs or nodes. The increased communication overhead and synchronization costs can counteract the benefits of parallelization.

  4. Hardware requirements: To fully take advantage of distributed training, you need access to multiple machines with GPUs or other accelerators, which can be costly and may not be readily available for all users.

  5. Debugging and monitoring: Debugging and monitoring the training process can be more challenging in a distributed setting compared to a single machine, as issues may arise due to synchronization, network latency, or other distributed training-related factors.

In conclusion, distributed training is a powerful technique for accelerating the training process of deep learning models, particularly for large models and datasets. By leveraging multiple GPUs or multiple machines with GPUs, you can significantly reduce training time, improve resource utilization, and enable collaborative learning while preserving data privacy. Despite the challenges posed by increased complexity, communication overhead, and hardware requirements, distributed training remains an essential tool in the field of deep learning.

In this tutorial, we demonstrated how to set up distributed training using the Hugging Face Trainer, NVIDIA Collective Communications Library (NCCL), and PyTorch’s torchrun for a text classification task. By following these steps and understanding the advantages and disadvantages of distributed training, you can adapt this approach to your specific use cases and harness the full potential of distributed training for your deep learning projects. As you continue to explore distributed training, consider experimenting with different models, hyperparameters, and monitoring tools to further enhance your model’s performance and gain valuable insights into the training process.