
Optimization of collective communications in distributed computing
How NVIDIA SHARP improves the efficiency of network operations for AI and scientific computing
Isabella V, 27 October 2024


The evolution of collective communication technologies, such as NVIDIA SHARP, has revolutionized distributed computing, optimizing network operations for AI and scientific computing. Performance has improved dramatically, enabling unprecedented scalability.

Key Points:

  • NVIDIA SHARP has transformed collective communication in distributed computing, improving efficiency.
  • SHARP reduces the data load on servers by offloading operations to network switches.
  • Later versions of SHARP have increased scalability and support for complex AI workloads.
  • SHARPv4 will bring significant innovations in collective communications for AI training applications.

In today’s AI and scientific computing landscape, applications and complex computations present challenges that require innovative approaches to distributed computing. As workloads exceed the capabilities of a single machine, it becomes critical to break them into parallel tasks distributed across a large number of compute engines, such as CPUs and GPUs. This approach makes it possible to handle massive workloads, including training data and model parameters, while optimizing performance.

The key to effective scaling lies in communication between nodes, which must frequently share critical information, such as the gradients computed during the backpropagation phase of model training. These exchanges rely on collective communications, such as all-reduce, broadcast, gather, and scatter, to ensure that model parameters stay synchronized and converge properly. The efficiency of these operations is crucial: handled ineffectively, they become bottlenecks that compromise the scalability of the whole system.
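As a purely illustrative sketch (not taken from the article), the following Python fragment shows what the gradient all-reduce described above looks like in practice, assuming PyTorch’s torch.distributed package and a torchrun-style launch; the model and tensor sizes are arbitrary placeholders.

```python
import os
import torch
import torch.distributed as dist

def sync_gradients(model: torch.nn.Module):
    """Average gradients across all ranks with an all-reduce.

    This is the collective step described in the article: after backpropagation,
    every worker holds local gradients that must be reduced so the model
    parameters stay synchronized across ranks.
    """
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum gradients from all ranks, then divide to get the average.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size

if __name__ == "__main__":
    # Assumes a launch such as `torchrun --nproc_per_node=N this_script.py`,
    # which sets RANK, WORLD_SIZE and MASTER_ADDR/MASTER_PORT for us.
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    model = torch.nn.Linear(1024, 1024)
    if torch.cuda.is_available():
        model = model.cuda(int(os.environ.get("LOCAL_RANK", 0)))
    device = next(model.parameters()).device
    loss = model(torch.randn(32, 1024, device=device)).sum()
    loss.backward()
    sync_gradients(model)  # the all-reduce of gradients
    dist.destroy_process_group()
```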

Bottlenecks arise from several factors, including latency and bandwidth limitations, which determine how quickly nodes can exchange data. As the scale of the system increases, the volume of data to be transferred grows, and communication time begins to exceed computation time. In addition, synchronization overhead can slow down the entire system, since the slowest nodes delay everyone else. This is compounded by network contention, which increases as more nodes communicate at the same time, putting further pressure on bandwidth and other shared resources.
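To make this concrete, here is a small back-of-envelope sketch, my own illustration rather than anything from the article, using the standard cost model for a ring all-reduce: each of N workers moves roughly 2(N-1)/N times the message size and pays a per-step latency that grows with N. The bandwidth and latency figures below are hypothetical.

```python
def ring_allreduce_time(message_bytes: float, n_workers: int,
                        bandwidth_gbps: float, latency_us: float) -> float:
    """Estimate ring all-reduce time in milliseconds.

    Standard cost model: 2*(N-1) steps, each moving message_bytes/N,
    so per-worker traffic approaches twice the message size as N grows,
    while the latency term grows linearly with N.
    """
    steps = 2 * (n_workers - 1)
    bytes_per_step = message_bytes / n_workers
    bandwidth_bytes_per_s = bandwidth_gbps * 1e9 / 8
    time_s = steps * (latency_us * 1e-6 + bytes_per_step / bandwidth_bytes_per_s)
    return time_s * 1e3

if __name__ == "__main__":
    # Hypothetical numbers: 1 GB of gradients, 100 Gb/s links, 5 us per hop.
    for n in (8, 64, 512):
        print(f"{n:4d} workers: {ring_allreduce_time(1e9, n, 100.0, 5.0):8.1f} ms")
```

The communication term barely shrinks as workers are added, while the per-worker computation does, which is exactly why collectives come to dominate at scale.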

To address these challenges, advances in both network technologies and algorithmic optimizations are needed. It is in this context that NVIDIA SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) emerges, a protocol designed to improve collective communications in distributed systems. This approach, known as in-network computing, shifts the execution of collective operations from the servers’ processors to the network switches, significantly reducing the data that must traverse the network and minimizing latency variation, commonly known as jitter.

The first generation of SHARP was introduced with NVIDIA InfiniBand EDR networks at 100 Gb/s, focusing on small reduction operations, and found immediate support in Message Passing Interface (MPI) libraries. Tests conducted on the Texas Advanced Computing Center’s Frontera supercomputer demonstrated improved performance, with up to a nine-fold increase for some collective operations. With the advent of the second generation, SHARPv2 expanded its capabilities, introducing support for AI workloads and further improving scalability with reduction operations on larger messages.
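For readers who want to see the kind of call that first-generation SHARP accelerates, here is a minimal MPI all-reduce sketch using mpi4py. Whether the reduction is actually offloaded to the switches depends on the MPI stack and fabric configuration (for example, NVIDIA HPC-X settings), which are deployment-specific assumptions not covered by the article.

```python
# Minimal MPI all-reduce via mpi4py. The SHARP offload itself is configured
# in the underlying MPI/collectives stack, not in this code.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank contributes a small vector; small reductions like this are the
# case the first SHARP generation targeted.
local = np.full(8, rank, dtype=np.float64)
result = np.empty_like(local)
comm.Allreduce(local, result, op=MPI.SUM)

if rank == 0:
    print("sum over ranks:", result)
```

Launched, for example, with `mpirun -np 4 python allreduce_demo.py` (the script name is a placeholder).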

SHARPv2 offered significant benefits, as evidenced by performance in training models such as BERT, where there was a 17 percent improvement over previous versions. With the introduction of SHARPv3, the technology took another step forward, supporting multiple AI workloads in parallel, dramatically reducing latency in AllReduce operations. This has also been demonstrated in cloud computing contexts, showing significant performance benefits.

Today, SHARP’s integration with the NVIDIA Collective Communication Library (NCCL) has created a powerful synergy for distributed deep learning workloads, eliminating the need to continuously copy data between GPUs and improving overall efficiency. With the arrival of SHARPv4, the technology is poised to implement new algorithms to support an even wider range of collective communications, marking a new era in AI training applications.
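As a hedged configuration sketch, the snippet below shows how a job might opt in to NCCL’s in-network (CollNet/SHARP) path. NCCL_COLLNET_ENABLE and NCCL_DEBUG are documented NCCL environment variables, but whether SHARP is actually engaged depends on the InfiniBand fabric, the switch configuration and the installed SHARP plugin, none of which this sketch can verify.

```python
# Sketch: requesting NCCL's in-network collectives before setting up the
# process group. Assumes a torchrun-style launch that provides the usual
# rendezvous environment variables.
import os
import torch.distributed as dist

os.environ.setdefault("NCCL_COLLNET_ENABLE", "1")  # request CollNet/SHARP collectives
os.environ.setdefault("NCCL_DEBUG", "INFO")        # log which algorithm NCCL actually picks

dist.init_process_group(backend="nccl")
# ... training loop with all-reduce as usual; NCCL decides per collective
# whether to use the in-network offload or fall back to ring/tree algorithms.
dist.destroy_process_group()
```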

SHARP is not just a technology: it is a driving force behind progress and competitiveness in scientific computing and AI.