The NVIDIA Collective Communications Library (NCCL) provides communication APIs for low-latency and high-bandwidth collectives, enabling AI workloads to scale from just a few GPUs on a single host to thousands of GPUs in a data center. This post discusses NCCL features that support run-time rescaling for cost optimization, as well as minimizing service downtime from faults by dynamically removing faulted workers.
Enabling scalable AI with NCCL
NCCL was introduced in 2015 to accelerate AI training by enabling multiple GPUs to train a model together. Over the following decade, training workloads expanded to thousands of GPUs, and new models continue to grow in size and complexity.
Today, both training and inference workloads rely on multi-GPU collectives that combine data parallelism, tensor parallelism, and expert parallelism to meet latency and throughput goals. NCCL collectives continue to form the communication backbone for these strategies, synchronizing computation across multiple workers (known as ranks) within a communicator.
Typically, a deep learning framework will perform a single initialization step at launch time to determine data sharding and assign each GPU its specific tasks across multiple dimensions of parallelism. However, as model sizes and the parallelism requirements of inference engines increase, dynamically reallocating resources at runtime becomes attractive for minimizing operational footprint.
A dynamically scalable inference engine can respond to increased user traffic by allocating additional GPUs and spreading the work across them, or relinquishing excess GPUs when traffic is low in order to optimize cost. These are examples of planned scaling events in which all parts of the system are working as designed. We’ll show that this pattern is useful for fault tolerance as well.
Figure 1. An inference cluster experiences increased traffic, which may impact response latency. The framework allocates two additional workers which join the communicator to share the load
How NCCL communicators enable dynamic application scaling
NCCL communicators were heavily inspired by MPI communicators. However, NCCL introduced important differences and new concepts to enable dynamic application scaling.
NCCL communicators can be created from scratch by the application at any point during execution by passing a uniqueId to ncclCommInitRank. In contrast, MPI creates a special communicator called MPI_COMM_WORLD during initialization, and all other communicators are subsets created with MPI_Comm_split. NCCL communicators can be configured to be non-blocking so that initialization functions may continue in the background. In NCCL, the application chooses the assignment of ranks to communicator members, allowing applications to optimize the communicator layout.

Once a communicator is created, its set of members (ranks) is considered immutable. Therefore, an NCCL application performing a scale-up operation executes a sequence much like a second initialization. A new uniqueId is obtained and shared across all ranks, which pass it to ncclCommInitRank. An optimized application may enable non-blocking mode to let the initialization work proceed in the background while continuing to process requests using the old communicator until the new one is ready.
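The scale-up sequence can be sketched as follows. This is a minimal sketch, not a complete implementation: `exchangeUniqueId` is a hypothetical stand-in for whatever out-of-band mechanism (sockets, MPI, a key-value store) the application uses to distribute the ID to all ranks.

```c
// Sketch of a scale-up: every member of the enlarged group, old and new,
// joins a freshly created communicator.
#include <nccl.h>

void exchangeUniqueId(ncclUniqueId* id, int myRank);  // hypothetical out-of-band broadcast

ncclComm_t scaleUp(int myRank, int newWorldSize) {
    ncclUniqueId id;
    if (myRank == 0) {
        ncclGetUniqueId(&id);        // one rank creates the ID...
    }
    exchangeUniqueId(&id, myRank);   // ...and shares it with everyone else

    // Non-blocking mode lets initialization proceed in the background while
    // the old communicator keeps serving requests.
    ncclConfig_t config = NCCL_CONFIG_INITIALIZER;
    config.blocking = 0;

    ncclComm_t newComm;
    ncclCommInitRankConfig(&newComm, newWorldSize, id, myRank, &config);

    // Poll until the background initialization completes.
    ncclResult_t state;
    do {
        ncclCommGetAsyncError(newComm, &state);
    } while (state == ncclInProgress);
    return newComm;
}
```

The application would switch traffic to the returned communicator and destroy the old one once its in-flight operations have drained.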
Similarly, a scale-down can be implemented the same way using ncclCommInitRank, or the application can call ncclCommShrink, which has been optimized to reduce initialization time by reusing rank information from the old communicator. This optimization is particularly useful for very large communicators, but it also provides a simplified API at any scale.
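A planned scale-down with ncclCommShrink might look like the following sketch, assuming the NCCL 2.27 signature that takes the old communicator, the list of ranks to exclude, an optional config, and a flags argument:

```c
// Sketch of a planned scale-down: surviving ranks call ncclCommShrink with
// the list of ranks to drop. No new ncclUniqueId is needed.
#include <stddef.h>
#include <nccl.h>

ncclComm_t scaleDown(ncclComm_t oldComm, int* excludeRanks, int excludeCount) {
    ncclComm_t newComm;
    // NCCL_SHRINK_DEFAULT: the old communicator is healthy, so in-flight
    // operations are allowed to complete normally.
    ncclCommShrink(oldComm, excludeRanks, excludeCount, &newComm,
                   /*config=*/NULL, NCCL_SHRINK_DEFAULT);
    ncclCommDestroy(oldComm);   // release the old communicator once drained
    return newComm;
}
```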
Fault-tolerant NCCL applications
Fault detection, attribution, and mitigation encompass a complex topic that spans the entire application stack from physical layers up to application layers. To learn more about faults and checkpoint recovery, see Ensuring Reliable Model Training on NVIDIA DGX Cloud. To learn more about observability and fault-tolerance improvements in Dynamo 0.4, see Dynamo 0.4 Delivers 4x Faster Performance, SLO-Based Autoscaling, and Real-Time Observability.
In addition to traditional fault-mitigation techniques such as checkpointing and load balancing, NCCL communicators can be dynamically resized after a fault, allowing recovery within the application without fully restarting the workload.
Popular methods for deploying inference workloads (such as Kubernetes) already provide mechanisms for relaunching replacement workers, but the application must also initiate fault-mitigation steps for the NCCL communicator. Recovering from a fault contained to a subset of ranks is similar to a scale-down procedure in which those ranks are removed from the communicator.
The difference is that even healthy ranks should expect NCCL to either return an error or hang on any collective operation. Typical recovery for the healthy ranks starts with ncclCommAbort on the existing communicator, followed by ncclCommInitRank to form a new communicator with the surviving ranks.
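This abort-and-reinitialize path can be sketched as below. The helper `getNewUniqueId`, along with the survivor count and the rank's new index, are hypothetical values supplied by the framework's coordination layer:

```c
// Sketch of fault recovery on the healthy ranks: abort the wedged
// communicator, then re-initialize with only the survivors.
#include <nccl.h>

ncclUniqueId getNewUniqueId(int myNewRank);  // hypothetical framework helper

ncclComm_t recover(ncclComm_t badComm, int myNewRank, int survivorCount) {
    // Cancel outstanding operations that may be hung on the failed peer.
    ncclCommAbort(badComm);

    // Form a fresh communicator containing only the surviving ranks.
    ncclUniqueId id = getNewUniqueId(myNewRank);
    ncclComm_t newComm;
    ncclCommInitRank(&newComm, survivorCount, id, myNewRank);
    return newComm;
}
```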
Figure 2. Faulted workers prevent inference from being completed. Fault mitigation removes the workers, and allows the healthy workers to continue accepting requests
NCCL 2.27 introduced ncclCommShrink, which optimizes and simplifies this recovery process. When passed the NCCL_SHRINK_ABORT flag and a list of ranks to exclude, ncclCommShrink cancels any hung operations and creates a new communicator without the need to call ncclGetUniqueId or ncclCommInitRank.
Dynamic-scaling and fault-tolerant application example
Using these concepts, you can build a simple example of an NCCL application that responds to scaling requests from the framework:
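The following is a condensed sketch of the worker structure described below, not the full example. The framework interface (`FrameworkRequest`, `pollFramework`, `exchangeUniqueId`, `executePrefillAndDecode`) is a hypothetical stand-in; the NCCL calls follow the real API.

```c
// Condensed sketch of a dynamically scalable, fault-tolerant inference worker.
#include <stdbool.h>
#include <stddef.h>
#include <nccl.h>

typedef struct FrameworkRequest FrameworkRequest;      // hypothetical framework types
FrameworkRequest* pollFramework(void);
bool requestWantsScaleUp(const FrameworkRequest* req);
bool requestWantsShrink(const FrameworkRequest* req);
void exchangeUniqueId(ncclUniqueId* id, int myRank);   // hypothetical out-of-band share
void executePrefillAndDecode(ncclComm_t comm, FrameworkRequest* req);

static ncclComm_t comm;   // the worker's active communicator

// Planned resize, or fault recovery via full re-initialization.
void scaleCommunicator(int myRank, int worldSize, bool abortFirst) {
    if (abortFirst) ncclCommAbort(comm);    // fault path: cancel hung operations
    else            ncclCommDestroy(comm);  // planned path: drain and free
    ncclUniqueId id;
    if (myRank == 0) ncclGetUniqueId(&id);
    exchangeUniqueId(&id, myRank);
    ncclCommInitRank(&comm, worldSize, id, myRank);
}

// Optimized scale-down (or fault removal) that reuses the old communicator.
void shrinkCommunicator(int* excludeRanks, int excludeCount, bool isFault) {
    ncclComm_t newComm;
    // NCCL_SHRINK_ABORT cancels hung operations internally on the fault path.
    ncclCommShrink(comm, excludeRanks, excludeCount, &newComm, NULL,
                   isFault ? NCCL_SHRINK_ABORT : NCCL_SHRINK_DEFAULT);
    comm = newComm;
}

void mainLoop(int myRank, int myWorldSize) {
    for (;;) {
        FrameworkRequest* req = pollFramework();
        // Scaling requests are delivered synchronously to all workers, so
        // every rank enters the matching resize path together.
        if (requestWantsScaleUp(req))
            scaleCommunicator(myRank, myWorldSize, /*abortFirst=*/false);
        else if (requestWantsShrink(req))
            shrinkCommunicator(/*excludeRanks=*/NULL, 0, /*isFault=*/false);
        executePrefillAndDecode(comm, req);   // collective inference step
    }
}
```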
This example is modeled on a distributed inference application and demonstrates how a framework can direct workers to perform scale-up or scale-down operations. The core logic is captured in two key functions: scaleCommunicator and shrinkCommunicator. These are invoked by the framework as needed. The primary inference work is handled by executePrefillAndDecode, which uses an active communicator that can be replaced over the worker’s lifetime.
The application is built around a central mainLoop that represents the continuous work of an inference worker. On each iteration, the worker gets new tasks from the framework and checks for a scalingFlag that signals whether a resizing operation should occur. The framework ensures that these scaling requests are delivered synchronously to all workers. In the event of a fault, a worker will either time out or receive an error from NCCL. In either scenario, the exception-handling path notifies the framework, prompting fault recovery to begin.
Coordinated actions among workers require a central monitoring component, which we can call an Application Monitor. This component is typically responsible for tracking worker health, traffic load, and request latency. Based on these metrics, the Application Monitor signals the workers when to scale the pool up or down.
To handle increased traffic, for example, the Application Monitor identifies available GPUs, launches new worker processes, and then sets the scaling flag to signal the existing workers to expand the communicator. The scaleCommunicator function manages this process, where workers coordinate to establish the new communicator size and share the required ncclUniqueId.
Conversely, when traffic subsides, the Application Monitor signals a scale-down, identifying which ranks should be removed. For this specific case, the shrinkCommunicator function provides an optimized path using ncclCommShrink, a simplified interface that does not require generating and distributing a new ncclUniqueId. Once ranks exit, their underlying GPU resources can be released back to the cluster’s allocation system or cloud provider.
Finally, both scaleCommunicator and shrinkCommunicator are equipped to handle fault recovery. Once the Application Monitor identifies a faulted component, it can direct the healthy workers to remove it by invoking the Abort path of either function. These paths take extra steps—calling ncclCommAbort or setting the NCCL_SHRINK_ABORT flag—to ensure that the active communicator does not hang while waiting for a peer that has failed.
Get started with scalable and fault-tolerant NCCL applications
NCCL support for dynamic communicators provides a powerful tool for building modern, resilient AI infrastructure. By moving beyond a static, launch-time configuration, you can create applications that adapt to changing workloads and can be optimized for efficiency and cost.
In addition, with the ability to call ncclCommAbort or ncclCommShrink, handling unexpected hardware or software faults is possible without a full abort and restart. Build your next multi-GPU application with these dynamic capabilities to create a scalable and fault-tolerant system. Download the latest NCCL release or use a pre-built container, such as the PyTorch NGC Container.