Dynamic Loading in the CUDA Runtime

Historically, GPU device code has been compiled alongside the application with offline tools such as nvcc. In this case, the GPU device code is managed internally by the CUDA runtime: you launch kernels with the <<<>>> syntax, and the CUDA runtime ensures that the invoked kernel is loaded and launched.

However, in some cases, GPU device code needs to be dynamically compiled and loaded. This post shows a way to achieve this using the CUDA runtime and also demonstrates a way to achieve interoperability between the CUDA driver and CUDA runtime kernel handles.

In CUDA 12.0, NVIDIA introduced the cuLibraryLoad APIs, available through the CUDA driver. These APIs enable you to dynamically select and load GPU device code in a context-independent way. For more information, see CUDA Context-Independent Module Loading.

We are now extending this functionality to include dynamic GPU device code loading through the CUDA runtime, with a new set of library management APIs that mirror the CUDA driver APIs, as other CUDA runtime APIs do.

Benefits of dynamic GPU device code loading

Enabling dynamic GPU device code loading provides several benefits:

- Explicit control over the GPU device code being loaded, in case it is modified separately from the loading compilation unit.
- Control over when the GPU device code is loaded, with options for how to load it through the loading APIs.
- On-the-fly compilation using other CUDA Toolkit components, such as NVRTC, to generate GPU device code modules.
- On-the-fly selective GPU device code linking using other CUDA Toolkit components, such as nvJitLink, for link-time optimization.
- A header-only library compiled with nvcc that must do dynamic GPU device code loading can now link against the CUDA runtime.

We discuss more details on each of these benefits throughout this post.

Static loading in the CUDA runtime

The CUDA runtime maintains state about which GPU device code is loaded during initialization. The GPU device code modules are determined by what is compiled and linked with compilation tools such as nvcc. During initialization, the CUDA runtime loads these GPU device code modules, and you interact with them implicitly, as in the following example:

main.cu:

#include <stdio.h>

__global__ void helloWorld()
{
    printf("Hello from the GPU!\n");
}

int main(int argc, char *argv[])
{
    cudaSetDevice(0);
    helloWorld<<<1, 1>>>();
    return cudaDeviceSynchronize();
}

This simplified example, when compiled with nvcc, creates an executable with the appropriate GPU device code module that enables the CUDA runtime to run and execute the helloWorld kernel on the GPU.

Dynamic loading in the CUDA driver

The CUDA driver requires you to dynamically load the GPU device code you want to execute, as well as manage state, such as CUDA contexts, that the CUDA runtime handles for you automatically. A similar example would be broken into two files with separate compilation trajectories. For more information about the various compilation trajectories, see NVIDIA CUDA Compiler Driver NVCC.

The code for the GPU would be compiled using nvcc into a standalone GPU device code module such as a .fatbin, .cubin, or standalone PTX file (that is, device.fatbin for this example).
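As a minimal sketch, the device code could live in its own source file like the following (the extern "C" linkage is an assumption here, used to keep the kernel name unmangled so it can later be retrieved by name):

device.cu:

// Compile to a standalone GPU device code module with, for example:
//   nvcc --fatbin -o device.fatbin device.cu
#include <stdio.h>

// extern "C" keeps the symbol name unmangled, so the kernel can later be
// looked up by the plain name "helloWorld".
extern "C" __global__ void helloWorld()
{
    printf("Hello from the GPU!\n");
}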

Then, you would have a main source file that uses and manages this .fatbin file containing the compiled and linked GPU device code module. A portion of the main source file is shown here without error checking for ease of reading:

main.c:

#include <cuda.h>

int main(int argc, char *argv[])
{
    …
    cuDeviceGet(&dev, 0);
    cuDevicePrimaryCtxRetain(&ctx, dev);
    cuCtxPushCurrent(ctx);
    cuLibraryLoadFromFile(&library, "device.fatbin", NULL, NULL, 0, NULL, NULL, 0);
    cuLibraryGetKernel(&kernel, library, "helloWorld");
    cuLaunchKernel((CUfunction)kernel, 1, 1, 1, 1, 1, 1, 0, NULL, NULL, NULL);
    cuCtxSynchronize();
    cuLibraryUnload(library);
    cuDevicePrimaryCtxRelease(dev);
    return 0;
}

The benefits of dynamic loading listed earlier are extended to the CUDA runtime and described further in the Use case enablement section.

Dynamic loading in the CUDA runtime

With these changes, the flexibility of dynamically loading GPU device code is now available through the CUDA runtime, and the earlier example can be condensed to the following code, removing the explicit CUDA context management needed in the driver example. A portion of an updated main source file is shown here without error checking for ease of reading:

main.cu:

#include <cuda_runtime_api.h>

int main(int argc, char *argv[])
{
    …
    cudaLibraryLoadFromFile(&library, "device.fatbin", NULL, NULL, 0, NULL, NULL, 0);
    cudaLibraryGetKernel(&kernel, library, "helloWorld");
    cudaLaunchKernel((const void*)kernel, 1, 1, NULL, 0, NULL);
    cudaDeviceSynchronize();
    cudaLibraryUnload(library);
    return 0;
}

Use case enablement

What kinds of use cases does this enable? Here are some examples that weren’t possible before:

- Pure CUDA runtime API usage
- Interchangeability of types between the CUDA driver and CUDA runtime
- Handle sharing between CUDA runtime instances

Pure CUDA runtime API usage

Until now, all dynamic GPU device code module loading required the CUDA driver APIs. Other libraries or applications that compile code with NVRTC or link GPU device code dynamically with nvJitLink required the driver to load their generated output.

With the new CUDA runtime dynamic loading APIs, the loading, management, and usage of these dynamic outputs can be done purely through the CUDA runtime.

Here's an example based on the SAXPY sample from the NVRTC documentation, updated to use the new CUDA runtime APIs.

Current NVRTC SAXPY example snippet:

// Load the generated PTX and get a handle to the SAXPY kernel.
CUdevice cuDevice;
CUcontext context;
CUmodule module;
CUfunction kernel;
CUDA_SAFE_CALL(cuInit(0));
CUDA_SAFE_CALL(cuDeviceGet(&cuDevice, 0));
CUDA_SAFE_CALL(cuCtxCreate(&context, 0, cuDevice));
CUDA_SAFE_CALL(cuModuleLoadDataEx(&module, ptx, 0, 0, 0));
CUDA_SAFE_CALL(cuModuleGetFunction(&kernel, module, "saxpy"));
…
// Execute SAXPY
void *args[] = { &a, &dX, &dY, &dOut, &n };
CUDA_SAFE_CALL(
    cuLaunchKernel(kernel,
                   NUM_BLOCKS, 1, 1,   // grid dim
                   NUM_THREADS, 1, 1,  // block dim
                   0,                  // shmem
                   NULL,               // stream
                   args, 0));          // arguments

Updated NVRTC SAXPY example snippet:

// Load the generated PTX and get a handle to the SAXPY kernel.
cudaLibrary_t library;
cudaKernel_t kernel;
CUDART_SAFE_CALL(cudaLibraryLoadData(&library, ptx, 0, 0, 0, 0, 0, 0));
CUDART_SAFE_CALL(cudaLibraryGetKernel(&kernel, library, "saxpy"));
…
// Execute SAXPY
void *args[] = { &a, &dX, &dY, &dOut, &n };
CUDART_SAFE_CALL(
    cudaLaunchKernel((void*)kernel,
                     NUM_BLOCKS,   // grid dim
                     NUM_THREADS,  // block dim
                     args,         // arguments
                     0,            // shmem
                     NULL));       // stream

Another benefit: until now, a header-only library compiled with nvcc that had to do dynamic GPU device code loading forced its users to link against the CUDA driver at compile time. It's now possible for such header-only libraries to avoid linking to the CUDA driver explicitly by using the CUDA runtime that nvcc links by default, as sketched below.
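Here is a minimal sketch of what such a header-only library might look like (the header name, helper function, and launch configuration are hypothetical; only CUDA runtime APIs are used):

loader.h:

#pragma once
#include <cuda_runtime_api.h>

// Loads a fatbin/cubin/PTX image from memory, launches a named kernel
// that takes no arguments, and unloads the library again.
inline cudaError_t loadAndLaunch(const void *image, const char *kernelName)
{
    cudaLibrary_t library;
    cudaKernel_t kernel;

    cudaError_t err = cudaLibraryLoadData(&library, image,
                                          NULL, NULL, 0, NULL, NULL, 0);
    if (err != cudaSuccess) return err;

    err = cudaLibraryGetKernel(&kernel, library, kernelName);
    if (err == cudaSuccess)
        err = cudaLaunchKernel((const void *)kernel, 1, 1, NULL, 0, NULL);
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();

    cudaLibraryUnload(library);
    return err;
}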

This also means that two distinct code paths, one for the historical static GPU device code loading of the CUDA runtime and one for the dynamic GPU device code loading of the CUDA driver, can be converged into a single code path using the CUDA runtime.

Interchangeability of types between the CUDA driver and CUDA runtime

Previously, kernel handles were not interchangeable between the CUDA runtime and CUDA driver as many other handles are, such as CUDA streams and CUDA events.

Now, cudaKernel_t and CUkernel (as well as cudaLibrary_t and CUlibrary) are interchangeable. To load using the CUDA runtime APIs but launch or set kernel attributes using the CUDA driver APIs, you can simply cast between the types.

Now, to do dynamic GPU device code loading, you don’t have to use only the CUDA driver APIs. You can use either set of APIs and only cast between the CUDA driver and CUDA runtime types.
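For example, here is a minimal sketch (error checking omitted; the attribute and the 32 KB value are illustrative) that loads a kernel with the CUDA runtime and then configures it through the CUDA driver API:

// Load the library and kernel with CUDA runtime APIs.
cudaLibrary_t library;
cudaKernel_t kernel;
cudaLibraryLoadFromFile(&library, "device.fatbin", NULL, NULL, 0, NULL, NULL, 0);
cudaLibraryGetKernel(&kernel, library, "helloWorld");

// The same handle works with CUDA driver APIs through a cast, for example
// to set a per-device kernel attribute.
CUdevice dev;
cuDeviceGet(&dev, 0);
cuKernelSetAttribute(CU_FUNC_ATTRIBUTE_MAX_DYNAMIC_SHARED_SIZE_BYTES,
                     32 * 1024, (CUkernel)kernel, dev);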

Handle sharing between CUDA runtime instances

Consider two theoretical libraries, library A and library B, each linked to their own static CUDA runtime.

The implicit nature of the historical CUDA runtime loading did not enable CUDA kernel handles to be shared between multiple CUDA runtime instances. In this case, there would be no way to share kernel handles from each of these libraries.

Now, with the CUDA runtime API cudaGetKernel, a library can get a handle to any of its kernels and pass it to another CUDA runtime instance. If there is a need to share a CUDA kernel between the two libraries, library A can call cudaGetKernel and pass the handle to library B. This can increase code sharing between libraries and reduce the need for each library to include its own kernel implementations.

In the following code example, libmatrix_mul.cu uses the new CUDA runtime dynamic loading APIs, while libvector_add.cu uses the traditional implicit loading in the CUDA runtime but takes advantage of the new cudaGetKernel API to get a handle to a shareable CUDA kernel.

In both cases, you can pass the cudaKernel_t handle to a third, independent library, libcommon, which can launch and use it. This is possible even if each library is linked to its own static CUDA runtime instance.

// matrix_mul.cu - using dynamic shared handles
void matrix_mul()
{
    cudaLibrary_t lib;
    cudaKernel_t kern;
    cudaLibraryLoadData(&lib, ptx, …);  // ptx from NVRTC
    cudaLibraryGetKernel(&kern, lib, "matrixMul");
    libcommon.foo(kern);
}

// vector_add.cu - using implicit shared handles
__global__ void vectorAdd() { … }

void vector_add()
{
    cudaKernel_t kern;
    cudaGetKernel(&kern, vectorAdd);
    libcommon.foo(kern);
}

// libcommon.cu - takes a shareable kernel handle
void foo(cudaKernel_t kern)
{
    cudaLaunchKernel(kern, ...);
}

This example is trivial, but it shows the potential host and GPU memory savings when libraries deduplicate the kernels they both need.

Get started with CUDA runtime dynamic loading

In this post, we introduced new CUDA runtime APIs that provide the ability to dynamically load GPU device code. This is a simpler way to load and execute device code on the GPU when only the CUDA runtime API is required.

To start using these APIs, download CUDA Toolkit version 12.8 or later. For more information about the cudaLibrary* and cudaKernel* APIs, see the CUDA Runtime API documentation.

