By enabling CUDA kernels to be written in Python much as they are written in C++, Numba bridges the gap between the Python ecosystem and the performance of CUDA.
However, CUDA C++ developers have access to many libraries that presently have no exposure in Python. These include the CUDA Core Compute Libraries (CCCL), cuRAND, and header-based implementations of numeric types like bfloat16, to name a few.
While each CUDA C++ library could expose itself to Python in its own way, manually writing bindings for each library is laborious, repetitive, and prone to inconsistency. For example, the float16 and bfloat16 data types each define over 60 similar standalone functions, so nearly identical bindings would have to be written for both types.
In addition, manually created bindings often fall out of sync when the underlying CUDA C++ library introduces new features.
Solution: Numbast
Numbast establishes an automated pipeline that converts CUDA C/C++ APIs into Numba bindings.
At a high level, top-level declarations are read from CUDA C++ header files, serialized, and passed to Python APIs. Numba binding generators then iterate through these declarations and generate Numba extensions for each of the APIs.
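The generation loop can be pictured with a minimal sketch. The record and generator below are hypothetical stand-ins: real AST_Canopy declarations carry far more detail, and the real generator registers Numba typing and lowering rather than returning plain Python functions.

```python
from dataclasses import dataclass

# Hypothetical record for one serialized declaration. Real AST_Canopy
# records carry far more detail (qualified types, method kinds,
# execution space, and so on).
@dataclass
class FunctionDecl:
    name: str
    param_types: tuple
    return_type: str

def generate_binding(decl):
    """Generate a Python-callable stub for one declaration.

    A real binding generator registers typing and lowering with Numba;
    this stub only mirrors the declaration's shape.
    """
    def stub(*args):
        if len(args) != len(decl.param_types):
            raise TypeError(f"{decl.name} expects {len(decl.param_types)} arguments")
        return f"call {decl.name} -> {decl.return_type}"
    stub.__name__ = decl.name
    return stub

# Iterate through the declarations and emit one binding per API.
decls = [
    FunctionDecl("hsqrt", ("myfloat16",), "myfloat16"),
    FunctionDecl("hsin", ("myfloat16",), "myfloat16"),
]
bindings = {d.name: generate_binding(d) for d in decls}
```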
Demo: C++ declaration of a simple struct
To show Numbast in action, the following example shows how Numba bindings are created for a demo myfloat16 type. These C++ declarations are inspired by those in the CUDA float16 header, giving a simplified version to demonstrate the binding generation in practice.
C++ declaration
This demo exercises common C++ elements: object construction, attribute access, standalone function calls, type casts, and arithmetic operator overloading.
For more information about the language features supported in Numbast, see Supported CUDA C++ declarations.
Setup script with Numbast
Numbast usage normally involves two steps:
1. Parsing header files with AST_Canopy.
2. Generating Numba bindings from the parsed headers.

The following code example sets up the Numba bindings by implementing these two steps:
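A minimal toy sketch of the two-step flow, with a regex standing in for AST_Canopy's clangTooling-based parser and placeholder stubs standing in for real Numba extension registration:

```python
import re

# Step 1 (toy stand-in for AST_Canopy): "parse" a header snippet into
# declaration records. The real AST_Canopy parses full CUDA C++ with
# clangTooling, not a regex.
header = """
__device__ myfloat16 hsqrt(myfloat16 x);
__device__ myfloat16 hsin(myfloat16 x);
"""

decl_re = re.compile(r"__device__\s+(\w+)\s+(\w+)\s*\(([^)]*)\)\s*;")
decls = [
    {"return_type": ret, "name": name, "params": params}
    for ret, name, params in decl_re.findall(header)
]

# Step 2 (toy stand-in for Numbast): emit one binding per declaration.
# The real generator registers a Numba extension here instead.
bindings = {d["name"]: (lambda d=d: f"binding for {d['name']}") for d in decls}
```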
Data models are the different ways that Numba represents the underlying data. The PrimitiveModel used for myfloat16 is well-suited for scalars. A StructModel (not used here) is useful for classes and structs. Other data models are less commonly used.
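The distinction can be pictured with a loose ctypes analogy (this is not Numba's data model API): a PrimitiveModel-like representation stores the value as a single machine word, while a StructModel-like representation is an aggregate of named fields.

```python
import ctypes

# PrimitiveModel-style: the myfloat16 payload is a single 16-bit value.
class MyFloat16Primitive(ctypes.c_uint16):
    """Scalar stored as one machine value."""

# StructModel-style: an aggregate of named fields (hypothetical type,
# used here only to contrast the two layouts).
class Complex32Struct(ctypes.Structure):
    """Two named 16-bit fields."""
    _fields_ = [("real", ctypes.c_uint16), ("imag", ctypes.c_uint16)]
```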
Usage in the most natural way
In CUDA C++, you can construct a myfloat16 object and use it as follows:
In Numba kernels, you use it the same way:
Thanks to type inference in Numba, the code is even cleaner than the original C++.
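A toy pure-Python stand-in illustrates the surface syntax (hypothetical: the generated binding lowers to CUDA device code, and this class exists only to show that the Python spelling drops the `auto` declarations):

```python
import math

# Toy stand-in for the generated myfloat16 binding; real usage happens
# inside a @cuda.jit kernel with device-side lowering.
class myfloat16:
    def __init__(self, value):
        self.data = float(value)
    def __add__(self, other):
        return myfloat16(self.data + other.data)

def hsqrt(x):
    return myfloat16(math.sqrt(x.data))

hpi = myfloat16(3.14)  # object construction
pi2 = hpi + hpi        # arithmetic operation
r = hsqrt(hpi)         # function invocation
```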
First supported bindings: bfloat16 data type
The first Numba binding supported through Numbast is a new bfloat16 data type. The bfloat16 data type can interoperate with PyTorch’s torch.bfloat16 data type, so that you can efficiently develop custom compute kernels with this new data type.
The following code example shows how to use the new bfloat16 data type to develop a Numba kernel that performs computation on torch tensors. It passes a PyTorch array of torch.bfloat16 type into a Numba compute kernel and performs math operations using CUDA intrinsics bound through Numbast.
You can download Numbast and the bfloat16 Numba bindings from conda-forge:
Architecture
Numbast consists of two components:
- AST_Canopy: An underlying layer that parses and serializes C++ headers.
- Numbast: A user-facing layer that consumes the parsed results and builds Numba bindings dynamically.

AST_Canopy: declaration parser
In forest ecology, a canopy refers to the upper layer of a forest habitat zone. AST_Canopy is the package that inspects the top-level declarations in the forest of abstract syntax trees, extracting information from them and passing them to the Python layer. Here, top level refers to the public, user-facing APIs that a CUDA C++ library exposes to users.
Figure 1. Layered architecture of AST_Canopy

Figure 1 shows the architecture of AST_Canopy:
- clangTooling: A Clang library used to support writing standalone tools, such as Numbast.
- libastcanopy: Implements the declaration-parsing logic using clangTooling.
- pylibastcanopy: A binding that directly exposes libastcanopy APIs in Python.
- AST_Canopy: A layer on top of pylibastcanopy that provides a pleasantly Pythonic user experience.

In addition to header parsing and serialization, AST_Canopy also provides the following features:
- Environment detection at runtime: Auto-detects the libstdcxx and CUDA headers installed through Conda packages, and sets up the clang compilers accordingly.
- Flexibility in compute capability parsing: Enables configuring AST parsing based on different compute capabilities. Some headers conditionally expose code based on compute capability, and this feature supports cases where the header-serialization and runtime environments differ.

Numbast: binding generator
Numbast sits downstream of AST_Canopy: it consumes the declaration information and generates Numba bindings automatically. At a high level, Numbast exists to provide a translation layer between C++ and Python syntax. As the demo suggests, most simple C++ syntax finds its natural counterpart in Python (Table 1).
| Operation | CUDA C++ | Numba |
| --- | --- | --- |
| Object construction | auto hpi = myfloat16(3.14) | hpi = myfloat16(3.14) |
| Attribute access | auto data = hpi.data | data = hpi.data |
| Function invocation | auto r = hsqrt(hpi) | r = hsqrt(hpi) |
| Type casts | auto fpi = float(hpi) | fpi = types.float32(hpi) |
| Arithmetic operations | auto pi2 = hpi + hpi | pi2 = hpi + hpi |

Table 1. Simple C++ syntax and its Numba counterpart
Numba's type system has much in common with C- and C++-like languages. Of course, C++ also has features with no Python counterpart, such as pointer semantics and template-based metaprogramming.
Numbast is the middle layer that encapsulates the similarities and differences between C++ and Python.
Lowering: The bigger picture
Bindings generated by Numbast are lowered through a Numba feature called foreign function invocation (FFI). Numbast generates ABI-compatible shim functions that wrap the native CUDA function calls, and these shims are compiled with NVRTC. Expect the same optimized performance that a CUDA C++ developer enjoys, minus the overhead of the FFI calls.
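The shim-plus-FFI idea can be pictured with a loose CPU-side analogy using Python's ctypes (this is not Numba's mechanism, only an illustration of the pattern): declare a foreign function's ABI, then call the native code through that declaration.

```python
import ctypes
import ctypes.util
import math

# Load the C math library and declare the ABI of one native function,
# playing the role of a shim over a foreign (non-Python) call.
libm = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")
libm.sqrtf.restype = ctypes.c_float
libm.sqrtf.argtypes = [ctypes.c_float]

result = libm.sqrtf(2.0)  # executes the native single-precision sqrt
```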
Future versions of numba-cuda will introduce link-time optimization (LTO) support, further narrowing the performance gap between accelerated Numba kernels and native CUDA C++.
Caveats
Both AST_Canopy and Numbast have caveats worth noting. AST_Canopy depends on clangTooling, so new CUDA language features that clangTooling does not yet support, and libraries that depend on those features, may not be parsed correctly. However, the majority of libraries and headers use only features that clangTooling supports.
Conclusion
In this post, we introduced a new Numba binding generation tool, Numbast. We showed that, by using Numbast, you can quickly benefit from the ever-growing feature set of CUDA C++.
Numbast v0.1.0 provides the new data type bfloat16 to Numba. You can expect to see more bindings generated by Numbast, including new data types, NVSHMEM bindings, and CCCL bindings.