This blog post is part of a series designed to help developers learn NVIDIA CUDA Tile programming for building high-performance GPU kernels, using matrix multiplication as a core example.
In this post, you’ll learn:
- How to implement high-performance matrix multiplication using NVIDIA cuTile: understand the flow of tile loading, computation, and storage.
- The block-level parallel programming mindset: shift from thread-level thinking to block-level thinking.
- Best practices for Tile programming: learn performance optimization strategies from the code.

Before you begin, be sure your environment meets the following requirements (see the quickstart for more information):
Environment requirements:
- CUDA: 13.1 or higher
- GPU architecture: NVIDIA Blackwell (e.g., NVIDIA RTX 50 series)
- Python: 3.10 or higher

Install cuTile Python:
Note: cuTile is NVIDIA's next-generation GPU programming framework. It currently supports optimization only for the Blackwell architecture (compute capabilities 10.x and 12.x); support for more architectures will be provided in upcoming releases of the CUDA Toolkit.
What is matrix multiplication?
Matrix multiplication is a fundamental operation in modern technical computing. It’s the operation that is the basis for solving systems of equations. It underpins graphics, simulations, optimization, and most of machine learning, and it maps well to high-performance hardware like GPUs.
Given input matrices A (M×K) and B (K×N), each element of the result matrix C (M×N) is computed as follows: C[i][j] = Σₖ A[i][k] × B[k][j], where k runs from 0 to K−1.
From the formula, you can see that an element of C is computed by taking the dot product of a row of A and a column of B.
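As a concrete reference, the formula is just a triple loop. This plain-Python version (for illustration, not performance) computes each C[i][j] as the dot product described above:

```python
# Naive reference matmul: C[i][j] is the dot product of row i of A
# and column j of B, exactly as in the formula above.
def matmul(A, B):
    M, K = len(A), len(A[0])
    assert len(B) == K, "inner dimensions must match"
    N = len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            C[i][j] = sum(A[i][k] * B[k][j] for k in range(K))
    return C
```

For example, matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]) returns [[19, 22], [43, 50]].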
Tile programming can simplify the implementation while achieving excellent performance by dividing the output matrix into multiple tiles. Each Block is responsible for the calculation of one output tile, and cuTile automatically handles memory access and thread synchronization. Specifically:
- Each block processes a (tm × tn) tile of the output matrix C.
- Loop over the K dimension, loading corresponding tiles of A and B one by one.
- Use ct.mma() to perform matrix multiply-accumulate (automatically invoking Tensor Cores).
- Finally, store the accumulated results back to global memory.

Figure 1 shows the calculation process, which is like the element-by-element algorithm, but with tiles taking the place of individual elements.
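This per-tile flow can be emulated on the CPU with NumPy. Here tm, tn, and tk are illustrative tile sizes, each comment names the cuTile call the line stands in for, and the sketch assumes the matrix dimensions divide evenly by the tile sizes (in cuTile, zero padding covers the ragged edges):

```python
import numpy as np

def tiled_matmul(A, B, tm=2, tn=2, tk=3):
    # CPU emulation of the tile flow: each (bidx, bidy) pair plays the
    # role of one block computing a (tm x tn) tile of C.
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for bidx in range(M // tm):              # tile row of C
        for bidy in range(N // tn):          # tile column of C
            acc = np.zeros((tm, tn), dtype=A.dtype)           # accumulator
            for k in range(K // tk):         # loop over the K dimension
                a = A[bidx*tm:(bidx+1)*tm, k*tk:(k+1)*tk]     # ~ ct.load(A, ...)
                b = B[k*tk:(k+1)*tk, bidy*tn:(bidy+1)*tn]     # ~ ct.load(B, ...)
                acc += a @ b                                  # ~ ct.mma(a, b, acc)
            C[bidx*tm:(bidx+1)*tm, bidy*tn:(bidy+1)*tn] = acc # ~ ct.store
    return C
```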
Figure 1. Illustration of matrix multiply (A * B = C) broken into tiles

GPU kernel implementation
Having described the core idea, let’s look at the complete implementation code. The code is divided into two parts: the kernel running on the GPU and the launch code on the CPU, as shown in the code that follows.
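As a sketch of the two parts, reconstructed from the calls described later in this post (ct.kernel, ct.Constant, ct.bid, ct.load, ct.mma, ct.store, ct.launch) — the module name, helper names, and exact keyword signatures are assumptions, not the verified cuTile API:

```python
# Sketch only: reconstructed from the cuTile calls named in this post.
# Running it requires cuTile (CUDA 13.1+) and a Blackwell GPU; the import
# is guarded so the structure can be read (and parsed) anywhere.
try:
    import cuda.tile as ct  # module name is an assumption
except ImportError:
    ct = None

if ct is not None:
    @ct.kernel
    def matmul_kernel(A, B, C,
                      tm: ct.Constant[int],
                      tn: ct.Constant[int],
                      tk: ct.Constant[int]):
        bid = ct.bid(0)                              # 1D block ID
        # map to 2D tile coordinates; swizzle_2d_from_bid is discussed
        # in the swizzle section of this post
        grid_x = (A.shape[0] + tm - 1) // tm
        grid_y = (B.shape[1] + tn - 1) // tn
        bidx, bidy = swizzle_2d_from_bid(bid, grid_x, grid_y)
        acc = ct.zeros((tm, tn))                     # accumulator (name assumed)
        num_k_tiles = A.shape[1] // tk
        for k in range(num_k_tiles):                 # walk the K dimension
            a = ct.load(A, index=(bidx, k), shape=(tm, tk))
            b = ct.load(B, index=(k, bidy), shape=(tk, tn))
            acc = ct.mma(a, b, acc)                  # Tensor Core MMA
        ct.store(C, index=(bidx, bidy), value=acc)   # keyword names assumed

    def launch_matmul(A, B, C, tm=128, tn=128, tk=64):
        # Host side: one block per (tm x tn) output tile, flattened to 1D.
        M, N = A.shape[0], B.shape[1]
        grid_x = (M + tm - 1) // tm
        grid_y = (N + tn - 1) // tn
        ct.launch(matmul_kernel, grid=(grid_x * grid_y, 1, 1),
                  args=(A, B, C, tm, tn, tk))        # signature assumed
```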
Now, let’s break down each key part step by step.
1. Define the GPU kernel
In cuTile, the @ct.kernel decorator is used to mark a standard Python function as a GPU kernel:
This decorator indicates that:
- This function will execute on the GPU.
- Each block will run an independent instance of this function.
- It can't be called directly and must be launched using ct.launch().

2. Compile-time optimization: Constant type annotation
Notice that the parameters tm, tn, and tk use a special type annotation ct.Constant[int]:
This indicates they are compile-time constants. cuTile will generate specialized machine code for different tile size values, allowing the compiler to:
- Perform loop unrolling.
- Optimize memory access patterns.
- Generate optimal Tensor Core instructions.

3. Determining work scope: Block ID mapping
Each block computes a specific tile of the output matrix. Through the swizzle_2d_from_bid() function, we obtain the coordinates of the output tile assigned to the current block:
The function of this code is to determine which tile of the output matrix the current block should process. To understand this process, let’s start with the grid division on the host side.
Step 1: Host-side grid division
When launching the kernel on the host side (covered in detail in the host-side section below), calculate how many blocks are needed:
Logically, launch grid_x * grid_y blocks, flattened into a 1D grid: grid = (grid_x * grid_y, 1, 1).
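In plain Python, the grid computation is a pair of ceiling divisions (the matrix and tile sizes below are illustrative):

```python
def ceil_div(a, b):
    # number of tiles needed to cover a dimension of size a with tiles of size b
    return (a + b - 1) // b

M, N = 4096, 4096            # output matrix size (illustrative)
tm, tn = 128, 128            # output tile size (illustrative)

grid_x = ceil_div(M, tm)     # tiles along the M dimension
grid_y = ceil_div(N, tn)     # tiles along the N dimension
grid = (grid_x * grid_y, 1, 1)   # flattened into a 1D grid
```

The ceiling division matters when a dimension is not a multiple of the tile size: covering 100 rows with 64-row tiles still needs 2 tiles.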
Step 2: Getting block ID in kernel
Inside the kernel, each block gets its unique identifier via ct.bid(0):
Step 3: Mapping 1D block ID to 2D tile coordinates
The problem now is that the block ID (bid) is 1D, but the output matrix is 2D: we need to know which row and column of tiles this block should process. The swizzle_2d_from_bid() function performs this mapping.
Output result:
- bidx: the row index (M dimension) of the output tile the current block is responsible for. Range: [0, grid_x - 1].
- bidy: the column index (N dimension) of the output tile the current block is responsible for. Range: [0, grid_y - 1].

The specific mapping logic involves swizzling (used to improve memory access efficiency), which is explained in detail in the Performance optimization: Swizzle section below. For now, just understand that it converts a 1D block ID into 2D tile coordinates.
4. Preparing the accumulator: Initializing the output tile
Before looping through the K dimension, you need to create an accumulator to store intermediate results:
5. Core computation loop: Traversing the K dimension
This is the core of matrix multiplication. Now loop through every tile in the K dimension and accumulate the results:
Loading data:
- ct.load(A, index=(bidx, k), shape=(tm, tk)): loads a tile from matrix A.
- index=(bidx, k): specifies the coordinates of the tile to load, in tile space.
- shape=(tm, tk): the size of the tile.
- padding_mode=zero_pad: fills with zeros if the load reaches out of bounds.

Matrix multiply-accumulate:
ct.mma(a, b, accumulator): multiplies a * b, adds the product to accumulator, and stores the result back in accumulator (mma stands for matrix multiply-accumulate). When the shapes of a and b meet Tensor Core requirements, cuTile automatically invokes the GPU's Tensor Cores to accelerate this operation.

After the loop ends, the accumulator holds the complete result for the output tile.
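The semantics of the accumulate step can be checked on the CPU: in NumPy terms, ct.mma(a, b, acc) behaves like acc = a @ b + acc, so summing per-tile products over the K dimension matches one full product:

```python
import numpy as np

def mma(a, b, acc):
    # CPU stand-in for ct.mma: multiply a by b and add into the accumulator
    return a @ b + acc

# Accumulating two K-tiles gives the same result as one full product.
a1, a2 = np.ones((2, 3)), np.full((2, 3), 2.0)
b1, b2 = np.ones((3, 2)), np.full((3, 2), 3.0)
acc = np.zeros((2, 2))
acc = mma(a1, b1, acc)   # first K-tile
acc = mma(a2, b2, acc)   # second K-tile
full = np.hstack([a1, a2]) @ np.vstack([b1, b2])   # single un-tiled product
```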
Writing back results: Storing to global memory

Finally, write the calculated result back to global memory:
Launching the kernel: Host-side code
Now launch the kernel from the host. First, look at the complete code.
Launching the kernel on the host side requires three key steps:
Step 1: Calculate grid size
Based on the input matrix dimensions and tile size, calculate how many blocks are needed:
Step 2: Set tile size (compile-time constants)
Select appropriate tile dimensions based on data type:
These parameters are passed to the kernel as compile-time constants:
- tm: output tile rows (M dimension).
- tn: output tile columns (N dimension).
- tk: size of the tile loaded each iteration in the K dimension.

Note: The tile size configuration here is an example. In practice, different GPU architectures require different parameter configurations to achieve optimal performance. The best configuration depends on the M/N/K sizes, GPU architecture, shared memory size, register count, SM count, and so on. In development, it is recommended to use performance analysis tools (like NVIDIA Nsight Compute) to find optimal parameters. TileGym provides an autotuner to obtain optimal parameters automatically.
Step 3: Call ct.launch() to start kernel
Argument tuple: All parameters passed to the kernel; tm, tn, and tk will be recognized by the compiler as constants.
Performance optimization: Swizzle
Earlier, swizzling was introduced to improve performance. The code for swizzle_2d_from_bid is shown below.
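As a plain-Python sketch of one common grouped scheme (the grouping width group_size and the exact interleave used by TileGym may differ), the key property is a bijection from [0, grid_x * grid_y) to 2D tile coordinates that keeps consecutive block IDs within a narrow band of tile rows:

```python
def swizzle_2d_from_bid(bid, grid_x, grid_y, group_size=8):
    # Group consecutive block IDs into bands of `group_size` tile rows so
    # that blocks scheduled close together in time reuse the same tiles,
    # improving L2 cache hit rates.
    bids_per_group = group_size * grid_y
    group_id = bid // bids_per_group
    first_bidx = group_id * group_size
    # the last group may cover fewer than group_size rows
    rows_in_group = min(grid_x - first_bidx, group_size)
    bidx = first_bidx + (bid % bids_per_group) % rows_in_group
    bidy = (bid % bids_per_group) // rows_in_group
    return bidx, bidy
```

For example, with grid_x = grid_y = 4 and group_size = 2, block IDs 0 through 7 all land in tile rows 0 and 1, sweeping every column within that narrow band before rows 2 and 3 begin.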
How does swizzle improve performance?
It re-maps block IDs to tile indices through grouping and interleaving, making more efficient use of the cache.
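To make the cache argument concrete, you can count the distinct tiles each access pattern must load. The helper below is illustrative, assuming a 4×4 tile grid with K split into 4 tiles:

```python
def loads_needed(output_tiles, num_k_tiles=4):
    # Output tile (i, j) needs A-tiles (i, k) and B-tiles (k, j) for all k;
    # count the distinct tiles touched across the whole set of outputs.
    a_tiles = {(i, k) for i, j in output_tiles for k in range(num_k_tiles)}
    b_tiles = {(k, j) for i, j in output_tiles for k in range(num_k_tiles)}
    return len(a_tiles) + len(b_tiles)

row_of_four = [(0, j) for j in range(4)]                  # linear row access
block_2x2 = [(i, j) for i in range(2) for j in range(2)]  # 2x2 swizzled block
```

loads_needed(row_of_four) comes to 20 tiles, while loads_needed(block_2x2) comes to 16 for the same four outputs.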
Using four elements (shaded areas) of the output matrix as an example, the figure compares linear versus swizzled memory access.
Figure 2. Visual comparison of linear row access versus tiled block access

Method 1: Linear row access
- Calculates one row of data in the result matrix (e.g., four elements).
- Needs to read four blocks from the left matrix plus all 16 blocks from the right matrix.
- Total memory access: 20 data blocks.
- Right matrix data is loaded frequently and evicted quickly, resulting in a low cache hit rate.

Method 2: Swizzling / tiled block access
- Reorganizes computation into 2×2 local blocks.
- Only needs to read eight relevant blocks from the left matrix plus eight relevant blocks from the right matrix.
- Total memory access: 16 data blocks (a 20% reduction).
- Better data locality results in a higher cache hit rate.

Performance benchmarks
To verify the performance of the implemented matrix multiplication kernel, it was tested on an NVIDIA GeForce RTX 5080 (compute capability 12.0). You can find the complete benchmark code in the TileGym repository. Make sure to follow the installation instructions, and then you can run this and the other tests following the Quick Start instructions.
Test configuration:
- Data type: float16
- Matrix shape: standard square matrix (N×N)
- Test sizes: N = 1024, 2048, 4096, 8192, 16384 (i.e., 2^10 to 2^14)

The following figure shows the performance at different matrix sizes.
Figure 3. cuTile and PyTorch TFLOP/s performance compared across matrix sizes on NVIDIA GeForce RTX 5080

The results show that:
- At large matrix sizes, the cuTile implementation can fully utilize the GPU's computing power.
- With appropriate tile size configuration and swizzle optimization, the cuTile implementation achieves over 90% of the performance of the SOTA implementation (PyTorch calling cuBLAS).

Summary
This classic matrix multiplication example shows the complete process of implementing a GPU kernel using cuTile. Although matrix multiplication is simple, it contains the core ideas of Tile programming. Mastering these concepts will enable you to implement various high-performance GPU kernels using cuTile. Check out the full matrix multiply example and more in the TileGym repo, and start writing high-performance tile code today.