Accelerating Vision AI Pipelines with Batch Mode VC-6 and NVIDIA Nsight

SOURCE | 2 months ago

In vision AI systems, model throughput continues to improve. The surrounding pipeline stages must keep pace, including decode, preprocessing, and GPU scheduling. The previous post, Build High-Performance Vision AI Pipelines with NVIDIA CUDA-Accelerated VC-6, described this as the data-to-tensor gap: a performance mismatch between AI pipeline stages.

The SMPTE VC-6 (ST 2117-1) codec addresses this gap through a hierarchical, tile-based architecture. Images are encoded as progressively refinable Levels of Quality (LoQs), each adding incremental detail. This enables selective retrieval and decoding of only the required resolution, region of interest, or color plane, with random access to independently decodable frames. Pipelines can retrieve and decode only what the model needs.
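To make the LoQ hierarchy concrete, here is a minimal sketch of how plane dimensions shrink per LoQ. It assumes each LoQ halves both axes and rounds up, which matches the approximate ~4K, ~2K, ~1K, ~0.5K steps quoted later in this post. The helper name `loq_dimensions` and the 3840x2160 input are illustrative assumptions, not part of any VC-6 API.

```python
def loq_dimensions(width: int, height: int, loq: int) -> tuple[int, int]:
    """Approximate plane size at a given LoQ.

    Assumption: each LoQ halves both axes, rounding up. The exact
    rounding rules of the codec are not taken from the specification.
    """
    scale = 1 << loq
    # Ceiling division so odd sizes never round down to zero.
    return (-(-width // scale), -(-height // scale))


if __name__ == "__main__":
    full_w, full_h = 3840, 2160  # hypothetical UHD input
    for loq in range(4):
        w, h = loq_dimensions(full_w, full_h, loq)
        share = w * h / (full_w * full_h)
        print(f"LoQ-{loq}: {w}x{h} ({share:.1%} of full-resolution pixels)")
```

Under this assumption, each LoQ step cuts the pixel count by roughly a factor of four, which is why decoding only the LoQ a model needs saves so much work.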

However, efficient single-image execution does not automatically translate to efficient scaling. As batch sizes grow, the bottleneck shifts from single-image kernel efficiency to workload orchestration, launch cadence, and GPU occupancy.

This post focuses on the architectural changes required to scale VC-6 decoding for batched inference and training workloads. NVIDIA Nsight Systems and NVIDIA Nsight Compute, which help developers identify system- and kernel-level constraints, were used to redesign the VC-6 CUDA implementation for batch throughput. The result is up to ~85% lower per-image decode time compared to the previous implementation, with submillisecond decode for LoQ-0 (~4K) in batch and ~0.2 ms for lower LoQs, at identical output quality. This significantly improves pipeline efficiency for production vision AI workloads.

Introducing the VC-6 batch mode implementation

The new implementation is built around several architectural changes, including batch mode and kernel-level optimizations.

Batch mode: From N to a single decoder

- Execution model redesign: algorithmic changes to decode multiple images simultaneously with a single decoder
- Improved parallelization: leveraging the new work dimension (images) next to the existing parallelization dimensions (tiles, planes) to shift initial VC-6 tile hierarchy work to the GPU
- Minibatch pipelining

Kernel-level optimizations

- Nsight Compute-driven range decoder kernel optimization: the optimizations led to a ~20% kernel speedup

The following sections detail these changes to the VC-6 decoder in depth. As for any CUDA optimization, the plan was to start with a system-level profiler like Nsight Systems to identify and fix initial performance bottlenecks, and then use Nsight Compute to refine individual kernels.

Moving from N to a single decoder

The top part of Figure 1 shows the starting point, as detailed in Build High-Performance Vision AI Pipelines with NVIDIA CUDA-Accelerated VC-6.

A comparison between two Nsight Systems screenshots. Above, in “individual async decoders”, each decoder issues its own kernel, resulting in a quite messy overview with many decoders calling many small kernels, resulting in less than 100% load. Below, with the new version, just two CUDA API rows and large launched kernels are shown. 100% load is indicated for most of the time.

Figure 1. Batching several image decodes into a single VC-6 decoder, shown in Nsight Systems

The middle rows show heavy CUDA API usage, with each row corresponding to a separate decoder instance that decodes a single image. In All Streams, many small, concurrently running kernels on the GPU are shown in blue. The top row shows device utilization: light orange indicates less than full utilization, and dark orange indicates full load. In this example, even with enough dispatched work, full utilization is rarely reached, so the profiled execution model is not optimal.

This inefficiency is explained by the execution of numerous small kernels. Each kernel launch has several associated overheads, like scheduling and kernel resource management. In this setting, constant per-kernel overhead and little work per kernel lead to an unfavorable ratio between overhead and actual work.

Changing this requires altering the paradigm from many small kernels to a few larger kernels.
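The overhead argument can be put into numbers with a deliberately simple model. The sketch below assumes a fixed per-launch overhead and fully serialized kernels, which is a simplification because real launches overlap. All figures (5 µs overhead, 1,000 µs of total work, 200 versus 4 kernels) are hypothetical and not measurements from this post.

```python
def overhead_fraction(launch_overhead_us: float, work_us: float) -> float:
    """Share of one kernel's wall time that is fixed overhead."""
    return launch_overhead_us / (launch_overhead_us + work_us)


def total_time_us(total_work_us: float, num_kernels: int,
                  launch_overhead_us: float) -> float:
    """Serialized model: the same total work split across num_kernels launches."""
    return num_kernels * launch_overhead_us + total_work_us


if __name__ == "__main__":
    overhead = 5.0   # hypothetical per-launch overhead in microseconds
    work = 1000.0    # hypothetical total useful work in microseconds
    for kernels in (200, 4):
        per_kernel = work / kernels
        print(f"{kernels:>3} kernels: total {total_time_us(work, kernels, overhead):.0f} us, "
              f"overhead share {overhead_fraction(overhead, per_kernel):.1%}")
```

With these assumed numbers, 200 small kernels spend half of their time on overhead, while 4 large kernels spend about 2%.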

In this case, NVIDIA Nsight motivated an execution model redesign from N decoders for N images to a single decoder that decodes batches of N images at once. This new execution model redistributes the same total amount of work into fewer kernels, each with more work. The bottom part of Figure 1 shows the effect of this reimplementation: only two CUDA API timelines, a handful of large kernels, and full GPU utilization, indicated by the dark orange utilization row.

Shifting more work to GPU

In the initial implementation, decoding the root and narrow levels of the VC-6 tile hierarchies was performed on the CPU. For single-image decoding, the amount of work in these narrower stages was too small to justify GPU execution. In the batched design, although the work per image remains small, the aggregation of multiple images provides sufficient parallelism to efficiently utilize the GPU.

Additionally, the algorithm was modified to eliminate host-side logic for handling variable image dimensions. With that logic embedded in the GPU kernels, NVIDIA Nsight showed fewer synchronization points, lower submission latency, and a smoother pipeline.
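One common way to aggregate images of different sizes into a single launch is to flatten all per-image tiles into one work range and recover (image, tile) from the flat index with a prefix sum. The sketch below illustrates that general pattern on the CPU. It is an assumption about how such batching can be organized, not the VC-6 decoder's actual data layout, and `build_offsets` and `locate` are hypothetical names.

```python
from bisect import bisect_right
from itertools import accumulate


def build_offsets(tiles_per_image: list[int]) -> list[int]:
    """Exclusive prefix sum: offsets[i] is the first flat work index of image i."""
    return [0, *accumulate(tiles_per_image)]


def locate(flat_index: int, offsets: list[int]) -> tuple[int, int]:
    """Map a flat work index to (image, tile within that image)."""
    if not 0 <= flat_index < offsets[-1]:
        raise ValueError("flat_index outside the batched work range")
    image = bisect_right(offsets, flat_index) - 1
    return image, flat_index - offsets[image]


if __name__ == "__main__":
    tiles = [6, 2, 4]              # hypothetical tile counts for three images
    offsets = build_offsets(tiles)
    print("total work items:", offsets[-1])
    for idx in (0, 5, 6, 7, 11):
        print(idx, "->", locate(idx, offsets))
```

In a GPU kernel, each thread or block would perform the equivalent lookup on its own index, so images with different dimensions need no host-side branching.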

Figures 2 and 3 show the utilization and CPU overhead overview of decoding images at LoQ-0 and LoQ-2, indicating more severe inefficiencies for LoQ-2.

Similarly to the comparison in Figure 1, Figure 2 shows two Nsight Systems screenshots, this time showing NVTX ranges in relation to GPU execution statistics like clocks, compute in flight, SMs active, warp occupancy and more.

Figure 2. Comparing GPU utilization: multiple decoders versus batch mode (full resolution, LoQ-0) with CPU overhead shown in bottom row

The same comparison as in the previous figures for a lower quality, LoQ-2. It largely shows the same effect, that even for this amount of work, the GPU and the warps are fully utilized in the new setting, as compared to the previous software state.

Figure 3. Comparing GPU utilization: multiple decoders versus batch mode (quarter resolution, LoQ-2) with CPU overhead shown in bottom row

However, with batch mode VC-6 (bottom of Figures 2 and 3), GPU execution of even the smallest LoQs is feasible because the aggregated workload of several images can be efficiently computed on GPU.

Minibatch pipelining

The new decoder design splits each batch into minibatches. These go through a pipeline consisting of CPU processing, PCIe transfer, and GPU decoding stages. Images of a minibatch reside in a pipeline stage simultaneously, while stages operate concurrently and hide each other’s costs.

An Nsight Systems screenshot showing GPU utilization in relation to two instances of the CUDA API and NVTX. The GPU is fully utilized at all times, whilst the CUDA APIs show NVTX ranges titled “UPLOAD”, “GPU” and “CPU”. This indicates that during processing of one batch, the previous is downloaded to the host, and the next is uploaded to the GPU. The full utilization of the GPU indicates that this works well.

Figure 4. Minibatch pipelining in Nsight Systems showing GPU utilization in relation to instances of CUDA API and NVTX

Figure 4 illustrates this minibatch pipelining. Similar to Figure 1, the CUDA API calls are dispatched from two threads, UPLOAD and GPU, with minimal host-side resource usage. Work aggregation has reduced the number of CUDA API calls, memory operations, and synchronizations, while amortizing kernel launch overhead across the batch.
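The benefit of overlapping stages can be estimated with a small schedule simulation. The sketch below assumes three stages (CPU processing, PCIe transfer, GPU decoding) with a constant cost per minibatch and one minibatch per stage at a time. The stage durations are hypothetical and not taken from the profiles in this post.

```python
def serial_makespan(stage_ms: list[float], num_minibatches: int) -> float:
    """Every minibatch runs all stages before the next one starts."""
    return num_minibatches * sum(stage_ms)


def pipelined_makespan(stage_ms: list[float], num_minibatches: int) -> float:
    """Stages run concurrently; each stage handles one minibatch at a time."""
    done = [0.0] * len(stage_ms)  # when each stage finished its latest minibatch
    for _ in range(num_minibatches):
        ready = 0.0  # when this minibatch left the previous stage
        for s, cost in enumerate(stage_ms):
            start = max(ready, done[s])  # wait for the data and for the stage
            done[s] = start + cost
            ready = done[s]
    return done[-1]


if __name__ == "__main__":
    stages = [1.0, 0.5, 2.0]  # hypothetical CPU, PCIe, GPU cost per minibatch in ms
    print("serial   :", serial_makespan(stages, 8), "ms")
    print("pipelined:", pipelined_makespan(stages, 8), "ms")
```

In this model the pipelined time approaches the number of minibatches multiplied by the slowest stage, so the cheaper stages are hidden behind it.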

Kernel-level optimizations

Nsight Systems revealed that the initial optimizations alleviated CPU overhead and that further improvements require kernel optimization. The terminal_decode kernel, which implements a range decoder, is noteworthy. Nsight Compute highlighted microarchitectural constraints that were previously noncritical: typical low-level inefficiencies such as low streaming multiprocessor (SM) occupancy, warp divergence, noncoalesced memory accesses, and register pressure. These insights help developers eliminate or minimize such issues where possible.

The Nsight Compute source heatmap and Warp Stall Sampling (All Samples) show the measured time spent per source line. They indicate that significant time is spent on integer divisions in the range update logic (Figure 5). Because GPUs are not optimized for integer division and accuracy is non-negotiable, these operations could not be optimized further.

A screenshot of Nsight Compute showing three lines of source code on the left, each associated with warp stall sampling (all samples), and (not-issued samples.) The source code lines containing integer divisions have a large green and purple bar associated next to it, also a percentage of 11% in both cases. This indicates that a lot of the algorithm time is spent there.

Figure 5. Range decoder updates show that significant time is spent in integer divisions

For the decoder table lookup, implemented as binary search on shared memory, Nsight Compute also revealed significant short scoreboard stalls (Figure 6).

An Nsight Compute screenshot, this time source code lines and stall reasons in relation to SASS-instructions, the latter including a scoreboard dependency graphic. The scoreboard dependencies show that some SASS machine instruction lines are dependent on previous operations. In this case, two lines need to wait for an LDS instruction to finish. This results in an execution stall, as the load needs to be finished before the execution can continue.

Figure 6. Short scoreboard stalls (pink) on the left, scoreboard dependencies for LDS (load shared memory) operations (right)

These stalls point to shared memory loads (LDS in Figure 6). The lookup table was kept in shared memory because dynamic indexing into a per-thread local array would otherwise result in slow local memory access. Because the lookup tables are constant in size, it is possible to replace this approach with a local array and an unrolled loop. Compared to the binary search, this exhaustive search enables constant indexing into a fixed-size array that can reside in registers. The combination of these two changes applied to both range decoders produced a ~20% speedup of this kernel.
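The logic of swapping a binary search for an exhaustive, fixed-trip-count search can be sketched on the CPU. The example below assumes a small cumulative-frequency table in which symbol s covers values from CUM_FREQ[s] up to, but not including, CUM_FREQ[s + 1]. The table contents and function names are hypothetical. This is a Python illustration of the equivalence, not the terminal_decode kernel itself.

```python
from bisect import bisect_right

# Hypothetical cumulative-frequency table; the last entry is the total.
CUM_FREQ = (0, 3, 7, 12, 16)


def lookup_binary(value: int) -> int:
    """Data-dependent search: which entries are read depends on the value."""
    return bisect_right(CUM_FREQ, value) - 1


def lookup_exhaustive(value: int) -> int:
    """Fixed-trip-count search: every interior threshold is compared once.

    Because the loop bounds and indices do not depend on the data, a compiler
    can unroll the equivalent GPU loop and keep the table in registers.
    """
    symbol = 0
    for threshold in CUM_FREQ[1:-1]:
        symbol += int(value >= threshold)
    return symbol


if __name__ == "__main__":
    for value in range(CUM_FREQ[-1]):
        assert lookup_binary(value) == lookup_exhaustive(value)
    print("both lookups agree for all", CUM_FREQ[-1], "values")
```

The exhaustive version does more comparisons, but on a GPU it trades them for the removal of dependent memory loads, which is the stall source the profile pointed to.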

Two side-by-side images showing a subset of the memory hierarchy diagram of Nsight Compute. This shows data the kernel consumes, for example directly from global, local and shared memory, and then also from L1/Tex Cache. Two versions are shown, before and after. In the before, the kernel reads from shared, global and local memory, in the after, it avoids reading from shared and local memory, only reading from global memory. For the L1 cache, the hitrate also goes up (9.4% vs 71.77%), indicating that this path is leveraged more.

Figure 7. Kernel memory usage before modifications (left) and after modifications (right), indicating no use of shared memory (bottom) and local memory

Figure 7 shows the memory charts of Nsight Compute. Visually, it confirms that neither shared (last row) nor local memory (row 2) is used after the modification.

The trade-off is increased register usage, from 48 to 92 registers per thread. Here, it is acceptable given the per-thread limit of 255 registers and the relatively small grid dimensions of this kernel. Since targeting high block residency per SM is not a priority at this stage, the additional register pressure does not limit overall throughput.
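A rough upper bound shows what the register increase costs in resident threads. The sketch assumes 65,536 32-bit registers per SM, which holds for recent NVIDIA data center architectures, and it ignores register allocation granularity as well as the other occupancy limits (shared memory, maximum threads and blocks per SM). Treat the results as an estimate only.

```python
REGISTERS_PER_SM = 65_536  # assumption: 64K 32-bit registers per SM
WARP_SIZE = 32


def max_resident_threads(regs_per_thread: int,
                         registers_per_sm: int = REGISTERS_PER_SM) -> int:
    """Register-only upper bound on resident threads, rounded down to whole warps."""
    threads = registers_per_sm // regs_per_thread
    return threads - threads % WARP_SIZE


if __name__ == "__main__":
    for regs in (48, 92):
        threads = max_resident_threads(regs)
        print(f"{regs} registers/thread -> at most {threads} threads "
              f"({threads // WARP_SIZE} warps) per SM")
```

Under these assumptions the bound drops from 1,344 to 704 threads per SM, roughly half, which matters only if the kernel's grid is large enough to fill the SMs in the first place.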

Another optimization was to replace a custom selection routine with a cub::DeviceSelect function call. This simplifies the code, and off-loads the maintenance and optimization aspects for current and upcoming hardware to CUB.
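For readers unfamiliar with selection (stream compaction), the following sketch shows the semantics such a primitive provides: keep the items that satisfy a condition, preserve their order, and report how many were kept. It uses the scan-and-scatter formulation that parallel implementations are typically built on. It is a sequential Python illustration with a hypothetical `select_if` helper, not the CUB API and not the routine that was replaced.

```python
from itertools import accumulate
from typing import Callable, Sequence


def select_if(items: Sequence, predicate: Callable[[object], bool]):
    """Order-preserving selection via flags, exclusive scan, and scatter."""
    flags = [1 if predicate(x) else 0 for x in items]
    # The exclusive prefix sum gives every kept item its output slot.
    positions = [0, *accumulate(flags)][:-1]
    num_selected = sum(flags)
    out = [None] * num_selected
    for x, keep, pos in zip(items, flags, positions):
        if keep:
            out[pos] = x  # on a GPU, each thread would write its own slot
    return out, num_selected


if __name__ == "__main__":
    values = [5, 0, 3, 0, 0, 8]  # hypothetical input
    print(select_if(values, lambda v: v != 0))
```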

Performance scaling and updated results

Figure 8 compares per-image decode time across batch sizes between the previous and improved implementation, evaluated at four LoQs (LoQ-0 ~4K, LoQ-1 ~2K, LoQ-2 ~1K, LoQ-3 ~0.5K) using the UHD-IQA dataset (available through V-Nova on Hugging Face).

A bar chart, comparing VC-6 before and after optimization in per-image decode time in relation to batch size from 1-256. Four LoQs are shown, corresponding to even steps between 4K down to ~0.5K. Overall, the optimized version is much faster. More details are in the text.

Figure 8. VC-6 per-image decode time comparison on NVIDIA L40S (g6e.8xlarge), UHD-IQA dataset

Two distinct scaling behaviors emerge:

- The previous implementation plateaus beyond small batch sizes (approximately 1–16). Additional images do not translate into further per-image gains. In contrast, the optimized CUDA implementation continues to improve as batch size increases. For example, LoQ-0 (~4K) decode time drops below 1 ms per image at large batch sizes.
- The relative improvement grows at lower LoQs. Smaller per-image workloads expose more independent work that can be aggregated, resulting in better GPU utilization. At higher batch sizes, LoQ-2 decoding reaches ~0.2 ms per image and LoQ-3 ~0.14 ms.

Measured improvements include:

- ~36% lower per-image decode time at batch size 1 (LoQ-0)
- ~70–80% lower per-image decode time at batch sizes 16–32 for LoQ-2 and LoQ-3
- Up to ~85% lower per-image decode time at batch size 256
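A "percent lower decode time" figure converts to a speedup factor as 1 / (1 - reduction), and a per-image time converts to decode-only throughput as 1000 / time in milliseconds. The sketch below applies both conversions to the approximate figures quoted in this post. The helper names are illustrative, and the throughput values cover decoding only, not a full pipeline.

```python
def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional time reduction (0.85 for 85% lower) to a speedup factor."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def images_per_second(per_image_ms: float) -> float:
    """Decode-only throughput implied by a per-image decode time."""
    return 1000.0 / per_image_ms


if __name__ == "__main__":
    for reduction in (0.36, 0.70, 0.80, 0.85):
        print(f"{reduction:.0%} lower time -> {speedup_from_reduction(reduction):.2f}x faster")
    for ms in (1.0, 0.2, 0.14):
        print(f"{ms} ms per image -> about {images_per_second(ms):.0f} images/s")
```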

Figure 9 shows the performance of the redesigned implementation across batch sizes on NVIDIA H100 (Hopper) and NVIDIA B200 (Blackwell) GPUs. The results indicate that the performance gains are not silicon-specific but stem from the improved batch mode, which exposes sufficient parallel work to saturate modern GPU architectures.

A bar chart, comparing 3 LoQ decode times (per-image decode time in ms) in relation to batch sizes from 1-256. Left shows the result of an H100, and right of a B200 GPU. The scaling is similar, batch-size 1 taking the longest, and the rest being successively faster. It shows that on both GPUs, the algorithm works as desired.

Figure 9. VC-6 batch mode, per-image decode time (NVIDIA H100, NVIDIA B200, UHD-IQA dataset)

VC-6 for vision AI pipelines

Intelligent, tailored-to-fit decoding that leverages VC-6 intra-only random access, LoQ decoding, and selective region-of-interest or color channel access can benefit training, inference, and video summarization workflows. This is an avenue for future work.

Get started with VC-6 decoding

Scaling VC-6 decoding requires more than kernel tuning. Nsight profiling reveals structural limits in launch cadence, occupancy, thread divergence, and memory behavior. By redesigning the CUDA execution model to expose more independent work and amortize overhead across batches, the new implementation achieves up to ~85% lower per-image decode time, reaching submillisecond decode for LoQ-0 (~4K) in batch and ~0.2 ms for lower LoQs, with identical output quality.

As vision AI workloads continue to scale, overall pipeline efficiency is determined at every step, including both the decode and preprocessing stages.

To get started, check out these resources:

- VC-6 samples
  - Examples for VC-6 encoding and selective decoding
  - Benchmark suite to reproduce our results with Hugging Face datasets
- VC-6 AI Blueprint
  - Demo showcasing VC-6 selective decoding in vision AI pipelines
  - Reference integration patterns for multiple use cases
