NVIDIA CUDA 13.3 (Image © Nvidia)

Extended programming options for C++ and Python

An important new feature in this release is CUDA tile programming for C++. This model automates memory movement and parallelism, allowing developers to write high-level tile kernels that remain portable across different GPU architectures, including the Hopper series.
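To make the idea of tile-level decomposition concrete, the following is a minimal conceptual sketch in plain Python. It is not the CUDA tile API, and the function name and tile size are hypothetical, chosen only for illustration. It shows how a matrix product can be split into independent output tiles, which is the kind of structure a tile kernel expresses while leaving memory movement to the toolchain.

```python
# Conceptual sketch only: plain Python, not the CUDA tile programming API.
TILE = 2  # hypothetical tile edge length


def tiled_matmul(a, b, tile=TILE):
    """Multiply a (n x k) by b (k x m), visiting the output one tile at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    # Each (i0, j0) pair identifies one output tile. The tiles do not depend
    # on each other, so they could be processed in parallel.
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            # Walk the shared dimension tile by tile and accumulate partial sums.
            for k0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = 0.0
                        for p in range(k0, min(k0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] += acc
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
result = tiled_matmul(a, b)
```

The `min(...)` bounds handle edge tiles when the matrix dimensions are not multiples of the tile size, a detail that a tile programming model would normally take care of on the developer's behalf.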

At the same time, CUDA Python has reached version 1.0. This milestone introduces a strict semantic versioning policy to ensure API stability. The cuda.core library is now stable and offers several advanced features:

  • Green Contexts: These allow partitioning of streaming multiprocessors (SMs) to protect latency-sensitive kernels from long-running tasks.
  • Process Checkpointing: A Linux-exclusive feature that snapshots the full CUDA state of a process so that it can be restored later.
  • Inter-Process Sharing: Facilitates GPU memory sharing across multiple Python processes without the need for host-side copying.
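What a strict semantic versioning policy means for downstream code can be sketched in a few lines. The helper names and version strings below are hypothetical and not part of CUDA Python. The sketch assumes the usual MAJOR.MINOR.PATCH convention, under which only a major version bump may break the API.

```python
def parse_version(text):
    """Split a 'MAJOR.MINOR.PATCH' string into a tuple of integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def is_compatible(installed, required):
    """Return True if `installed` satisfies `required` under semantic versioning."""
    inst = parse_version(installed)
    req = parse_version(required)
    # Same major version means no breaking changes; the installed release
    # must also be at least as new as the one the code was written against.
    return inst[0] == req[0] and inst >= req


# Hypothetical version strings, for illustration only.
newer_minor_ok = is_compatible("1.2.0", "1.0.0")
major_bump_ok = is_compatible("2.0.0", "1.4.0")
too_old_ok = is_compatible("1.0.0", "1.1.0")
```

Under this convention, code written against a 1.0 release can expect to keep working on later 1.x releases, while a 2.0 release signals that it may need changes.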

In addition, the new Numba CUDA MLIR backend can be used in place of the standard numba.cuda import, which significantly lowers compilation latency and reduces host-side dispatch overhead.

Performance optimization via CompileIQ

To maximize kernel efficiency, NVIDIA has introduced CompileIQ. Unlike standard compilers, which rely on general-purpose heuristics, CompileIQ uses genetic and evolutionary algorithms to generate specialized configurations for individual kernels. According to NVIDIA, this framework delivers a performance increase of up to 15% for critical operations such as GEMM and attention kernels, which are central to large language model (LLM) inference.
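The general idea behind evolutionary tuning can be illustrated with a toy search. This is not CompileIQ's interface or algorithm. The search space, the `cost` function (a synthetic stand-in for a measured kernel runtime) and all names are assumptions made for this sketch.

```python
import random

# Hypothetical search space: tile sizes and unroll factors for one kernel.
TILE_SIZES = [16, 32, 64, 128, 256]
UNROLLS = [1, 2, 4, 8]


def cost(config):
    """Synthetic stand-in for a measured kernel runtime (lower is better)."""
    tile, unroll = config
    return abs(tile - 64) / 64 + abs(unroll - 4) / 4


def mutate(config, rng):
    """Randomly change one of the two parameters."""
    tile, unroll = config
    if rng.random() < 0.5:
        tile = rng.choice(TILE_SIZES)
    else:
        unroll = rng.choice(UNROLLS)
    return (tile, unroll)


def evolve(generations=30, population_size=8, seed=0):
    rng = random.Random(seed)
    population = [
        (rng.choice(TILE_SIZES), rng.choice(UNROLLS))
        for _ in range(population_size)
    ]
    for _ in range(generations):
        # Selection: keep the better half, so the best candidate always survives.
        population.sort(key=cost)
        survivors = population[: population_size // 2]
        # Variation: refill the population with mutated copies of survivors.
        children = [mutate(rng.choice(survivors), rng) for _ in survivors]
        population = survivors + children
    return min(population, key=cost)


best = evolve()
```

A real tuner would replace `cost` with an actual benchmark of the compiled kernel, which is why such searches are specialized per kernel rather than driven by one general heuristic.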

Library updates and Tensor interoperability

The CUDA Core Compute Libraries (CCCL 3.3) now offer improved tensor interoperability. By using DLPack and mdspan, developers can pass tensors between Python frameworks and C++ kernels without losing structural information such as shape and memory layout. The library also offers a suite of 17 random distributions and a new search algorithm, cub::DeviceFind::FindIf, which is reported to be up to 7x faster.
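The "structural information" in question is essentially a shape and a set of strides attached to a flat buffer. The following sketch shows the concept in plain Python. The `TensorView` class is hypothetical and uses neither DLPack nor mdspan, but like both of them it counts strides in elements rather than bytes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TensorView:
    """A non-owning view: flat storage plus the metadata needed to index it."""
    data: list       # flat storage, shared between views and never copied
    shape: tuple
    strides: tuple   # counted in elements, not bytes

    def at(self, *index):
        # Map a multi-dimensional index to an offset in the flat buffer.
        offset = sum(i * s for i, s in zip(index, self.strides))
        return self.data[offset]

    def transpose(self):
        # Transposing only swaps metadata; the underlying buffer is untouched.
        return TensorView(self.data, self.shape[::-1], self.strides[::-1])


buffer = [0, 1, 2, 3, 4, 5]
row_major = TensorView(buffer, shape=(2, 3), strides=(3, 1))
transposed = row_major.transpose()
```

Because the transposed view shares the same buffer, handing a tensor across a language boundary only requires passing a pointer together with this metadata, not copying or re-laying out the data.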

The core math libraries have also received targeted optimizations:

  • cuBLAS and cuSPARSE: Both libraries have been updated to support the Blackwell architecture and new matrix formats, with certain APIs in cuSPARSE showing a 2.5x jump in performance.
  • cuSOLVER: The introduction of 64-bit interfaces and low-precision preconditioning has reduced the solution time for large matrices on B200 GPUs by about 20%.
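Why 64-bit interfaces matter for large matrices comes down to simple arithmetic. The sketch below assumes that a 32-bit interface describes sizes with signed 32-bit integers. The helper name is hypothetical and the code does not call cuSOLVER.

```python
import math

INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def fits_in_int32(rows, cols):
    """Check whether the element count of a rows x cols matrix fits in int32."""
    return rows * cols <= INT32_MAX


# Largest square matrix whose element count is still addressable with int32.
largest_square = math.isqrt(INT32_MAX)

small_ok = fits_in_int32(largest_square, largest_square)
large_ok = fits_in_int32(50_000, 50_000)
```

A dense 50,000 x 50,000 matrix holds 2.5 billion elements, which already exceeds the signed 32-bit range of about 2.147 billion, so element counts and offsets need 64-bit integers at that scale.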

Compiler and system level optimizations

The NVCC and NVRTC compilers now fully support the C++23 standard. To simplify the development workflow, NVRTC now bundles standard headers, eliminating the need to manually manage include paths.

At the system level, CUDA 13.3 improves multi-tenant stability through MPS sub-error isolation, which ensures that an error in a client partition does not terminate unaffected processes. In addition, support for mmap() provides a low-latency alternative for mapping discrete GPU memory to the CPU in environments where certain kernel drivers cannot be installed.
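For readers unfamiliar with mmap(), the mechanism itself can be shown with Python's standard mmap module. This sketch maps an ordinary temporary file, not GPU memory, and the helper name is hypothetical. It only illustrates that a mapped region is read and written like regular memory, without explicit read or write calls.

```python
import mmap
import os
import tempfile


def update_through_mapping(path, offset, payload):
    """Overwrite bytes in a file by writing into a memory mapping of it."""
    with open(path, "r+b") as handle:
        # Length 0 maps the whole file into the process address space.
        with mmap.mmap(handle.fileno(), 0) as view:
            view[offset:offset + len(payload)] = payload
            view.flush()  # push the modified pages back to the file


fd, path = tempfile.mkstemp()
try:
    os.write(fd, b"hello world")
    os.close(fd)
    update_through_mapping(path, 6, b"CUDA!")
    with open(path, "rb") as handle:
        contents = handle.read()
finally:
    os.remove(path)
```

In the CUDA case described above, the mapped object would be GPU memory exposed by the driver rather than a file on disk, but the CPU-side access pattern is the same.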