Follow this series to learn about CUDA programming from scratch with Python
GPUs (graphics processing units), as the name implies, were originally developed for computer graphics. Since then, they have become ubiquitous in almost every area that requires high computational throughput. This progress has been enabled by the development of GPGPU (general purpose GPU) interfaces, which allow us to program GPUs for general-purpose computing. The most common of these interfaces is CUDA, followed by OpenCL and most recently, HIP.
CUDA was originally designed to be compatible with C. Later versions extended it to C++ and Fortran. In the Python ecosystem, one of the ways of using CUDA is through Numba, a Just-In-Time (JIT) compiler for Python that can target GPUs (it also targets CPUs, but that’s outside of our scope). With Numba, one can write kernels directly with (a subset of) Python, and Numba will compile the code on-the-fly and run it. While it does not implement the complete CUDA API, its supported features are often enough to obtain impressive speedups compared to CPUs (for all missing features, see the Numba documentation).
Numba is not the only option, however. CuPy offers both high level functions which rely on CUDA under the hood, low-level CUDA support for integrating kernels written in C, and JIT-able Python functions (similar to Numba). PyCUDA provides even more fine-grained control of the CUDA API. More recently, Nvidia released the official CUDA Python, which will surely enrich the ecosystem. All of these projects can pass device arrays to each other, you are not locked into using only one.
The goal of this series is to provide a learning platform for common CUDA patterns through examples written in Numba CUDA. What this series is not, is a comprehensive guide to either CUDA or Numba. The reader may refer to their respective documentations for that. The structure of this tutorial is inspired by the book CUDA by Example: An Introduction to General-Purpose GPU Programming by Jason Sanders and Edward Kandrot. If you eventually grow out of Python and want to code in C, it is an excellent resource.
We will learn how to run our first Numba CUDA kernel. We will also learn how to use CUDA efficiently for embarrassingly parallel tasks, that is, tasks which are completely independent from each other. Finally, we will learn how to time our kernel runtimes from the CPU.
The biggest advantage GPUs have over CPUs is their ability to execute the same instructions in parallel. A single CPU core will run instructions serially, one after the other. Parallelizing over a CPU requires the use of its multiple cores (physical or virtual) at the same time. A standard modern computer has 4–8 cores. On the other hand, modern GPUs have hundreds if not thousands of compute cores. See Figure 1 for a comparison between these two. GPU cores are generally slower and can only execute simple instructions, but their sheer number usually makes up for those shortcomings manyfold. The caveat is that in order for GPUs to have an edge of CPUs, the algorithms they run must be parallelizable.
I believe there are four main aspects to grokking GPU programming. The first I already mentioned: understanding how to think and design algorithms that are parallel by nature. This can be hard both because some algorithms are designed serially, but also because there can be many ways of parallelizing the same algorithm.
The second aspect is learning how to map structures which sit on the host such as vectors and images, onto GPU constructs such as threads and blocks. Recurring patterns and helper functions can aid us in this, but at the end of the day, experimentation will be important to get the most out of your GPU.
The third is comprehending the asynchronous execution model that drives GPU programming. Not only GPUs and CPUs execute instructions independently from each other, GPUs have streams which allow multiple processing streams to run in the same GPU. This asynchronicity is important when designing optimal processing flows.
The fourth and final aspect is the relation between abstract concepts and concrete code: this is achieved by learning the API and its nuances.
As you read this first chapter, try to identify these concepts in the following examples!
We will start by setting up our environment: a Numba version higher than 0.55 and a supported GPU.
The main workhorse of Numba CUDA is the
cuda.jit decorator. It is used to define functions which will run in the GPU.
We’ll start by defining a simple function, which takes two numbers and stores them on the first element of the third argument. Our first lesson is that kernels (GPU functions that launch threads) cannot return values. We get around that by passing inputs and outputs. This is a common pattern in C, but not very common in Python.
As you may have noticed, before we call the kernel, we need to allocate an array on the device. In addition, if we want to display the returned value, we need to copy it back to the CPU. You might be asking yourself why we chose to allocate a
float32 (single-precision float). This is because, while supported in most modern GPUs, double precision arithmetic can take 4x or longer than single precision arithmetic. So it’s better to get used to using
np.complex64 instead of
Whereas the kernel definition looks similar to a CPU function, the kernel call is a little different. In particular, it has square brackets before the arguments:
add_scalars[1, 1](2.0, 7.0, dev_c)
These square brackets refer to the number of blocks in a grid, and the number of threads in a block, respectively. Let’s talk a little bit more about what these mean as we learn to parallelize with CUDA.
The anatomy of a CUDA grid
When a kernel is launched it has a grid associated with it. A grid is composed of blocks; a block is composed of threads. Figure 2 shows a one dimensional CUDA grid. The grid in the figure has 4 blocks. The number of blocks in a grid is held in a special variable which can be accessed inside the kernel called
.x is refers to the first dimensional of the grid (the only one in this case). Two dimensional grids also have
.y and three dimensional grids,
.z variables. As of 2022, there are no 4-dimensional grids or higher. Also inside the kernel, you can find out which block is being executed through the use of
blockIdx.x, which in this case will run from 0 to 3.
Each block has a certain number of threads, held in the variable
blockDim.x. Thread indices are held in the variable
threadIdx.x, which in this example will run from 0 to 7.
Importantly, threads in different blocks are scheduled to run differently, have access to different memory regions, and differ in some other ways (see CUDA Refresher: The CUDA Programming Model for a brief discussion). For now, we will skip these details.
When we launched the kernel in our first example with parameters
[1, 1], we told CUDA to run one block with one thread. Passing several blocks with several threads, will run the kernel many times. Manipulating
blockIdx.x will allow us to uniquely identify each thread.
Instead of summing two numbers, let’s try to sum two arrays. Suppose the arrays each have 20 elements. Like in the figure above, we can launch a kernel with 8 threads per block. If we want each thread to handle only one array element, we will then need at least 4 blocks. Launching 4 blocks, 8 threads each, our grid will then launch 32 threads.
Now we need to figure out how to map the thread indices to the array indices.
threadIdx.x runs from 0 to 7, so on their own they are not able to index our array. In addition, different blocks have the same
threadIdx.x. On the other hand, they have different
blockIdx.x. To obtain a unique index for each thread, we can combine these variables:
i = threadIdx.x + blockDim.x * blockIdx.x
For the first block,
blockIdx.x = 0 and
i will run from 0 to 7. For the second block,
blockIdx.x = 1. Since
blockDim.x = 8,
i will run from 8 to 15. Similarly, for
blockIdx.x = 2,
i will run from 16 to 23. In the fourth and final block,
i will run from 24 to 31. See Table 1 below.
We solved one problem: how to map each thread to each element in the array… but now we have an issue where some threads would overflow the array, since the array has 20 elements and
i goes up to 32-1. The solution is simple: for those threads, don’t do anything!
Let’s see the code.
In newer versions of Numba, we get a warning noting that we called the kernel with host arrays. Ideally, we want to avoid moving data around from host to device, as this is very slow. We should be calling the kernel with device arrays in all arguments. We can do that by moving the array from host to device beforehand:
dev_a = cuda.to_device(a)dev_b = cuda.to_device(b)
Moreover, the calculation of unique indices per thread can get old quickly. Thankfully Numba provides the very simple wrapper
cuda.grid which is called with the grid dimension as the only argument. The new kernel will look like this:
What happens when we change the size of the array? One easy way out is to simply change the grid parameters (number of blocks and threads per block) in order to launch at least as many threads as there are elements in the array.
There is some science and some art to setting these parameters. For the “science”, we will say that (a) they should be a multiple of two, typically between 32 and 1024, and (b) they should be chosen so as to maximize occupancy (how many threads are active at the same time). Nvidia provides a spreadsheet that can help calculating these. For the “art”, nothing can predict the behavior of your kernels, so if you truly want to optimize these parameters, you need to profile your code with typical inputs. In practice, a “reasonable” number of threads for modern GPUs is 256.
Before we move on from summing vectors, we need to talk about hardware limits. GPUs cannot run an arbitrary number of threads and blocks. Typically each block cannot have more than 1024 threads, and a grid cannot have more than 2¹⁶ − 1 = 65535 blocks. This is not to say that you can launch 1024 × 65535 threads… there are limits to the number of threads that can be launched based on how much memory their registers occupy, among other considerations. Moreover, one must be wary of trying to process large arrays which do not fit in the GPU RAM all at once. In these cases, one may benefit from processing the arrays piecewise, either using a single GPU or multiple GPUs.
INFO: In Python, hardware limits can be obtained through Nvidia’s
cuda-python library through the function
cuDeviceGetAttribute in their documentation. See the Appendix at the end of this section for an example.
In cases where the number of blocks per grid exceeds the hardware limit but the array fits in memory, instead of using one thread per array element, we can use one thread to process several elements. We will do so by using a technique called grid-stride loops. Besides overcoming hardware limitations, grid-stride loop kernels benefit from reusing threads, by minimizing thread creation/destruction overhead. Mark Harris’ blog post CUDA Pro Tip: Write Flexible Kernels with Grid-Stride Loops goes into detail about some of the benefits of grid-strided loops.
The idea behind this technique is to add a loop inside of the CUDA kernel to process multiple input elements. The stride of this loop, as the name implies, is equal to the number of threads in a grid. This way, if the total number of threads in the grid (
threads_per_grid = blockDim.x * gridDim.x) is smaller than the number of elements of the array, as soon as the kernel is done processing the index
cuda.grid(1) it will process the index
cuda.grid(1) + threads_per_grid and so on until all array elements have been processed. Without further ado, let’s look at the code.
This code is very similar to the above, with the different that we are starting at
cuda.grid(1), but executing more samples, one every
threads_per_grid until we hit the end of the array.
Now, which one of these kernels is faster?
GPU programming is all about speed. Therefore it is important to measure code execution accurately.
CUDA kernels are device functions that are launched by the host (CPU), but of course they are executed on the GPU. The GPU and the CPU don’t communicate unless we tell them to. So when the GPU kernel is launched, the CPU will simply continue running instructions, be they launching more kernels or executing other CPU functions. If we place a
time.time() call before and after kernel launch, we will be timing only how long it takes for the kernel to launch, not to run.
One function we can use to ensure that the GPU has “caught up” is the
cuda.synchronize(). Calling this function will stop the host from executing any other code until the GPU finishes execution of every kernel that has been launched in it.
To time a kernel execution, we can then simply time how long it takes for the kernel to run and then synchronize. There are two caveats to this. First, we need to use
time.time() does not count the time that the host is sleeping waiting for the GPU to finish execution. The second caveat is that timing code from the host is not ideal as there are overheads related to this. Later, we will explain how one can use CUDA events to time kernels from the device. Mark Harris has another excellent blog post about this topic entitled How to Implement Performance Metrics in CUDA C/C++.
When using Numba, we have one detail we must pay attention to. Numba is a Just-In-Time compiler, meaning that the functions are only compiled when they are called. Therefore timing the first call of the function will also time the compilation step which is in general much slower. We must remember to always compile the code first by launching the kernel and then synchronizing it to ensure that nothing is left to run in the GPU. This ensures that the next kernel runs immediately without compilation. Also note that the
dtype of the array should be the same, as Numba compiles a unique function for each combination of argument
For simple kernels, we can also measure the throughout of the algorithm, which equals the number of floating point operations per second. It is usually measured in GFLOP/s (giga-FLOP per second). Our adding operation contains only one FLOP: addition. As such, the throughput is given by:
To end this tutorial, let’s craft a 2D kernel to apply logarithmic correction to an image.
Given an image I(x, y) with values between 0 and 1, the log-corrected image is given by
Iᵪ(x, y) = γ log₂ (1 + I(x, y))
First let’s grab some data!
As you can see, the data is really saturated at the lower end. There are almost no values above 0.6.
Let’s write the kernel.
Let us make note of the two
for loops. Notice that that the first
for loop starts at
iy and the second, innermost loop starts at
ix. We could have easily chosen
i0 to start at
i1 to start at
iy instead, which would even feel more natural. So why did we choose this order? It turns out that the memory access pattern for the first choice is more efficient. Since the first grid index is the fastest one, we want to make it match our fastest dimension: the last one.
If you don’t want to take my word for it (and you shouldn’t!) you have now learned how to time kernel executions, and you can try out the two versions. For small arrays such as the one using here, the difference is negligible, but for larger arrays (say 10,000 by 10,000), I have measured a speedup of about 10%. Not super impressive, but if I could give you a 10% improvement with a single swapping of variables, who wouldn’t take it?
And that’s it! We can now see more details in the corrected image.
As an exercise, try timing different launches with different grids to find the optimal grid size for your machine.
In this tutorial you learned the basics of Numba CUDA. You learned how to create simple CUDA kernels, and move memory to GPU to use them. You also learned how to iterate over 1D and 2D arrays using a technique called grid-stride loops.
For fine-grained control over the exact attributes of your GPU, you can rely on the lower-level, official CUDA Python package provided by Nvidia.