Matrix multiplication; the holy grail of deep neural networks and modern language understanding behemoths. As MLEs or data scientists, our fingers are too quick to type `tf.matmul`

or `torch.matmul`

and we never look back. But don’t tell me you’ve never had the millisecond infatuation to know what might be happening to that matrix when it enters the GPU! If you did, you’re in the right place. Join me in a journey through the fascinating intricacies within a GPU.

I’ll explain to you how these compute powerhouses crunch up the numbers. You’ll learn three little-known impressive things GPUs do, when they come face-to-face with matrices. By the end of this blog post, you’ll have a good understanding of how matrix multiplication works inside GPUs.

GEMM or generalized matrix multiplication is the kernel that’s executed when GPUs perform matrix multiplication.

`C = a (A.B) + b C`

Here, `a`

and `b`

are scalars, `A`

is an `MxK`

matrix, `B`

is an `KxN`

matrix, and thus `C`

is an `MxN`

matrix. It’s easy as that! You might wonder why that trailing addition exists. Turns out this is a pretty common pattern for neural networks (e.g. adding bias, applying ReLU, adding residual connections).

If you’re asked to write a matrix multiplication algorithm from first principles, here’s what you’ll do (unless you’re gifted with a GPU in lieu of a brain — wouldn’t that save money for an MLE!).

`for (int i = 0; i < M; ++i)`

for (int j = 0; j < N; ++j)

for (int k = 0; k < K; ++k)

C[i][j] += A[i][k] * B[k][j];

Here’s an animated visual that shows you what this does.

But did you know GPUs despise this implementation 🤔? To understand why that’s the case, you need to understand the GPU memory architecture,

For all comparisons and specifications, I’ll be using the Nvidia A100 GPU specifications.

A GPU has three main memory levels,

- Global memory or HBM (what you typically refer to as GPU memory and what you see when you run
`nvidia-smi`

) - Shared memory (a local memory that is dedicated to a single streaming multiprocessor [or SM] and shared between threads running in that SM)
- Registers (individually allocated to threads to carry out their workload)

This is what it looks like,

The first thing to note is that shared memory (referred to as SRAM from now on) is way smaller than the HBM, let alone registers. So your matrix is not going to fit in there (in most occasions). If we go back to our animation, for a single row of `A`

all columns of `B`

needs to be retrieved, and repeat the process for all rows in `A`

. This means, the GPU needs to do many-many reads to compute the output. The HBM (~1.5TB/s) is several magnitudes slower than SRAM (~19TB/s).

To put that in numbers, say you want to multiply a `10x20`

and `20x30`

matrix, you need to read columns of `B`

`10x30=300`

times. Is there a better way we can do this?

Turns out a simple trick can go a long way here! Simply flip the order of the loops, so that `k`

becomes the outer most loop. And you’re done! 😮

`for (int k = 0; k < K; ++k) `

for (int i = 0; i < M; ++i)

for (int j = 0; j < N; ++j)

C[i][j] += A[i][k] * B[k][j];

We did not touch the actual computation, just the order of the loops, so we should get the same result as before. Here’s what the matrix multiplication looks like now!

You see, we only bring one *column* of `A`

and one *row* of `B`

at a time and never look back. This requires far less reads than the original implementation. The only difference is we were computing the *inner product* between two vectors before, now we’re computing the *outer product*.

But still, we need entire `C`

in SRAM, which might be too big to fit in SRAM. What does CUDA do then? That brings us to the second trick.

Not to worry! I’m not going to blast you with any complex mathematics or Leetcode algorithms. The main thing to keep in mind is, a matrix is a 2D layout of individual tiles. The following animation does justice to what I’m trying to explain.

The result of the green block 💚 is the light blue strip of A 💙 and the light yellow strip of B 💛. Taking this a step further, to compute the output, you can bring one block of that strip of A and one block from B’s strip at a time, compute the output and accumulate the result in the green box.

This gives us a flexible framework where we can load an arbitrary size block (or tile) of A and B and still compute the final answer. We don’t have to stop there, we can keep recursively dividing the problem to even smaller problems. i.e. the matrix is broken into tiles, tiles are broken into fragments, and fragments to individual values.

And this lends itself nicely to the process execution architecture in a GPU. There are three layers to a kernel execution in a GPU. For simplicity, we’ll say a SM runs a single thread block (although in practice they execute them concurrently, to reduce something known as the tail effect).

- Threads
- Warps (a collection of 32 threads)
- Thread blocks (a collection of several warps)

The exact number of threads in a thread block depends on a specific architecture. For example, an A100 has the following specifications.

- Maximum of 2048 threads per SM
- Maximum of 1024 threads per block
- Maximum of 32 thread blocks per SM

## Sidebar #2: Magic of the power of 2

Going back to the tiling, It has been found that (heuristically) a matrix tile of size `256x128`

per thread block gives reasonable efficiency for most problems. Therefore it’s a common tile size used by CUDA.

You might have heard about a best practice of keeping batch size, hidden dimension size as powers of two. This is where this comes from! When your matrix dimensions are of powers of two, it will be fully divisible to a set of tiles with no remainder. If not, it makes your code less efficient.

GPU computations are more efficient when your matrix dimensions are in the power of 2

What happens when it’s not a power of 2?

## Sidebar #2: Tile quantization

What happens is an effect known as *tile quantization*. In other words, if you have a tile row dimension of 128 but your matrix has 257 elements in a row, you’ll need not two, but three tiles in a row (i.e. 256+1). This is illustrated below.

Problem with this is that, the thread block does the same amount of computation regardless of the useful data residing in it. So, you’re taking the opportunity to do useful computation away from your GPU, leading to inefficiencies.

A similar effect is known as wave quantization, where the matrix is over-sized and the SMs collectively cannot fit it at once. Then the GPU needs to do the computation in 2 “waves”. However this is less of a concern for modern GPUs as they leverage concurrency to reduce wave quantization.

Tile quantization happens when a thread block has to spill data partially, wave quantization happens when SMs have to spill data.

The final trick is kernel fusion. More often than not, it is faster to do all the computations in one kernel than having two kernels called one after the other. Why? Because one kernel needs to write the data to HBM and other needs to read that back. We already talked about how slow this is. A better approach is just combine the two operations into one. Some examples are,

So as it is seen here (I’m sure Pytorch has a similar glossary), there are many fused kernels offered through TensorFlow that combines commodious operations in to a single kernel. In code, it means something like this,

`for (int m = 0; m < M; m += Mtile) `

for (int n = 0; n < N; n += Ntile)

for (int k = 0; k < K; ++k)

float tmp = 0

for (int i = 0; i < Mtile; ++i)

for (int j = 0; j < Ntile; ++j)

int row = m + i;

int col = n + j;

tmp += A[row][k] * B[k][col];

// Do other things

C[row][col] = tmp

In other words, we hold dearly to our `tmp`

variable until after we’ve finished all our computations. Then only we’ll write the result back to `C`

.

That’s it folks. I hope this was an enjoyable excursion through the weeds of a GPU. If you’re interested in the audio-visual version here’s the link to my YouTube video.