
Initial CUDA Performance Surprises

I am somehow very late to learning CUDA. I didn’t even know until recently that CUDA is just C++ with a small amount of extra stuff. If I had known that there is so little friction to learning it, I would have checked it out much earlier. But if you come in with C++ habits, you’ll write suboptimal code, so here are some lessons I had to learn to get things to run fast.

Memory Coalescing

If you have multiple threads operating on an array in C++, you probably want to iterate like this:

std::vector<T> vec = ...;
size_t per_thread = vec.size() / num_threads;
T * my_slice = vec.data() + per_thread * my_thread_i;
for (size_t i = 0; i < per_thread; ++i) {
    do_something(my_slice[i]);
}

Meaning each thread iterates over a contiguous chunk of memory. In CUDA this is going to be slow, because you want the threads to issue their loads together so that the hardware can combine them into fewer memory transactions. So if thread 0 loads bytes 0 to 15, then you want thread 1 to load bytes 16 to 31 and thread 2 to load bytes 32 to 47, and so on. So the loop instead has to look like this:

T * data = ...;
size_t num_elements = ...;
// adjacent threads touch adjacent elements, so the loads can be coalesced
for (size_t i = my_thread_i; i < num_elements; i += num_threads) {
    do_something(data[i]);
}

This is called “memory coalescing”: adjacent threads use adjacent memory. On a loop with a small body (a dot product) this is 3x faster.
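To make that concrete, here is the second snippet as a full kernel; this is the standard grid-stride loop, where my_thread_i and num_threads come from the thread and block indices (T and do_something are placeholders, as in the snippets above):

template <typename T>
__global__ void do_something_coalesced(T * data, size_t num_elements) {
    // global index of this thread, and total number of threads in the launch
    size_t my_thread_i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t num_threads = (size_t)gridDim.x * blockDim.x;
    for (size_t i = my_thread_i; i < num_elements; i += num_threads) {
        do_something(data[i]); // neighboring threads access neighboring elements
    }
}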

Most of the Performance is now in Specialized Hardware

Many years ago Sean Parent presented a graph that breaks down where the performance is in a modern PC. I’m reproducing it with current numbers here:

[Figure: breakdown of theoretical FLOPS in a PC with a Ryzen 9950X and an RTX 4090]

What we see here is the breakdown of theoretical performance in a PC with a Ryzen 9950X and an RTX 4090. The overall theoretical performance is ~95 TFLOPS. These are theoretical numbers, so for example the single-threaded CPU performance is just “5.7 GHz * 4 instructions per cycle = 22.8 GFLOPS”. That’s the blue line that you can’t see because it’s such a tiny fraction. If you use all 32 threads and AVX-512 you can multiply that performance by 32*16 = 512 to fill up the red and yellow parts of the graph. But if you really want performance, you need to use the GPU, which gives you the green part of the graph.
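Spelled out, the arithmetic behind the graph, using only the numbers above (the GPU figure is simply what remains of the ~95 TFLOPS total), is roughly:

\begin{align*}
\text{1 CPU thread:} &\quad 5.7\ \text{GHz} \times 4\ \text{FLOP/cycle} \approx 22.8\ \text{GFLOPS} \\
\text{32 threads + AVX-512:} &\quad 22.8\ \text{GFLOPS} \times 32 \times 16 \approx 11.7\ \text{TFLOPS} \\
\text{GPU (CUDA cores):} &\quad 95\ \text{TFLOPS} - 11.7\ \text{TFLOPS} \approx 83\ \text{TFLOPS}
\end{align*}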

But while these are current numbers, the graph is still missing most of the GPU’s performance. The GPU now has specialized hardware for machine learning and for raytracing. When you add those, you get the current picture:

[Figure: the same breakdown with the GPU’s tensor cores and raytracing hardware added]

This is the same graph plus the specialized hardware. For the tensor cores I chose the TFLOPS for BF16 matrix multiplies. That makes it not an entirely fair comparison, because they operate on lower-precision inputs (the output is in 32 bits, though), but everyone uses them for matrix multiplies and considers that fine.
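You normally reach the tensor cores through libraries like cuBLAS or cuDNN, but to make it concrete, here is a minimal hand-written sketch using the WMMA intrinsics from <mma.h>. It is only a sketch under some assumptions: an Ampere-or-newer GPU (BF16 WMMA needs sm_80+), a single 16x16x16 tile with row-major A and column-major B, and all 32 threads of one warp executing the kernel together; the kernel name is made up.

#include <mma.h>
#include <cuda_bf16.h>
using namespace nvcuda;

// one warp multiplies a 16x16 BF16 tile by a 16x16 BF16 tile,
// accumulating into 32-bit floats on the tensor cores
__global__ void tiny_bf16_matmul(const __nv_bfloat16 * a, const __nv_bfloat16 * b, float * c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __nv_bfloat16, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __nv_bfloat16, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);              // start the accumulator at zero
    wmma::load_matrix_sync(a_frag, a, 16);          // the whole warp cooperates in these calls
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag); // one matrix multiply-accumulate
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}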

The point is that now most of the performance in your PC is in specialized chips. If you’re just writing straightforward CUDA code, you’re leaving most of the performance on the table. The graph gets even more lopsided when looking at a deep learning GPU like the H100:

[Figure: the same breakdown for an H100]

Note how the x-axis now goes above 2000 TFLOPS. If you’re not using tensor cores, the GPU is sitting >90% idle. This is changing the algorithms that are used in deep learning. If algorithm A can get good results simply by doing bigger matrix multiplications, and algorithm B achieves slightly better results by cleverly doing lots of little pieces of work, people will choose algorithm A.

Different Kinds of Memory

Memory is more complicated in CUDA, but with my limited understanding so far I think of CUDA as having three different types of memory:

  1. Normal memory (global memory)
  2. Shared memory (faster)
  3. Registers (fastest)

Registers are particularly weird. One thread block has 65536 registers (each 32 bits wide), meaning you can keep 256 KB of data in registers, which is more than you can store in shared memory. I was trying to understand how some cuDNN kernel could possibly be as fast as it was, when I realized that they keep a particular matrix entirely in registers, with each thread holding a small part of the matrix.

You get some control over how many registers you have. You can have up to 1024 threads per thread block, meaning you get 64 registers per thread by default. But you could launch fewer threads and get proportionally more registers per thread. If you need, say, 150 registers per thread because you want to cache some data, 65536/150 tells you that you can use at most 436 threads (in practice you’d round down to a multiple of the warp size, so 416).

But you’re still just writing in C++, which doesn’t make it easy to say “keep this data in registers.” The best way I found is to keep a fixed-size array on the stack and then use “#pragma unroll” on every single loop that uses that array. The loops need to be unrolled because every element of the array has to be indexed with a compile-time constant before the compiler will put it in its own register; a dynamic index would force the array out of registers.
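Here is a minimal sketch of that pattern (the kernel and what it accumulates are made up for illustration): each thread keeps 32 floats in registers. It assumes a single block of 256 threads and that n is a multiple of 32*256; __launch_bounds__(256) tells the compiler it only has to accommodate 256 threads per block, so it can spend more registers per thread.

// each of the 256 threads keeps 32 running sums in registers
// (32 registers just for acc, comfortably within the 65536-register budget)
__global__ void __launch_bounds__(256) accumulate_in_registers(
        const float * input, float * partial_sums, int n) {
    float acc[32];
    #pragma unroll
    for (int i = 0; i < 32; ++i)
        acc[i] = 0.0f;

    for (int base = 0; base < n; base += 32 * 256) {
        #pragma unroll
        for (int i = 0; i < 32; ++i)
            acc[i] += input[base + i * 256 + threadIdx.x]; // coalesced loads
    }

    // write out 32 * 256 partial sums, one per register per thread
    #pragma unroll
    for (int i = 0; i < 32; ++i)
        partial_sums[i * 256 + threadIdx.x] = acc[i];
}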

Shared memory was straightforward in comparison. It allows you to dedicate some cache space to a specific purpose, and the data in it is shared between the threads of a block. So you can use it for two purposes:

  1. To communicate between threads
  2. To load data more quickly: If you want to load 512 floats and you have 512 threads, every thread can load one float into shared memory. So you don’t even have to loop.
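
As a sketch of the second use (the kernel and the tile size are made up): 512 threads each load one float into shared memory, synchronize once, and afterwards every thread can cheaply read any element of the tile, for example its neighbor’s value.

// assumes the kernel is launched with 512 threads per block
__global__ void average_with_left_neighbor(const float * input, float * output) {
    __shared__ float tile[512];            // one tile per thread block
    int i = threadIdx.x;
    tile[i] = input[blockIdx.x * 512 + i]; // each thread loads exactly one float
    __syncthreads();                       // wait until the whole tile is loaded
    // now any thread can read any element of the tile:
    float left = tile[(i + 511) % 512];
    output[blockIdx.x * 512 + i] = 0.5f * (left + tile[i]);
}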

Sharing is ~Free Within a Warp

This one was a delight when I saw code doing this for the first time: A warp is 32 threads that share one instruction pointer. They all do the same thing at the same time. So if you parallelize, say, a dot product, the 32 threads of a warp can sum their results to get the overall result in five steps, using a parallel sum algorithm:

[Figure: parallel sum across the 32 threads of a warp in five steps]

On a CPU this algorithm is impractical because the overhead of keeping the threads in sync is too high. But on a GPU they just are in sync, so sharing is literally five steps:

__device__ float add_warp(float x) {
    static constexpr const unsigned all = 0xffffffff; // all 32 lanes participate
    // butterfly reduction: each step exchanges with the lane whose index
    // differs in one bit, combining partial sums until every lane has the total
    x += __shfl_xor_sync(all, x, 1);
    x += __shfl_xor_sync(all, x, 2);
    x += __shfl_xor_sync(all, x, 4);
    x += __shfl_xor_sync(all, x, 8);
    x += __shfl_xor_sync(all, x, 16);
    return x; // every lane now holds the sum of all 32 inputs
}

I verified that this compiles down to two instructions per line: five SHFL.BFLY instructions for the shuffles plus five FADD instructions for the additions. There are no secret locks or barriers here.

This only works within a warp (32 threads). For a thread block, up to 1024 threads, you can use shared memory, which requires barriers because the threads won’t automatically be in sync. If you need more threads than that and want to share data between them, don’t. (You’ll often want many more threads, you just can’t share data between them. You have to write the result out to global memory and then launch a new kernel that works on that data.)
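
For completeness, here’s a sketch of the block-level version, building on add_warp() above (assuming blockDim.x is a multiple of 32 and at most 1024): the per-warp sums go through shared memory, which is where the explicit barrier comes in.

__device__ float add_block(float x) {
    __shared__ float warp_sums[32];  // at most 1024 / 32 = 32 warps per block
    x = add_warp(x);                 // free within a warp
    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;
    if (lane == 0)
        warp_sums[warp] = x;         // one result per warp
    __syncthreads();                 // shared memory needs an explicit barrier
    int num_warps = blockDim.x / 32;
    x = threadIdx.x < num_warps ? warp_sums[threadIdx.x] : 0.0f;
    if (warp == 0)
        x = add_warp(x);             // the first warp sums the per-warp results
    return x;                        // threads of the first warp hold the block total
}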

Parallelism First

My intuition for how many threads to use was wrong by a lot. If you’re iterating over some data and have to do several non-trivial things to it, it’s probably best to launch one thread for each of the things you want to do. It’s tempting to say “this thread already loaded all the relevant data, it can just do a bit of extra work” but in CUDA it’s better to launch a separate thread for that extra work, even if they both have to load the same data. It’s much cheaper for them to synchronize and share their data than it would be on a CPU.

When I ran Nsight Compute on the first couple versions of my code, the feedback that came back could always be summarized as “you’re barely using the GPU, make it more parallel.”

This also means that you often want to pull your algorithm apart. If there is one part that can run massively parallel (across tens of thousands of threads) and one part that has limited parallelism (say only a few hundred threads), then it’s probably worth launching those as separate kernels, so that at least part of your problem benefits from the massive parallelism, even if it’s only a small part.
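As a toy sketch of that split (the kernels and the work they do are made up): squaring numbers parallelizes across one thread per element, while the final sum has limited parallelism, so it goes into its own small launch that reads the intermediate result from global memory.

// massively parallel part: one thread per element
__global__ void square_all(const float * in, float * intermediate, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        intermediate[i] = in[i] * in[i];
}

// limited-parallelism part: a single block of 256 threads sums everything
__global__ void sum_all(const float * intermediate, float * out, int n) {
    float x = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        x += intermediate[i];   // coalesced, as in the loop at the top
    atomicAdd(out, x);          // fine for one small block; assumes *out starts at 0
}

void square_then_sum(const float * in, float * intermediate, float * out, int n) {
    // tens of thousands of threads for the part that scales
    square_all<<<(n + 255) / 256, 256>>>(in, intermediate, n);
    // one small block for the part that doesn't; it only starts after the
    // first kernel finishes because both run on the default stream
    sum_all<<<1, 256>>>(intermediate, out, n);
}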

So whenever you try to solve a problem, the first question should not be “how can I make this fast?” but “how can I run this in parallel?” After you solve that, worry about making the parallel code fast.

Conclusion

Writing CUDA definitely has a different feeling. It feels more puzzly because it’s so easy to accidentally use only 1% of your GPU. It actually reminds me of TIS-100, especially the trick of distributing data across the registers of multiple threads. But instead of managing a small number of chips you have to figure out how to generate work for tens of thousands of threads. My mental model is that you’ve got a bunch of container ships that can travel at 10% of the speed of light, and you’re using them to ship goods around the world. They’re so fast that most of the work is in setting up your harbors so that you can load and unload each ship in a fraction of a second and send it off to the next job. It’s not easy to feed these beasts, but if you do it right you can do huge chunks of work in almost no time.

