
Initial CUDA Performance Surprises

I am somehow very late to learning CUDA. I didn’t even know until recently that CUDA is just C++ with a small amount of extra stuff. If I had known that there is so little friction to learning it, I would have checked it out much earlier. But if you come in with C++ habits, you’ll write suboptimal code, so here are some lessons I had to learn to get things to run fast.

Memory Coalescing

If you have multiple threads operating on an array in C++, you probably want to iterate like this:

std::vector<T> vec = ...;
size_t per_thread = vec.size() / num_threads;
T * my_slice = vec.data() + per_thread * my_thread_i;
for (size_t i = 0; i < per_thread; ++i) {
    do_something(my_slice[i]);
}

Meaning each thread iterates over a contiguous chunk of memory. In CUDA this is going to be slow, because you want the threads to issue their loads together so that the hardware can combine them into fewer memory transactions. So if thread 0 loads bytes 0 to 15, then you want thread 1 to load bytes 16 to 31 and thread 2 to load bytes 32 to 47, and so on. So the loop instead has to look like this:

T * data = ...;
size_t num_elements = ...;
// adjacent threads touch adjacent elements, so the loads can be coalesced
for (size_t i = my_thread_i; i < num_elements; i += num_threads) {
    do_something(data[i]);
}

This is called “memory coalescing”: adjacent threads use adjacent memory. On a loop with a small body (a dot product) this is 3x faster.
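To make that concrete, here is the second snippet as a full kernel; this is the standard grid-stride loop, where my_thread_i and num_threads come from the thread and block indices (T and do_something are placeholders, as in the snippets above):

template <typename T>
__global__ void do_something_coalesced(T * data, size_t num_elements) {
    // global index of this thread, and total number of threads in the launch
    size_t my_thread_i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t num_threads = (size_t)gridDim.x * blockDim.x;
    for (size_t i = my_thread_i; i < num_elements; i += num_threads) {
        do_something(data[i]); // neighboring threads access neighboring elements
    }
}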

Most of the Performance is now in Specialized Hardware

Many years ago Sean Parent presented a graph that breaks down where the performance is in a modern PC. I’m reproducing it with current numbers here:

[Figure: breakdown of theoretical FLOPS in a PC with a Ryzen 9950X and an RTX 4090]

What we see here is the breakdown of theoretical performance in a PC with a Ryzen 9950X and an RTX 4090. The overall theoretical performance is ~95 TFLOPS. These are theoretical numbers, so for example the single-threaded CPU performance is just “5.7 GHz * 4 instructions per cycle = 22.8 GFLOPS”. That’s the blue line that you can’t see because it’s such a tiny fraction. If you use all 32 threads and AVX-512 you can multiply that performance by 32*16 = 512 to fill up the red and yellow parts of the graph. But if you really want performance, you need to use the GPU, which gives you the green part of the graph.
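Spelled out, the arithmetic behind the graph, using only the numbers above (the GPU figure is simply what remains of the ~95 TFLOPS total), is roughly:

\begin{align*}
\text{1 CPU thread:} &\quad 5.7\ \text{GHz} \times 4\ \text{FLOP/cycle} \approx 22.8\ \text{GFLOPS} \\
\text{32 threads + AVX-512:} &\quad 22.8\ \text{GFLOPS} \times 32 \times 16 \approx 11.7\ \text{TFLOPS} \\
\text{GPU (CUDA cores):} &\quad 95\ \text{TFLOPS} - 11.7\ \text{TFLOPS} \approx 83\ \text{TFLOPS}
\end{align*}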

But while these are current numbers, the graph is still missing most of the GPU’s performance. The GPU now has specialized hardware for machine learning and for raytracing. When you add those, you get the current picture:

[Figure: the same breakdown with the GPU’s tensor cores and raytracing hardware added]

This is the same graph plus the specialized hardware. For the tensor cores I chose the TFLOPS for BF16 matrix multiplies. That makes it not an entirely fair comparison, because they operate on lower-precision inputs (the output is in 32 bits, though), but everyone uses them for matrix multiplies and considers that fine.
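You normally reach the tensor cores through libraries like cuBLAS or cuDNN, but to make it concrete, here is a minimal hand-written sketch using the WMMA intrinsics from <mma.h>. It is only a sketch under some assumptions: an Ampere-or-newer GPU (BF16 WMMA needs sm_80+), a single 16x16x16 tile with row-major A and column-major B, and all 32 threads of one warp executing the kernel together; the kernel name is made up.

#include <mma.h>
#include <cuda_bf16.h>
using namespace nvcuda;

// one warp multiplies a 16x16 BF16 tile by a 16x16 BF16 tile,
// accumulating into 32-bit floats on the tensor cores
__global__ void tiny_bf16_matmul(const __nv_bfloat16 * a, const __nv_bfloat16 * b, float * c) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __nv_bfloat16, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __nv_bfloat16, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);              // start the accumulator at zero
    wmma::load_matrix_sync(a_frag, a, 16);          // the whole warp cooperates in these calls
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag); // one matrix multiply-accumulate
    wmma::store_matrix_sync(c, c_frag, 16, wmma::mem_row_major);
}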

The point is that now most of the performance in your PC is in specialized chips. If you’re just writing straightforward CUDA code, you’re leaving most of the performance on the table. The graph gets even more lopsided when looking at a deep learning GPU like the H100:

[Figure: the same breakdown for an H100]

Note how the x-axis now goes above 2000 TFLOPS. If you’re not using tensor cores, the GPU is sitting >90% idle. This is changing the algorithms that are used in deep learning. If algorithm A can get good results simply by doing bigger matrix multiplications, and algorithm B achieves slightly better results by cleverly doing lots of little pieces of work, people will choose algorithm A.

Different Kinds of Memory

Memory is more complicated in CUDA, but with my limited understanding so far I think of CUDA as having three different types of memory:

  1. Normal memory (global memory)
  2. Shared memory (faster)
  3. Registers (fastest)

Registers are particularly weird. One thread block has 65536 registers (each 32 bits wide), meaning you can keep 256 KB of data in registers, which is more than you can store in shared memory. I was trying to understand how some cuDNN kernel could possibly be as fast as it was, when I realized that they keep a particular matrix entirely in registers, with each thread holding a small part of the matrix.

You get some control over how many registers you have. You can have up to 1024 threads per thread block, meaning you get 64 registers per thread by default. But you could launch fewer threads and get proportionally more registers per thread. If you need, say, 150 registers per thread because you want to cache some data, 65536/150 tells you that you can use at most 436 threads (in practice you’d round down to a multiple of the warp size, so 416).

But you’re still just writing in C++, which doesn’t make it easy to say “keep this data in registers.” The best way I found is to keep a fixed-size array on the stack and then use “#pragma unroll” on every single loop that uses that array. The loops need to be unrolled because every element of the array has to be indexed with a compile-time constant before the compiler will put it in its own register; a dynamic index would force the array out of registers.
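Here is a minimal sketch of that pattern (the kernel and what it accumulates are made up for illustration): each thread keeps 32 floats in registers. It assumes a single block of 256 threads and that n is a multiple of 32*256; __launch_bounds__(256) tells the compiler it only has to accommodate 256 threads per block, so it can spend more registers per thread.

// each of the 256 threads keeps 32 running sums in registers
// (32 registers just for acc, comfortably within the 65536-register budget)
__global__ void __launch_bounds__(256) accumulate_in_registers(
        const float * input, float * partial_sums, int n) {
    float acc[32];
    #pragma unroll
    for (int i = 0; i < 32; ++i)
        acc[i] = 0.0f;

    for (int base = 0; base < n; base += 32 * 256) {
        #pragma unroll
        for (int i = 0; i < 32; ++i)
            acc[i] += input[base + i * 256 + threadIdx.x]; // coalesced loads
    }

    // write out 32 * 256 partial sums, one per register per thread
    #pragma unroll
    for (int i = 0; i < 32; ++i)
        partial_sums[i * 256 + threadIdx.x] = acc[i];
}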

Shared memory was straightforward in comparison. It allows you to dedicate some cache space to a specific purpose, and the data in it is shared between the threads of a block. So you can use it for two purposes:

  1. To communicate between threads
  2. To load data more quickly: If you want to load 512 floats and you have 512 threads, every thread can load one float into shared memory. So you don’t even have to loop.
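
As a sketch of the second use (the kernel and the tile size are made up): 512 threads each load one float into shared memory, synchronize once, and afterwards every thread can cheaply read any element of the tile, for example its neighbor’s value.

// assumes the kernel is launched with 512 threads per block
__global__ void average_with_left_neighbor(const float * input, float * output) {
    __shared__ float tile[512];            // one tile per thread block
    int i = threadIdx.x;
    tile[i] = input[blockIdx.x * 512 + i]; // each thread loads exactly one float
    __syncthreads();                       // wait until the whole tile is loaded
    // now any thread can read any element of the tile:
    float left = tile[(i + 511) % 512];
    output[blockIdx.x * 512 + i] = 0.5f * (left + tile[i]);
}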

Sharing is ~Free Within a Warp

This one was a delight when I saw code doing this for the first time: A warp is 32 threads that share one instruction pointer. They all do the same thing at the same time. So if you parallelize, say, a dot product, the 32 threads of a warp can sum their results to get the overall result in five steps, using a parallel sum algorithm:

[Figure: parallel sum across the 32 threads of a warp in five steps]

On a CPU this algorithm is impractical because the overhead of keeping the threads in sync is too high. But on a GPU they just are in sync, so sharing is literally five steps:

__device__ float add_warp(float x) {
    static constexpr const unsigned all = 0xffffffff; // all 32 lanes participate
    // butterfly reduction: each step exchanges with the lane whose index
    // differs in one bit, combining partial sums until every lane has the total
    x += __shfl_xor_sync(all, x, 1);
    x += __shfl_xor_sync(all, x, 2);
    x += __shfl_xor_sync(all, x, 4);
    x += __shfl_xor_sync(all, x, 8);
    x += __shfl_xor_sync(all, x, 16);
    return x; // every lane now holds the sum of all 32 inputs
}

I verified that this compiles down to two instructions per line: five SHFL.BFLY instructions for the shuffles plus five FADD instructions for the additions. There are no secret locks or barriers here.

This only works within a warp (32 threads). For a thread block, up to 1024 threads, you can use shared memory, which requires barriers because the threads won’t automatically be in sync. If you need more threads than that and want to share data between them, don’t. (You’ll often want many more threads, you just can’t share data between them. You have to write the result out to global memory and then launch a new kernel that works on that data.)
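
For completeness, here’s a sketch of the block-level version, building on add_warp() above (assuming blockDim.x is a multiple of 32 and at most 1024): the per-warp sums go through shared memory, which is where the explicit barrier comes in.

__device__ float add_block(float x) {
    __shared__ float warp_sums[32];  // at most 1024 / 32 = 32 warps per block
    x = add_warp(x);                 // free within a warp
    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;
    if (lane == 0)
        warp_sums[warp] = x;         // one result per warp
    __syncthreads();                 // shared memory needs an explicit barrier
    int num_warps = blockDim.x / 32;
    x = threadIdx.x < num_warps ? warp_sums[threadIdx.x] : 0.0f;
    if (warp == 0)
        x = add_warp(x);             // the first warp sums the per-warp results
    return x;                        // threads of the first warp hold the block total
}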

Parallelism First

My intuition for how many threads to use was wrong by a lot. If you’re iterating over some data and have to do several non-trivial things to it, it’s probably best to launch one thread for each of the things you want to do. It’s tempting to say “this thread already loaded all the relevant data, it can just do a bit of extra work” but in CUDA it’s better to launch a separate thread for that extra work, even if they both have to load the same data. It’s much cheaper for them to synchronize and share their data than it would be on a CPU.

When I ran Nsight Compute on the first couple versions of my code, the feedback that came back could always be summarized as “you’re barely using the GPU, make it more parallel.”

This also means that you often want to pull your algorithm apart. If there is one part that can run massively parallel (across tens of thousands of threads) and one part that has limited parallelism (say only a few hundred threads), then it’s probably worth launching those as separate kernels, so that at least part of your problem benefits from the massive parallelism, even if it’s only a small part.
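As a toy sketch of that split (the kernels and the work they do are made up): squaring numbers parallelizes across one thread per element, while the final sum has limited parallelism, so it goes into its own small launch that reads the intermediate result from global memory.

// massively parallel part: one thread per element
__global__ void square_all(const float * in, float * intermediate, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        intermediate[i] = in[i] * in[i];
}

// limited-parallelism part: a single block of 256 threads sums everything
__global__ void sum_all(const float * intermediate, float * out, int n) {
    float x = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        x += intermediate[i];   // coalesced, as in the loop at the top
    atomicAdd(out, x);          // fine for one small block; assumes *out starts at 0
}

void square_then_sum(const float * in, float * intermediate, float * out, int n) {
    // tens of thousands of threads for the part that scales
    square_all<<<(n + 255) / 256, 256>>>(in, intermediate, n);
    // one small block for the part that doesn't; it only starts after the
    // first kernel finishes because both run on the default stream
    sum_all<<<1, 256>>>(intermediate, out, n);
}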

So whenever you try to solve a problem, the first question should not be “how can I make this fast?” but “how can I run this in parallel?” After you solve that, worry about making the parallel code fast.

Conclusion

Writing CUDA definitely has a different feeling. It feels more puzzly because it’s so easy to accidentally use only 1% of your GPU. It actually reminds me of TIS-100, especially the trick of distributing data across the registers of multiple threads. But instead of managing a small number of chips you have to figure out how to generate work for tens of thousands of threads. My mental model is that you’ve got a bunch of container ships that can travel at 10% of the speed of light, and you’re using them to ship goods around the world. They’re so fast that most of the work is in setting up your harbors so that you can load and unload each ship in a fraction of a second and send it off to the next job. It’s not easy to feed these beasts, but if you do it right you can do huge chunks of work in almost no time.

