Exploring CUDA, Threading and Async Python - Part 3
Previously, we discussed the impact of the GIL on CPU utilization, particularly relevant for pre-processing. We also looked at how batch size affects GPU utilization (and consequently FPS) in an ideal scenario. However, that was far from a real-world case. In practice, there’s usually a pre-processing phase handled by the CPU, with some parts potentially offloaded to the GPU (like normalization, if it makes sense). Regardless, the CPU needs to send data to the GPU while also providing it with instructions to execute.
Keep in mind, what I’m explaining here is purely for illustration purposes. The approach is somewhat impractical and definitely not meant for real execution.
Blocking Transfers
By default, CUDA transfers with PyTorch are synchronous. This means the GPU is dependent on the CPU for these data transfers. If we revisit our previous script and add these data transfers, we can expect degraded performance, with the GPU spending most of its time waiting for the data.
To be more representative of a real-world scenario, the data is randomly generated on the CPU (as a batch of images), and then normalized on the GPU. I'll keep using some ResNet models for my experiments, starting with a small ResNet18. Compared to the previous benchmark, we clearly see a performance drop, as expected. The additional data transfer between CPU and GPU adds overhead, causing the GPU to spend more time waiting for data, which significantly impacts overall performance.
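To make this concrete, here is a rough sketch of what such a blocking loop could look like. This is not my exact benchmark code: the batch size, iteration count and normalization constants are placeholders, and it assumes torchvision is available for the ResNet18 model.

```python
import torch
import torchvision

device = torch.device("cuda")
model = torchvision.models.resnet18().eval().to(device)

# Normalization constants kept on the GPU (illustrative values).
mean = torch.tensor([0.485, 0.456, 0.406], device=device).view(1, 3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225], device=device).view(1, 3, 1, 1)

with torch.no_grad():
    for _ in range(100):
        # "Pre-processing" on the CPU: a random batch of images in pageable memory.
        batch = torch.rand(64, 3, 224, 224)

        # Synchronous transfer: the CPU blocks until the copy completes.
        batch = batch.to(device)

        # Normalization and inference happen on the GPU.
        batch = (batch - mean) / std
        output = model(batch)

        # Bringing the result back to the CPU is another synchronization point.
        predictions = output.argmax(dim=1).cpu()
```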
Asynchronous Transfers
Fortunately, it's possible to make data transfers asynchronous in both directions (CPU to GPU and vice versa). However, setting this up is a bit more complex, as it requires the use of pinned memory.
I won’t go too deep into how an OS and memory work (mainly because I only have a surface-level understanding), but basically, the OS allocates virtual memory to each process. Each process behaves as if it's the only one running on the machine. The OS manages the mapping between virtual addresses and physical addresses. To handle memory more efficiently, it’s divided into pages (a few KB each). When virtual memory demand exceeds available physical memory, the OS offloads some pages to the disk (swap). As far as I know, it uses an LRU (Least Recently Used) approach to choose which pages to unload, though I imagine there are more sophisticated approaches nowadays. When a program needs a page that’s no longer in memory (a page fault), the OS reloads it, potentially replacing another unused page (well, ideally).
This is precisely where the problem lies for us. First off, disk access is slow as ****, even with a cutting-edge PCIe 5.0 SSD (let alone an old HDD). Plus, it requires OS intervention and therefore the CPU, which probably has better things to do. Pinned memory ensures that the memory cannot be paged out, allowing for faster and more efficient transfers between the RAM and GPU. This technique helps reduce the bottleneck caused by synchronous transfers, allowing the GPU to continue working while the CPU prepares and transfers the next batch of data. Here are some more details.
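As a minimal sketch, here is how pinned memory can be requested in PyTorch (the shapes here are purely illustrative):

```python
import torch

# Ask CUDA for page-locked (pinned) host memory directly at allocation time;
# the OS is then not allowed to swap these pages out.
batch = torch.empty(64, 3, 224, 224, pin_memory=True)

# An existing pageable tensor can also be copied into a pinned one.
pageable = torch.rand(64, 3, 224, 224)
pinned = pageable.pin_memory()  # returns a new, page-locked copy

# Only transfers from pinned memory can actually run asynchronously.
gpu_batch = pinned.to("cuda", non_blocking=True)
```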

Pinned memory allows the GPU to access host memory and perform data transfers without interacting with the CPU, thanks to a feature called Direct Memory Access (DMA). With DMA, the GPU can handle data transfers on its own, enabling it to start computations as soon as the data arrives, without needing to synchronize with the CPU. This drastically reduces idle time and improves overall performance by keeping the GPU busy even during data transfers. I suppose DMA has been supported by the motherboard on pretty much all recent systems (probably since 2010 onward), handling such transfers over the PCIe bus without involving the CPU, but I haven't checked.
While the asynchronous transfer from the CPU to the GPU is usually pretty straightforward to achieve, the asynchronous transfer from the GPU to the CPU is a bit trickier. It requires using a buffer on the CPU side, and this buffer must be kept in pinned memory. Otherwise, CUDA doesn't know where to send the data and falls back to regular pageable memory, which prevents asynchronous data transfers, as we just saw. Here is an example.
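Something along these lines (a sketch rather than the exact benchmark code; shapes and names are illustrative):

```python
import torch

device = torch.device("cuda")
result_gpu = torch.rand(64, 1000, device=device)

# Pre-allocated, page-locked CPU buffer, meant to be reused at every iteration.
result_cpu = torch.empty(64, 1000, pin_memory=True)

# Enqueue the device-to-host copy; the call returns immediately and the CPU
# is free to prepare the next batch in the meantime.
result_cpu.copy_(result_gpu, non_blocking=True)

# ... CPU work can happen here ...

# Before reading result_cpu, make sure the copy has actually finished.
torch.cuda.synchronize()
```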
We would then expect to see performance similar to the baseline previously established for the ResNet18 model.
Well, not quite! It's a bit better, sure, but it's not exactly a game-changer. I suspect the model is to blame: it's really small and executes very quickly. Even with asynchronous data transfers, the CPU struggles to keep up because it also has to pre-process the data (in this case, a simple random generation). One solution could be to use multiple threads for this task (GIL permitting, of course), but I'm feeling lazy. A simpler way to confirm this is to run our benchmarks with a slightly larger model (a ResNet50, for instance). I've also computed a baseline to assess the model's performance without data transfers.
And there we go! We can see that smaller batches deviate less from our baseline because the GPU computation cost is higher (giving the CPU more time to keep up with providing instructions to the GPU). The maximum batch size is also smaller due to the larger model (its computations use more VRAM).
It’s worth noting that with asynchronous transfers, performance is slightly below our baseline for smaller batches. This is likely because the random data generation on the CPU incurs a slight overhead (done on demand using a generator). Our script uses PyTorch’s default CUDA stream. While the CPU is freed during copy operations, the GPU still needs to wait for the data to arrive before it can start processing. A stream cannot handle data transfers if it is already performing computations. By alternating between multiple streams, some could handle data transfers while others handle computations, but we’ll explore that later as this is purely hypothetical.
You can read this excellent presentation to get more details on CUDA concurrency.
GPU Usage
One can observe data transfers with NVIDIA Nsight Systems. Here, I’m focusing on batch sizes of 256 to make the transfers clearly visible (more data to transfer) in both blocking and asynchronous cases for the two architectures studied (ResNet18 and ResNet50).
To identify the different stages of the model, I used NVTX to mark specific sections of code: random_generation, cpu_to_gpu, normalization, model, and gpu_to_cpu.
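For reference, here is a sketch of how such NVTX ranges can be emitted from PyTorch (only two of the sections are shown; the others follow the same pattern, and the batch shape is illustrative):

```python
import torch

# Mark the CPU-side random generation so it appears as a named range in Nsight Systems.
torch.cuda.nvtx.range_push("random_generation")
batch = torch.rand(256, 3, 224, 224, pin_memory=True)
torch.cuda.nvtx.range_pop()

# Same idea for the host-to-device transfer; normalization, model and
# gpu_to_cpu are wrapped the same way.
torch.cuda.nvtx.range_push("cpu_to_gpu")
batch_gpu = batch.to("cuda", non_blocking=True)
torch.cuda.nvtx.range_pop()
```

The trace itself is then captured with something like `nsys profile python benchmark.py` (the script name here is just a placeholder).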
Here, the blocking calls related to data transfers are quite apparent. The CPU waits for the GPU's results (which will be available at the end of the computation) before proceeding to the next iteration (and re-generating a batch of random data). When looking at a broader view of the execution, it becomes clear that the execution is "choppy."
We can compare these executions with their counterparts for ResNet50. In the blocking case, we observe the same "choppy" behavior that degrades performance.
In the asynchronous case, model execution and random data generation overlap. The CPU and GPU work in parallel. However, as suspected, the GPU occasionally pauses during data transfers because everything is executed on a single stream. With multiple streams, transfers and computations could occur simultaneously, potentially improving performance (or at least, that's my guess).
Conclusion
We’ve just explored the impact of data transfers between the CPU and GPU, and how they affect performance. We also touched on CUDA streams, which could help fully leverage the hardware (before any potential model optimizations). We’ll dive into that in the next part. Cheers!