Bendabir

CUDA

Exploring CUDA, Threading and Async Python - Part 3

Previously, we discussed the impact of the GIL on CPU utilization, which is particularly relevant for pre-processing. We also looked at how batch size affects GPU utilization (and consequently FPS) in an ideal scenario. However, that was far from a real-world case. In practice, there’s usually a pre-processing phase handled by…
Ben · Nov 5, 2024
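As a rough illustration of the setup this teaser alludes to (not code from the article itself), here is a minimal sketch of CPU-bound pre-processing running in a thread pool while a batched forward pass runs on the GPU. The model, batch size and image shape are placeholders.

```python
# Minimal sketch (illustration only): thread-pool pre-processing feeding
# batched GPU inference. Model, batch size and image shape are assumptions.
from concurrent.futures import ThreadPoolExecutor

import torch

BATCH_SIZE = 32  # placeholder; the series explores how this affects FPS
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

model = torch.nn.Sequential(  # stand-in for a real vision model
    torch.nn.Conv2d(3, 8, 3, padding=1),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(8, 10),
).to(DEVICE).eval()


def preprocess(_: int) -> torch.Tensor:
    # Placeholder for CPU-heavy work (decode, resize, normalize, ...).
    # Pure-Python parts of this remain serialized by the GIL.
    return torch.rand(3, 224, 224)


with ThreadPoolExecutor(max_workers=4) as pool, torch.inference_mode():
    images = list(pool.map(preprocess, range(BATCH_SIZE)))
    batch = torch.stack(images).to(DEVICE, non_blocking=True)
    logits = model(batch)
    print(logits.shape)  # (BATCH_SIZE, 10)
```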

Efficient Connected Components Labeling for PyTorch

Some AI tasks require a post-processing step known as Connected Components Labeling (CCL). This can, for instance, be the case in certain Computer Vision tasks that involve segmentation (such as text or object detection). To the best of my knowledge, there is no library that allows this to be…
Ben · Oct 12, 2024
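For readers unfamiliar with the operation, here is a small illustration of what Connected Components Labeling does on a binary segmentation mask, using SciPy on the CPU. This is not the PyTorch library presented in the post, only a reference point for the operation itself.

```python
# Illustration only: CCL groups touching foreground pixels of a binary mask
# into distinct components (SciPy on CPU, not the library from the post).
import numpy as np
from scipy import ndimage

mask = np.array(
    [
        [1, 1, 0, 0, 0],
        [1, 0, 0, 1, 1],
        [0, 0, 0, 1, 1],
        [0, 1, 0, 0, 0],
    ],
    dtype=np.uint8,
)

labels, num_components = ndimage.label(mask)
print(num_components)  # 3 (with the default 4-connectivity)
print(labels)          # same shape as mask, one distinct integer per component
```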

Exploring CUDA, Threading and Async Python - Part 2

In my previous blog post, we discussed pre-processing, particularly through multithreading. Now, let’s try to understand what it means to put the GPU under pressure. First, we'll focus on "native" PyTorch. We might dive into model-level optimizations later (and what that means for execution). CUDA…
Ben · Sep 29, 2024
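As a point of reference (again, not code from the post), a naive way to put the GPU under pressure with native PyTorch is a timed loop of forward passes, keeping in mind that CUDA calls are asynchronous and must be synchronized before reading the clock.

```python
# Illustration only: measuring GPU throughput with native PyTorch.
# CUDA kernels are launched asynchronously, so synchronize before timing.
import time

import torch

assert torch.cuda.is_available()

model = torch.nn.Linear(4096, 4096).cuda().eval()  # placeholder workload
batch = torch.rand(256, 4096, device="cuda")

with torch.inference_mode():
    for _ in range(10):  # warm-up iterations
        model(batch)
    torch.cuda.synchronize()

    iterations = 100
    start = time.perf_counter()
    for _ in range(iterations):
        model(batch)
    torch.cuda.synchronize()  # wait for all queued kernels to finish
    elapsed = time.perf_counter() - start

print(f"{iterations * batch.shape[0] / elapsed:.1f} samples/s")
```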