Best Practices for Multi-GPU Data Analysis Using RAPIDS with Dask

As computing infrastructure grows denser, with more compute, more GPUs, and accelerated networking, multi-GPU training and analysis is growing in popularity. Developers and practitioners moving from CPU to GPU clusters need both tools and best practices. RAPIDS is a suite of open-source, GPU-accelerated data science and AI libraries, and these libraries can scale out to larger and larger workloads with tools like Spark and Dask. This blog post provides a brief overview of RAPIDS and Dask, and highlights three best practices for multi-GPU data analysis.
When pushing GPUs for maximum performance, users often run into memory pressure and stability issues. While GPUs are computationally more powerful than CPUs, they typically have far less memory than the host system. As a result, GPU workloads are often executed in out-of-core scenarios, where GPU memory is smaller than the total amount of memory required to process the workload at once. Additionally, the CUDA ecosystem provides many types of memory serving a variety of purposes and applications.
What is Dask?
Dask is an open-source, highly flexible library for distributed Python. Dask can scale complex custom Python code, but more importantly, it can scale array and DataFrame workloads with Dask-Array and Dask-DataFrame. RAPIDS builds on Dask-DataFrame to scale GPU ETL and ML workloads, such as Dask-cuDF and cuML/XGBoost, with a pandas-like interface.
```python
import dask.dataframe as dd

df = dd.read_parquet("/my/parquet/dataset/")
agg = df.groupby("B").sum()
agg.compute()
```
Dask Best Practices for CPU and GPU
One of the many benefits of Dask is that users can target both CPU and GPU backends. However, developing two codebases, one for CPU and one for GPU, is time-consuming and difficult to maintain. Dask makes it easy to switch between CPU and GPU backends.
Similar to PyTorch’s mechanism for configuring the device:
```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
```
Dask can also configure the backend target:
```python
import dask

# GPU backends
dask.config.set({"array.backend": "cupy"})
dask.config.set({"dataframe.backend": "cudf"})

# CPU backends
dask.config.set({"array.backend": "numpy"})
dask.config.set({"dataframe.backend": "pandas"})

# Or configure with environment variables:
# DASK_DATAFRAME__BACKEND=cudf
```
Now we can write distributed Python analysis code in a hardware-agnostic fashion, without calling backend-specific I/O directives:
```python
import dask
import dask.dataframe as dd

# DataFrame example
with dask.config.set({"dataframe.backend": "cudf"}):
    ddf = dd.read_parquet("example.parquet")
    res = ddf.groupby("col").agg({"stats": ["sum", "min", "mean"]})
    res.compute()
```
With this simple CPU/GPU switch, Dask-RAPIDS users can develop for both devices without maintaining separate codebases. As a more complete example, the data curation framework in NeMo uses this mechanism under the hood to support both CPU and GPU deployments.
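The same configuration applies to Dask arrays. Below is a minimal sketch, assuming CuPy is installed; the array shape and chunk sizes are illustrative:

```python
import dask
import dask.array as da

# With the CuPy backend, array creation routines allocate chunks on the GPU
with dask.config.set({"array.backend": "cupy"}):
    x = da.ones((10_000, 10_000), chunks=(2_500, 2_500))  # CuPy-backed chunks
    res = (x + x.T).sum()
    res.compute()
```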
Cluster and memory configuration
Data workflows are both computationally and memory-intensive applications. Configuring and managing how a distributed application uses memory can mean the difference between a job finishing and failing. Using the "wrong" memory configuration can easily result in performance loss or, even worse, failures due to out-of-memory (OOM) errors. Picking the "correct" configuration for a distributed system is challenging: it often requires a high level of expertise and potentially time-consuming benchmarking. After many experiments exploring performance and memory allocation/fragmentation, we've found the following configuration to be a good starting point for a variety of tabular workloads, such as common ETL (filters, joins, aggregations, and so on) and deduplication (NeMo Curator):
```bash
dask cuda worker <SCHEDULER_IP> --rmm-async --rmm-pool-size <POOL_SIZE> --enable-cudf-spill
```
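The same settings can also be applied when launching a single-node cluster from Python. This is a minimal sketch, assuming a recent Dask-CUDA release that exposes these keyword arguments; the pool size is illustrative:

```python
from dask_cuda import LocalCUDACluster
from distributed import Client

# Equivalent worker configuration from Python
cluster = LocalCUDACluster(
    rmm_async=True,          # cudaMallocAsync-backed RMM allocator
    rmm_pool_size="24GB",    # preallocate a per-GPU memory pool (illustrative size)
    enable_cudf_spill=True,  # cuDF's internal device-to-host spilling
)
client = Client(cluster)
```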
Using the RMM options `rmm-async` and `rmm-pool-size` can significantly increase performance and stability.
`rmm-async` uses the underlying cudaMallocAsync memory allocator, which greatly reduces memory fragmentation at a minor to negligible performance cost. Memory fragmentation can easily lead to OOM errors; for a deep dive, read parts 1 and 2 of the CUDA Stream Ordered Memory Allocator blog series.
The `rmm-pool-size` parameter preallocates a memory pool on the GPU. Sub-allocations from this pool are significantly faster than raw CUDA allocations (direct cudaMalloc calls). This matters because GPU memory allocation and destruction (free) are relatively expensive, and a data workflow can make many thousands, if not millions, of alloc/free calls, which in aggregate degrade performance.
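Outside of Dask-CUDA, the same allocator choices can be made directly through RMM's Python API. The following is a hedged sketch, assuming `rmm.mr.CudaAsyncMemoryResource` and `rmm.reinitialize` are available in your RMM version; the pool size is illustrative:

```python
import rmm

# Option 1: stream-ordered allocator backed by cudaMallocAsync (like --rmm-async)
rmm.mr.set_current_device_resource(rmm.mr.CudaAsyncMemoryResource())

# Option 2: a preallocated memory pool (like --rmm-pool-size); size is illustrative
rmm.reinitialize(pool_allocator=True, initial_pool_size=24 * 2**30)  # 24 GiB
```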
Moving data from device to host, also known as spilling, isn't a single feature implemented in one place: spilling can be implemented at different levels, and it often comes at the expense of performance. Dask-CUDA and cuDF have several spilling mechanisms: `device-memory-limit`, `memory-limit`, `jit-unspill`, and `enable-cudf-spill`. The `enable-cudf-spill` option turns on cuDF's internal spilling-to-host capability: cuDF will move data (cuDF buffers) from device to host when objects or intermediates require more memory than is available on the GPU. We've found that for tabular workloads, `enable-cudf-spill` is often faster and more stable than the other Dask-CUDA options, including `--device-memory-limit`.
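cuDF spilling can also be enabled directly in a script, independent of the Dask-CUDA worker flag. A minimal sketch, assuming your cuDF version exposes the `spill` option (it can also be set with the `CUDF_SPILL=on` environment variable):

```python
import cudf

# Enable cuDF's built-in device-to-host spilling for this process
cudf.set_option("spill", True)

# Buffers created from here on may spill to host memory under pressure
df = cudf.read_parquet("example.parquet")
```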
Accelerated networking
As mentioned above, dense multi-GPU systems are architected with accelerated networking by using NVLink. With Grace Blackwell and the latest NVLink, we gain access to 900 GB/s of bidirectional bandwidth. Common ETL routines like joins and shuffles require moving data across multiple devices, and spilling between CPU and GPU is also critical for out-of-core algorithms. Using NVLink on these dense systems becomes paramount for achieving the highest performance. Dask can take advantage of this hardware when UCX is enabled:
```python
# Single-node cluster
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(protocol="ucx")
```

```bash
# CLI
dask-cuda-worker <SCHEDULER_IP> --protocol ucx
```
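Dask-CUDA also exposes explicit UCX transport flags. The sketch below is a hedged example, assuming the `enable_nvlink` keyword argument is available in your Dask-CUDA version; recent releases can also negotiate transports automatically:

```python
from dask_cuda import LocalCUDACluster
from distributed import Client

# UCX protocol with NVLink explicitly enabled for device-to-device transfers
cluster = LocalCUDACluster(protocol="ucx", enable_nvlink=True)
client = Client(cluster)
```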
Conclusion
By configuring Dask-CUDA workers with sensible memory settings and enabling accelerated networking with UCX, users get both stable out-of-core GPU computing and maximum performance. Additionally, targeting both CPU and GPU backends is straightforward with Dask's array and DataFrame backend options.
For documentation and additional details on how best to use RAPIDS with Dask, we recommend reading the Dask-cuDF and Dask-CUDA Best Practices documentation.