At NVIDIA, we take pride in tackling complex infrastructure challenges with precision and innovation. When a customer running the Volcano scheduler faced GPU underutilization in their NVIDIA DGX Cloud-provisioned Kubernetes cluster, we stepped in to deliver a solution that not only met but exceeded expectations.
By combining advanced scheduling techniques with a deep understanding of distributed workloads, we achieved around 90% GPU occupancy, well above the contractual target of 80%. Here’s a detailed look at the problem, our approach, and the results.
Problem: GPU fragmentation and scheduling inefficiencies
The DGX Cloud Kubernetes cluster consisted of thousands of nodes, each equipped with multiple NVIDIA L40S GPUs. The cluster supported diverse workloads:
- Multi-node, multi-GPU distributed training jobs
- Batch inferencing for high-throughput AI models
- GPU-backed data-processing pipelines
- Interactive notebooks for development and analytics
Despite the availability of robust hardware, the cluster suffered from GPU fragmentation, where nodes were left partially occupied and unusable for larger jobs. This inefficiency was compounded by the default behavior of the Volcano Scheduler, which used a gang scheduling algorithm.
Without intervention, we risked breaching the contractual agreement to maintain at least 80% GPU occupancy. This would result in reduced cluster capacity as unused resources were reclaimed for other teams.
Key challenges
The implementation of this solution required overcoming two primary obstacles:
- Gang scheduling’s all-or-nothing approach: Distributed jobs requiring multiple GPUs across nodes were queued indefinitely unless all required resources were available simultaneously. This led to bottlenecks and delayed task execution.
- Fragmentation due to random placement: Workloads were assigned to nodes based on simple heuristics or random selection, often leaving GPUs scattered across nodes in a fragmented state (Figure 1).
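The fragmentation effect described above can be reproduced in a few lines of Python. This is an illustrative sketch, not the production setup: the four-node cluster, the 4-GPU node size, and the "spread to the emptiest node" heuristic are all assumptions chosen to make the failure mode visible.

```python
GPUS_PER_NODE = 4
nodes = [4, 4, 4, 4]  # free GPUs on four hypothetical nodes

def place_spread(free, job_gpus):
    """Naive heuristic: put the job on the node with the MOST free GPUs.
    This spreads load across the cluster, which is exactly what causes
    fragmentation."""
    i = max(range(len(free)), key=lambda n: free[n])
    assert free[i] >= job_gpus, "gang scheduling: job must fit entirely"
    free[i] -= job_gpus

# Submit four 2-GPU jobs: each lands on a different node.
for _ in range(4):
    place_spread(nodes, 2)

print(nodes)       # [2, 2, 2, 2]
print(sum(nodes))  # 8 GPUs free in total...
print(max(nodes))  # ...but no node has 4 free, so a 4-GPU-per-node job stalls
```

Even though half the cluster's GPUs are idle, a distributed training job that needs four GPUs on a single node cannot be gang-scheduled anywhere.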
Solution: Integrating bin-packing with gang scheduling
To address these challenges, we implemented an enhanced scheduling strategy by integrating a bin-packing algorithm into the Volcano Scheduler. This approach focused on consolidating workloads to maximize node utilization while leaving other nodes entirely free for larger jobs.
Technical implementation
Our approach to resolving GPU fragmentation involved three key components:
- Workload prioritization:
- Resources were ranked in descending order of importance: GPUs, CPUs, and memory.
- Nodes suitable for incoming workloads were shortlisted based on resource requirements, for example, selectors and affinity rules.
- Optimized placement through bin-packing:
- Partially occupied nodes were ranked by their current utilization levels, lowest to highest.
- Workloads were placed on nodes with the least free resources first, ensuring that nodes became fully utilized before moving to others.
- Gang scheduling integration:
- The enhanced scheduler maintained gang scheduling’s all-or-nothing principle but added intelligence to prioritize workload placement based on resource consolidation.
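The placement rule from the bullets above can be sketched as follows. This is a hedged simplification of the bin-packing idea, not Volcano's actual implementation: real scoring weighs several resources at once, as the configuration later in this post shows.

```python
def binpack_place(free_gpus, job_gpus):
    """Bin-packing sketch: among nodes that can fit the job, pick the one
    with the FEWEST free GPUs, so partially used nodes fill up before any
    fully free node is touched. Returns the chosen node index, or None if
    no node fits (the gang-scheduled job then waits)."""
    candidates = [i for i, f in enumerate(free_gpus) if f >= job_gpus]
    if not candidates:
        return None
    best = min(candidates, key=lambda i: free_gpus[i])
    free_gpus[best] -= job_gpus
    return best

# Node 0 has 2 GPUs free, node 1 has 4: a 2-GPU job lands on node 0,
# filling it completely and leaving node 1 untouched for larger jobs.
free = [2, 4]
print(binpack_place(free, 2))  # 0
print(free)                    # [0, 4]
```

The key design choice is the `min` over free capacity: it is the opposite of a spread-style heuristic, and it is what keeps whole nodes available for multi-GPU jobs.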
These behaviors were enabled through the following Volcano Scheduler configuration, which prioritizes node utilization during workload placement:

```yaml
volcano-scheduler.conf: |
  actions: "enqueue, allocate, backfill"
  tiers:
  - plugins:
    - name: priority
    - name: gang
      enablePreemptable: false
    - name: conformance
  - plugins:
    - name: drf
    - name: predicates
    - name: proportion
    - name: nodeorder
    - name: binpack
      arguments:
        binpack.weight: 10
        binpack.cpu: 2
        binpack.memory: 2
        binpack.resources: "nvidia.com/gpu"
        binpack.resources.nvidia.com/gpu: 8
        binpack.score.gpu.weight: 10
  - plugins:
    - name: overcommit
    - name: rescheduling
```
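The per-resource weights in the configuration (CPU 2, memory 2, GPU 10) control how much each resource contributes to a node's bin-packing score. The sketch below illustrates the general idea of a weighted utilization-after-placement score; the exact formula is internal to Volcano's binpack plugin, and the node sizes and usage figures here are invented for illustration.

```python
def binpack_score(used, requested, allocatable, weights, binpack_weight=10):
    """Illustrative score: a node that would be MORE utilized after placing
    the job scores higher, with each resource weighted as configured."""
    total, weight_sum = 0.0, 0
    for res, w in weights.items():
        cap = allocatable.get(res, 0)
        if cap == 0:
            continue
        total += w * (used[res] + requested[res]) / cap
        weight_sum += w
    return binpack_weight * total / weight_sum if weight_sum else 0.0

weights = {"cpu": 2, "memory": 2, "nvidia.com/gpu": 10}
request = {"cpu": 4, "memory": 16, "nvidia.com/gpu": 2}
capacity = {"cpu": 64, "memory": 256, "nvidia.com/gpu": 4}

# Node A: 2 of 4 GPUs already used; Node B: all 4 GPUs free.
node_a = binpack_score({"cpu": 8, "memory": 32, "nvidia.com/gpu": 2},
                       request, capacity, weights)
node_b = binpack_score({"cpu": 8, "memory": 32, "nvidia.com/gpu": 0},
                       request, capacity, weights)
print(node_a > node_b)  # True: the partially used node wins the placement
```

Because the GPU weight (10) dominates the CPU and memory weights (2 each), GPU consolidation drives the ranking, which matches the goal of keeping whole nodes free for multi-GPU training jobs.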
Example scenario
To illustrate the effectiveness of our solution, let’s consider the following scenario:
Initial state
Two nodes, Node A with two of its four GPUs occupied and Node B completely free, receive a new workload requiring two GPUs.
Default gang scheduling
The workload is placed randomly, for example, on Node B, leaving both nodes partially occupied.
As a result, the cluster contained more partially occupied nodes than completely free nodes. Figure 3 shows that only 18 nodes had all four GPUs free; however, there were around 115 nodes with three free GPUs, which cannot serve a training job that requires four GPUs per node.
The impact of this fragmentation becomes even more apparent when we examine the distribution of partially occupied nodes.
Solution
The workload is placed on Node A, fully using its four GPUs while leaving Node B completely free for future, large-scale jobs.
Figure 5 shows partially occupied Nodes A and B receiving a new workload requiring two GPUs; the default gang scheduling mechanism places it on Node B, lowering the available capacity on both nodes.
As a result, the cluster had more free nodes than occupied nodes. Figure 6 shows that there are 214 nodes with all four GPUs available, which a training job requiring four GPUs per node can use directly.
Figure 7 shows that there are approximately nine nodes with three free GPUs.
This strategic placement significantly reduced fragmentation and improved resource availability for large training jobs.
Results achieved
The integration of bin-packing into the Volcano Scheduler transformed the GPU cluster’s performance:
- Increased resource availability: The number of fully free nodes (with all four GPUs available) increased, enabling seamless scheduling of large-scale training jobs.
- Improved GPU occupancy: Average GPU utilization rose to an industry-leading 90%, far exceeding the 80% contractual requirement.
- Enhanced cost efficiency: By optimizing resource usage, we avoided capacity reductions, and the customer retained full access to its allocated cluster resources without additional overhead costs.
- Scalability across workloads: The solution proved effective not only for distributed training jobs but also for batch inferencing and GPU-backed data processing tasks.
Overall, cluster occupancy averaged roughly 90%, well beyond the Resource Governance team's criterion of 80%.
Broader implications for distributed systems
This post highlighted how thoughtful scheduling strategies can resolve pervasive challenges in multi-node, multi-GPU clusters.
- Proactive resource management prevents bottlenecks: By anticipating fragmentation issues and addressing them through optimized placement algorithms, organizations can avoid costly delays and underutilization.
- Cost-aware engineering enhances ROI: Efficiently using existing infrastructure reduces the need for additional hardware investments while maximizing performance.
- Flexible scheduling accommodates diverse workloads: The integration of bin-packing algorithms demonstrates how schedulers such as Volcano can adapt to meet specific workload requirements without overhauling existing systems.
Related resources
- TechBlog: Fine-Tune and Align LLMs Easily with NVIDIA NeMo Customizer
- TechBlog: Accelerate Apache Spark ML on NVIDIA GPUs with Zero Code Change
- TechBlog: Mastering LLM Techniques: LLMOps
- Forum: Dive Deeper into topics related to accelerated compute and optimization approaches in our GPU Cloud forum.
- GTC session: Scaling Meta’s Infrastructure for Heterogenous AI Use-Cases and Operational Efficiency
- GTC session: Fine-Grained GPU Partitioning for Higher Performance and Energy Efficiency
- GTC session: CUDA Techniques to Maximize Concurrency and System Utilization
- NGC Containers: Validator for NVIDIA GPU Operator
- NGC Containers: CHROMA
- NGC Containers: IndeX