Practical Tips for Preventing GPU Fragmentation for Volcano Scheduler

At NVIDIA, we take pride in tackling complex infrastructure challenges with precision and innovation. When an NVIDIA DGX Cloud-provisioned Kubernetes cluster running the Volcano Scheduler faced GPU underutilization, we stepped in to deliver a solution that not only met but exceeded expectations.

By combining advanced scheduling techniques with a deep understanding of distributed workloads, we achieved around 90% GPU occupancy, well above the contractual target of 80%. Here’s a detailed look at the problem, our approach, and the results.

Problem: GPU fragmentation and scheduling inefficiencies

The DGX Cloud Kubernetes cluster provided thousands of GPUs across hundreds of nodes, each node equipped with multiple NVIDIA L40S GPUs. The cluster supported diverse workloads:

  • Multi-node, multi-GPU distributed training jobs
  • Batch inferencing for high-throughput AI models
  • GPU-backed data-processing pipelines
  • Interactive notebooks for development and analytics

Despite the availability of robust hardware, the cluster suffered from GPU fragmentation, where nodes were left partially occupied and unusable for larger jobs. This inefficiency was compounded by the default behavior of the Volcano Scheduler, which used a gang scheduling algorithm.

Without intervention, we risked breaching the contractual agreement to maintain at least 80% GPU occupancy. This would result in reduced cluster capacity as unused resources were reclaimed for other teams.

Key challenges

The implementation of this solution required overcoming two primary obstacles:

  • Gang scheduling’s all-or-nothing approach: Distributed jobs requiring multiple GPUs across nodes were queued indefinitely unless all required resources were available simultaneously, leading to bottlenecks and delayed task execution (see the example Volcano Job after this list).
  • Fragmentation due to random placement: Workloads were assigned to nodes based on simple heuristics or random selection, often leaving GPUs scattered across nodes in a fragmented state (Figure 1).
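
To make the all-or-nothing behavior concrete, the following is a minimal sketch of a gang-scheduled Volcano Job. The job name, queue, image, and GPU counts are illustrative placeholders rather than values from the cluster described in this post:

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-training            # placeholder name
spec:
  schedulerName: volcano
  queue: default                        # placeholder queue
  minAvailable: 4                       # gang scheduling: place all 4 workers together or none at all
  tasks:
  - name: worker
    replicas: 4
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: trainer
          image: example.com/trainer:latest   # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 4         # each worker needs a full 4-GPU node

With minAvailable equal to the replica count, Volcano keeps all four worker pods pending until four nodes with four free GPUs each are available at the same time, which is exactly why a fragmented cluster stalls such jobs.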

Solution: Integrating bin-packing with gang scheduling

To address these challenges, we implemented an enhanced scheduling strategy by integrating a bin-packing algorithm into the Volcano Scheduler. This approach focused on consolidating workloads to maximize node utilization while leaving other nodes entirely free for larger jobs.

Technical implementation

Our approach to resolving GPU fragmentation involved three key components:

  • Workload prioritization:
    • Resources were ranked in descending order of importance: GPUs, CPUs, and memory.
    • Nodes suitable for incoming workloads were shortlisted based on resource requirements and placement constraints such as node selectors and affinity rules (see the example manifest after this list).
  • Optimized placement through bin-packing:
    • Partially occupied nodes were ranked by their current utilization levels, lowest to highest.
    • Workloads were placed on nodes with the least free resources first, ensuring that nodes became fully utilized before moving to others.
  • Gang scheduling integration:
    • The enhanced scheduler maintained gang scheduling’s all-or-nothing principle but added intelligence to prioritize workload placement based on resource consolidation.
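
As a rough illustration of how a workload narrows down candidate nodes before the bin-packing score is applied, consider the pod sketch below. The label keys follow the NVIDIA GPU Feature Discovery naming convention and, together with the pod name and image, are assumptions rather than values taken from this cluster:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-workload                          # hypothetical name
spec:
  schedulerName: volcano                      # hand the pod to the Volcano Scheduler
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-L40S       # assumes GPU Feature Discovery labels are present
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.count         # only consider nodes exposing more than 3 GPUs
            operator: Gt
            values: ["3"]
  containers:
  - name: app
    image: example.com/app:latest             # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 2

Only the nodes that pass these filters are then ranked by the bin-packing score defined in the configuration below.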

These components were enabled with the following Volcano Scheduler configuration, which drives efficient workload placement and prioritizes node utilization.

volcano-scheduler.conf: |
  # Actions run in order on each scheduling cycle
  actions: "enqueue, allocate, backfill"
  tiers:
  - plugins:
    - name: priority
    - name: gang                  # keep the all-or-nothing placement of multi-pod jobs
      enablePreemptable: false
    - name: conformance
  - plugins:
    - name: drf                   # dominant resource fairness across jobs
    - name: predicates            # node filtering: selectors, affinity, taints
    - name: proportion            # queue-level capacity sharing
    - name: nodeorder
    - name: binpack               # score nodes so that busier nodes are filled first
      arguments:
        binpack.weight: 10        # weight of the binpack score relative to other node-order plugins
        binpack.cpu: 2            # weight of CPU within the binpack score
        binpack.memory: 2         # weight of memory within the binpack score
        binpack.resources: "nvidia.com/gpu"   # extended resources included in the score
        binpack.resources.nvidia.com/gpu: 8   # GPU weight, so GPUs dominate CPU and memory
        binpack.score.gpu.weight: 10
  - plugins:
    - name: overcommit
    - name: rescheduling
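
In a standard Volcano installation, this configuration lives in the scheduler's ConfigMap. The sketch below assumes the default name and namespace from the upstream install (volcano-scheduler-configmap in volcano-system); a DGX Cloud or otherwise customized deployment may use different values:

apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap    # default name in the upstream Volcano install
  namespace: volcano-system            # default namespace; verify in your deployment
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    # ...tiers and plugin arguments exactly as shown above...

Depending on the Volcano version, the scheduler either reloads this ConfigMap automatically or needs its pod restarted to pick up the change, so check the behavior of the version you run.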

Example scenario

To illustrate the effectiveness of our solution, let’s consider the following scenario:

Initial state

Two nodes—Node A with two GPUs occupied and Node B, which is free—receive a new workload requiring two GPUs.
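
Expressed as a manifest, the incoming request might look like the following minimal pod sketch; the name and image are placeholders:

apiVersion: v1
kind: Pod
metadata:
  name: new-two-gpu-workload           # placeholder name
spec:
  schedulerName: volcano
  containers:
  - name: worker
    image: example.com/worker:latest   # placeholder image
    resources:
      limits:
        nvidia.com/gpu: 2              # the two GPUs requested in this scenario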

A flow diagram shows that a new GPU workload request arrives in the Kubernetes cluster, which has both partially and fully free GPU nodes.
Figure 1. A Kubernetes cluster with two nodes 

Default gang scheduling

The workload is placed randomly, for example, on Node B, leaving both nodes partially occupied.

A flow diagram shows a Kubernetes cluster with two nodes, A and B. A new workload requiring two GPUs arrives, and the default gang scheduling mechanism places it on Node B, leaving both nodes partially occupied and lowering the usable capacity of each.
Figure 2. GPU fragmentation in Volcano Scheduler 

As a result, the cluster contained more partially occupied nodes than completely free nodes. Figure 3 shows that only 18 nodes had all four GPUs available. However, there were around 115 nodes with three free GPUs, which cannot be used for a training job that requires four GPUs per node (Figure 4).

A graph shows the fragmented cluster, where only 18 nodes have all four GPUs free.
Figure 3. Kubernetes cluster with only 18 nodes that have all four GPUs accessible

The impact of this fragmentation becomes even more apparent when we examine the distribution of partially occupied nodes.

A graph shows the fragmented nodes in action, where there are a large number of partially occupied nodes. A training task that requires four GPUs per node cannot use partially occupied nodes.
Figure 4. Kubernetes cluster with 115 nodes that have only three free GPUs

Solution

The workload is placed on Node A, fully using its four GPUs while leaving Node B completely free for future, large-scale jobs.

A flow diagram shows the optimal approach for node selection to mitigate GPU fragmentation in Volcano Scheduler.
Figure 5. A Kubernetes cluster with two optimized nodes

Figure 5 shows Nodes A and B, with Node A partially occupied, receiving a new workload that requires two GPUs. With bin-packing enabled, the scheduler places the workload on Node A, the most utilized node, filling it completely and leaving Node B entirely free.

As a result, the cluster had more completely free nodes than partially occupied nodes. Figure 6 shows that there are 214 nodes with all four GPUs available, which a training job requiring four GPUs per node can easily use.

The graph shows the mitigated GPU fragmentation issue, in which there are a large number of fully available GPU nodes for consumption.
Figure 6. Kubernetes cluster with 214 nodes that have all four GPUs available

Figure 7 shows that there are approximately nine nodes with three free GPUs.

The graph shows that the GPU fragmentation issue was significantly mitigated: only about nine nodes were left with three free GPUs.
Figure 7. Mitigated GPU fragmentation issue

This strategic placement significantly reduced fragmentation and improved resource availability for large training jobs.

Results achieved

The integration of bin-packing into the Volcano Scheduler transformed the GPU cluster’s performance:

  • Increased resource availability: The number of fully free nodes (with all four GPUs available) increased from 18 to 214, enabling seamless scheduling of large-scale training jobs.
  • Improved GPU occupancy: Average GPU utilization rose to an industry-leading 90%, far exceeding the 80% contractual requirement.
  • Enhanced cost efficiency: By optimizing resource usage, we avoided capacity reductions and maintained full access to the allocated cluster resources without additional overhead costs.
  • Scalability across workloads: The solution proved effective not only for distributed training jobs but also for batch inferencing and GPU-backed data processing tasks.

As a result, overall cluster occupancy averaged roughly 90%, well above the Resource Governance team’s 80% requirement (Figure 8).

A line graph shows that, after resolving the GPU fragmentation issue, the GPU occupancy hovered around 90%.
Figure 8. DGX Cloud cluster GPU occupancy hovering around 90%

Broader implications for distributed systems

This post highlighted how thoughtful scheduling strategies can resolve pervasive challenges in multi-node, multi-GPU clusters:

  • Proactive resource management prevents bottlenecks: By anticipating fragmentation issues and addressing them through optimized placement algorithms, organizations can avoid costly delays and underutilization.
  • Cost-aware engineering enhances ROI: Efficiently using existing infrastructure reduces the need for additional hardware investments while maximizing performance.
  • Flexible scheduling accommodates diverse workloads: The integration of bin-packing algorithms demonstrates how schedulers such as Volcano can adapt to meet specific workload requirements without overhauling existing systems.
