Maximizing GPU Efficiency for AI Workloads with Ori's Intelligent Scheduler

The challenge is that most orchestration platforms treat GPUs - expensive, specialized accelerators - as generic, indivisible resources. This mismatch between hardware capability and scheduling logic leads to poor capacity utilization.
While there are certainly exogenous reasons for GPU idle time, there are also technical ones. This post addresses those technical causes and how software can solve hardware utilization challenges.
Specifically, Ori built its own Intelligent Scheduler and Dynamic GPU Allocation engine: a cloud-native scheduling solution designed to place workloads intelligently across clusters and assign GPU resources with surgical precision, right down to the container level.
We proved this out on our public cloud, but the work delivers the same benefits to customers who license the software for their own GPU cloud, a white-label cloud, or an enterprise on-prem deployment. The functional outcome is that the Intelligent Scheduler lowers both capex and opex. Let’s look at how.
Inside the Intelligent Scheduler: Core Capabilities
To solve the problem of underutilization, Ori’s scheduler is built around several GPU-aware primitives:
1. Node Management
One of the harder challenges in GPU clusters is that a single node can rarely serve multiple workload types efficiently. Switching a node from VM passthrough to containerized workloads, or from training to inference, generally requires re-imaging, driver changes and several minutes of downtime - a penalty that compounds across hundreds of nodes. The alternative, a dedicated operations team performing this monitoring and reconfiguration by hand, has its own scalability problems.
Ori’s Node Manager is an event-driven operator that automates node role assignment. At its core, it uses dynamic labeling and service enablement to declare what a node is capable of in real time - whether that’s running VMs, Kubernetes pods, or inference endpoints.
The Node Manager also handles GPU mode transitions. For example, it can pre-allocate idle nodes into VM passthrough mode ahead of a large training job, then reclaim those nodes back into container mode for bursty inference traffic. By orchestrating these transitions programmatically, the system avoids the typical 3–5 minute penalty of manual reconfiguration and keeps the cluster responsive to workload surges.
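To make the pattern concrete, here is a minimal sketch of event-driven node labeling using the Kubernetes Python client. The label keys (`ori.io/persona`, `ori.io/gpu-mode`) are hypothetical placeholders rather than Ori's actual schema; the point is simply that a controller patches node labels in response to events, and workloads then select nodes by those labels.

```python
# Minimal sketch of event-driven node labeling with the Kubernetes Python client.
# Label keys below (ori.io/persona, ori.io/gpu-mode) are hypothetical placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

def set_node_persona(node_name: str, persona: str, gpu_mode: str) -> None:
    """Declare what a node is currently capable of by patching its labels."""
    patch = {
        "metadata": {
            "labels": {
                "ori.io/persona": persona,    # e.g. "vm-host", "k8s-worker", "inference"
                "ori.io/gpu-mode": gpu_mode,  # e.g. "passthrough", "container", "mig"
            }
        }
    }
    v1.patch_node(node_name, patch)

# Example: repurpose a node for VM passthrough ahead of a large training job.
set_node_persona("gpu-node-17", persona="vm-host", gpu_mode="passthrough")
```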
This level of automation means nodes are always in the right “persona” at the right time. It reduces operational overhead, minimizes bottlenecks and allows the cluster to serve a heterogeneous mix of workloads - bare-metal supercomputing, fractional GPU containers, or VM-based jobs - without manual ops intervention.
2. Automated MIG Management and Fractional Sharing
The platform can partition a single physical GPU into up to seven hardware-isolated instances by leveraging NVIDIA's Multi-Instance GPU (MIG) technology. This means a single GPU can securely handle multiple smaller workloads (like inference or notebooks) simultaneously.
While everyone has access to MIG, Ori went a step further with automated MIG management. The scheduler dynamically changes a GPU’s MIG profile based on the demands of the queue: if a stream of small jobs arrives, it can partition GPUs to serve them; if a large training job arrives, it can reconfigure a GPU back into a single, dedicated instance. This seamless lifecycle management is a complex task that is often beyond the scope of a general-purpose scheduler.
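As a rough illustration of what that lifecycle involves under the hood, the sketch below drives MIG reconfiguration by shelling out to `nvidia-smi`. This is not Ori's implementation, which automates the lifecycle inside the scheduler, and MIG profile IDs differ by GPU model (list them with `nvidia-smi mig -lgip`).

```python
# Illustrative sketch of queue-driven MIG reconfiguration via nvidia-smi.
# Profile IDs vary by GPU model; enabling/disabling MIG mode may require the GPU to be idle.
import subprocess

def run(cmd: list[str]) -> None:
    subprocess.run(cmd, check=True)

def partition_for_small_jobs(gpu_index: int, profile_id: str = "19") -> None:
    """Split one GPU into seven small hardware-isolated instances (e.g. 1g slices)."""
    run(["nvidia-smi", "-i", str(gpu_index), "-mig", "1"])        # enable MIG mode
    run(["nvidia-smi", "mig", "-i", str(gpu_index),
         "-cgi", ",".join([profile_id] * 7), "-C"])               # create GPU + compute instances

def reclaim_for_large_job(gpu_index: int) -> None:
    """Tear down all MIG instances and hand the whole GPU to one training job."""
    run(["nvidia-smi", "mig", "-i", str(gpu_index), "-dci"])      # destroy compute instances
    run(["nvidia-smi", "mig", "-i", str(gpu_index), "-dgi"])      # destroy GPU instances
    run(["nvidia-smi", "-i", str(gpu_index), "-mig", "0"])        # disable MIG mode

# Example: a stream of notebooks arrives -> partition GPU 0; a big job arrives -> reclaim it.
partition_for_small_jobs(0)
reclaim_for_large_job(0)
```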
3. Automated Failover
In large GPU clusters, hardware and software failures are not edge cases - they are a daily reality. A single GPU throwing persistent Xid errors, a NIC dropping packets under load, or a node going dark due to power loss can all derail running jobs if recovery isn’t handled automatically. Meta, even with its sophisticated infrastructure, reported an average of one GPU-related failure roughly every three hours during large training runs.
Ori’s control plane integrates failover directly into the scheduler and Node Manager, making recovery an event-driven process rather than a manual one. When a failure is detected, the system immediately cordons the node, drains active workloads, and reschedules them onto healthy hardware. Because workload scheduling and hardware management are tightly integrated, MTTR is measured in minutes, not hours, and requires no SRE intervention.
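A stripped-down version of that cordon-and-drain reaction might look like the sketch below, again using the Kubernetes Python client. The failure signal itself (Xid errors, NIC flaps, power loss) is assumed to come from a separate health monitor, which is not shown.

```python
# Minimal sketch of an event-driven failover reaction: cordon the unhealthy node,
# then evict its pods so their controllers reschedule them onto healthy hardware.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

def cordon_and_drain(node_name: str) -> None:
    # 1. Cordon: stop new workloads from landing on the failed node.
    v1.patch_node(node_name, {"spec": {"unschedulable": True}})

    # 2. Drain: delete the node's pods; Deployments/Jobs recreate them elsewhere.
    pods = v1.list_pod_for_all_namespaces(field_selector=f"spec.nodeName={node_name}")
    for pod in pods.items:
        # Skip DaemonSet-managed pods, which are tied to the node anyway.
        owners = pod.metadata.owner_references or []
        if any(o.kind == "DaemonSet" for o in owners):
            continue
        v1.delete_namespaced_pod(
            pod.metadata.name, pod.metadata.namespace, grace_period_seconds=30
        )

# Example: a health monitor reports persistent Xid errors on gpu-node-42.
cordon_and_drain("gpu-node-42")
```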
For operators, the outcome is predictable availability even at scale. That translates directly to less operational overhead, more automation and less reactivity.
4. Topology-Aware Bin-Packing
At a basic level, the scheduler uses intelligent node-level bin-packing to "pack" workloads onto nodes, minimizing resource fragmentation. But true optimization requires looking beyond the node and into the data center's physical topology.
This is the principle of Topology-Aware Bin-Packing. For distributed workloads, the scheduler models the network fabric itself. Consider an 8-pod training job using NCCL for communication. A generic scheduler might scatter those pods across racks, forcing traffic over slower spine switches and creating a bottleneck. Our scheduler understands the communication pattern and will place all 8 pods on nodes connected to the same leaf switch, ensuring the fastest possible inter-node communication and reducing training times.
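The placement logic can be illustrated with a toy example: group candidate nodes by the leaf switch they hang off, then try to satisfy the whole job under a single leaf before falling back to a multi-leaf placement. The node data and leaf labels here are made up for illustration; a real scheduler would read them from node labels or a fabric inventory.

```python
# Toy illustration of topology-aware placement: keep all pods of a distributed job
# under one leaf switch so NCCL traffic never has to cross the spine.
from collections import defaultdict

def place_distributed_job(nodes: list[dict], pods_needed: int, gpus_per_pod: int):
    """nodes: [{"name": ..., "leaf": ..., "free_gpus": ...}, ...]"""
    by_leaf: dict[str, list[dict]] = defaultdict(list)
    for node in nodes:
        by_leaf[node["leaf"]].append(node)

    # Try to satisfy the whole job under a single leaf switch first.
    for leaf, members in by_leaf.items():
        placement, remaining = [], pods_needed
        for node in sorted(members, key=lambda n: n["free_gpus"], reverse=True):
            fits = min(remaining, node["free_gpus"] // gpus_per_pod)
            placement += [node["name"]] * fits
            remaining -= fits
        if remaining == 0:
            return leaf, placement
    return None, []  # fall back to a multi-leaf placement (not shown)

nodes = [
    {"name": "n1", "leaf": "leaf-a", "free_gpus": 8},
    {"name": "n2", "leaf": "leaf-a", "free_gpus": 8},
    {"name": "n3", "leaf": "leaf-b", "free_gpus": 4},
]
leaf, placement = place_distributed_job(nodes, pods_needed=8, gpus_per_pod=1)
print(leaf, placement)  # all 8 pods land under leaf-a
```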
5. Effortless Suspend and Resume
On Ori, suspending and resuming resources is simple. With a single click, you can pause virtual machines and Kubernetes clusters, freeing GPUs and stopping consumption. Another click brings them back online exactly where they left off. This not only maximizes GPU utilization, but also lets geographically distributed teams hand off resources seamlessly: a team in Europe suspends its resources at the end of the work day, freeing capacity for a team in North America.
Inference workloads gain similar efficiency with scale-to-zero endpoints. When an endpoint sits idle for a user-specified duration, its replicas are dropped to zero, eliminating GPU usage entirely until traffic resumes. With cold starts under 5 seconds, getting back to serving customers after an idle period is incredibly quick.
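Conceptually, scale-to-zero reduces to an idle-timeout loop like the sketch below, which scales a Kubernetes Deployment's replicas to zero when no traffic has arrived within the configured window. The traffic counter and thresholds are placeholders, and a real system would buffer the first request at a gateway while the cold start completes; Ori's managed endpoints handle all of this automatically.

```python
# Sketch of scale-to-zero: if an inference Deployment has seen no traffic for a
# user-specified idle window, drop its replicas to zero; scale back up on demand.
# get_requests_in_last() is a stand-in for real metrics (e.g. a gateway request counter).
import time
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

IDLE_SECONDS = 600  # user-specified idle window

def scale(deployment: str, namespace: str, replicas: int) -> None:
    apps.patch_namespaced_deployment_scale(
        deployment, namespace, {"spec": {"replicas": replicas}}
    )

def idle_loop(deployment: str, namespace: str, get_requests_in_last) -> None:
    while True:
        if get_requests_in_last(IDLE_SECONDS) == 0:
            scale(deployment, namespace, 0)  # free the GPU entirely
        else:
            scale(deployment, namespace, 1)  # cold-start back to serving
        time.sleep(30)
```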

Enabling Unified Infrastructure for Mixed Workloads
AI/ML workloads vary widely — from large-scale training jobs that consume entire GPUs, to low-latency inference services that need many small, elastic slices. Traditionally, organizations build separate clusters for each, introducing silos, higher costs, and more operational overhead.
With a purpose-built, GPU-aware scheduler, a single unified cluster can support the full spectrum of workloads:
- A large training job is placed on a full H100 GPU.
- An inference service pod using just 1/7th of an H100 runs on a MIG slice — alongside six other inference pods on the same device.
- Data preprocessing, fine-tuning, and other tasks are automatically fit wherever resources are available (a resource-request sketch follows the list).
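To make the full-GPU versus fractional distinction concrete, here is a sketch of how the two requests differ at the pod level, built with the Kubernetes Python client. The resource names follow NVIDIA's device plugin conventions; the exact MIG resource name (for example `nvidia.com/mig-1g.10gb`) depends on how MIG is exposed in your cluster.

```python
# Sketch of how full-GPU and fractional (MIG) requests differ at the container level.
from kubernetes import client

def gpu_container(name: str, image: str, resource_name: str, count: int = 1):
    """Build a container spec that claims `count` units of the given GPU resource."""
    return client.V1Container(
        name=name,
        image=image,
        resources=client.V1ResourceRequirements(limits={resource_name: str(count)}),
    )

# A large training job claims a whole H100.
trainer = gpu_container("trainer", "my-registry/trainer:latest", "nvidia.com/gpu")

# An inference service claims a single 1/7th MIG slice of the same device type.
inference = gpu_container("inference", "my-registry/serve:latest", "nvidia.com/mig-1g.10gb")
```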
But when we say “mixed workloads,” we mean more than just training vs. inference. Ori unifies the entire stack:
- Bare metal provisioning
- VMs and Kubernetes orchestration
- Higher-level AI services like inference, fine-tuning, and training
All of it is provisioned, managed, and optimized under a single platform. The result is simplified MLOps pipelines, dynamic repurposing of hardware, and seamless scaling from a few nodes to many thousands.
During the day, a cluster might serve low-latency inference; overnight, those same GPUs can be reallocated to large-batch training. The goal is to ensure the hardware is always delivering value at every layer of the stack - from bare metal to fine-tuning. And because the scheduler operates in a hardware-aware control plane, it can accommodate new interconnects (NVLink Switch, CXL) and new accelerators (AMD, ARM) as they enter the ecosystem.
Conclusion
At its core, Ori’s Intelligent Scheduler treats GPUs as first-class, heterogeneous resources instead of generic nodes. It automates node role changes, manages MIG profiles on the fly, integrates failure handling directly into scheduling, and places jobs with awareness of the network fabric. This means fewer idle devices, faster recovery from faults, and better packing of mixed workloads across a shared cluster. For operators, the result is a system that reduces manual intervention and improves utilization without adding complexity.
If you're ready to see how a scheduler built for the modern AI stack can impact your environment, schedule a technical demo with our team.

