Beyond the VPC: Architecting End-to-End Secure Segmentation for AI Clouds

Multi-tenancy in the era of high-performance AI infrastructure presents a fundamental engineering trilemma: how does one deliver the uncompromised performance of bare metal, the verifiable security of a private cloud, and the operational flexibility of a shared, elastic platform? For years, the industry’s answer has been a patchwork of virtual private clouds (VPCs), software-defined networks (SDNs), and access control lists—a model designed for general-purpose computing that is inadequate for the unique demands of GPU-centric workloads. For AI clouds, the stakes go beyond security alone. Every microsecond of latency and every slice of GPU left idle is a direct hit to ROI. Segmentation must therefore guarantee not only isolation but also predictable performance, ensuring expensive accelerators are never wasted on noisy neighbors or throttled by software overlays. That is why conventional approaches are economically unsustainable for AI-class workloads.
These conventional models treat security and segmentation as a software layer bolted on top of the infrastructure. For AI and HPC, this is an architectural flaw. The extreme throughput and low-latency requirements of distributed training and real-time inference demand a different approach—one where segmentation is not an afterthought but a core principle, enforced from the silicon up.
This post, from Ori’s First Principles series, is a deep dive into the Ori platform’s architecture for secure segmentation. We will explore how we moved beyond the limitations of the traditional VPC model to build a unified, hardware-accelerated framework that provides end-to-end isolation across compute, storage, and networking, all governed by a single, distributed control plane.
The Shortcomings of Conventional Cloud Segmentation
The standard toolkit for multi-tenancy in the cloud (VPCs, security groups, and IAM policies) is powerful but insufficient for the performance and security guarantees required by AI. This model introduces several critical limitations.
- The Performance Penalty of Software Overlays: Traditional SDN relies heavily on the hypervisor to encapsulate and direct network traffic. Every packet must traverse a virtual switch, adding latency and consuming valuable CPU cycles that would otherwise be available to the workload. For RDMA-dependent technologies like InfiniBand and RoCE, this software-in-the-middle approach can cripple performance, effectively negating the benefits of the high-performance fabric.
- Lack of GPU Awareness: Conventional segmentation models are blind to the internal architecture of a GPU. They treat a multi-thousand-dollar accelerator as a monolithic PCI device. They lack the primitives to understand, let alone enforce, isolation within a single GPU, leading to inefficient resource allocation and an inability to securely share a single accelerator among multiple, untrusted tenants.
- Fragmented Management and Policy Gaps: The security policies for compute (VM access), storage (bucket policies), and networking (security groups, NACLs) are typically configured in separate, siloed systems. This fragmentation creates immense operational complexity, increases the risk of misconfiguration, and expands the attack surface. There is often no single source of truth for a tenant’s end-to-end security posture, making comprehensive audits and governance difficult. These fragmented tools were never designed to form a single, cohesive security fabric, forcing operators to manage policy gaps between what their VPC allows, what their storage system enforces, and what their physical hardware can do.
- The Persistent "Noisy Neighbor" Problem: While logical separation can be achieved, tenants often still compete for shared physical resources—the network backbone, the storage appliance's front-end or the server's PCIe bus. Without deeper, hardware-aware isolation, one tenant's I/O-heavy workload can create unpredictable performance degradation for another, a classic "noisy neighbor" problem that is unacceptable for mission-critical AI jobs.
AWS, Azure, and GCP provide VPCs, security groups, and some GPU-sharing primitives, but these were built for general-purpose compute. None unifies segmentation across GPU, storage, and network under one control plane.
A Unified, Hardware-First Approach
Our architectural philosophy is simple: security and segmentation policies must be defined centrally but enforced as close to the silicon as possible.
Instead of relying solely on software, the Ori platform leverages a suite of hardware-level technologies to create verifiable, performance-isolated environments. The Ori Global Control Plane acts as the central brain, defining a tenant's entire security context—their compute partitions, their virtual network fabrics, and their storage access rights. These policies are then pushed down and enforced by the hardware itself across a globally distributed, multi-vendor fleet of servers.
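To make this concrete, here is a minimal sketch of the idea: a tenant’s entire security context modeled as one object whose fields map directly to hardware enforcement points. The class and field names are illustrative assumptions, not Ori’s actual API.

```python
from dataclasses import dataclass, field

@dataclass
class TenantSecurityContext:
    """Single source of truth for one tenant's isolation policy.

    Illustrative only: these names are assumptions, not Ori's API.
    """
    tenant_id: str
    gpu_partitions: list[str] = field(default_factory=list)   # MIG profiles or SR-IOV VFs
    vxlan_vni: int = 0                                         # tenant's L2 overlay identifier
    ib_pkey: int = 0                                           # InfiniBand partition key
    storage_volumes: list[str] = field(default_factory=list)  # volumes the tenant may mount

# The control plane defines the context once...
tenant = TenantSecurityContext(
    tenant_id="acme-ai",
    gpu_partitions=["mig-3g.40gb", "mig-3g.40gb"],
    vxlan_vni=10042,
    ib_pkey=0x8001,
    storage_volumes=["vol-acme-datasets"],
)
# ...and pushes each field to the enforcement point closest to the silicon:
# GPU driver (MIG), NIC/DPU (VXLAN + SR-IOV), Subnet Manager (PKey), storage ACLs.
```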
The result is a platform that delivers the verifiable isolation guarantees of physically separate infrastructure with the economic and operational flexibility of a shared, multi-tenant cloud.
Deep Dive: Segmenting the Compute Layer
For AI, the compute layer is the GPU. Sharing this powerful and expensive resource securely and efficiently is the primary challenge.
NVIDIA MIG (Multi-Instance GPU)
For NVIDIA GPUs, we leverage MIG to partition a single accelerator into up to seven independent, hardware-isolated instances. This is not software-based time-slicing or containerization; MIG is a physical partitioning of the GPU’s internal resources. Each MIG instance receives its own dedicated set of Streaming Multiprocessors (SMs), its own portion of the L2 cache, and its own memory controllers and DRAM address paths.
To the operating system and the workload, each MIG instance appears as a distinct, standalone GPU. This provides a hard-wired guarantee of performance isolation—one tenant's high-intensity workload on one instance cannot impact the latency or throughput of another tenant's workload on a different instance on the same physical GPU. The Ori scheduler is fully MIG-aware, placing containerized and VM-based workloads onto these fractional instances to maximize utilization without compromising security. This awareness allows the scheduler to make intelligent cost-saving decisions, such as 'bin-packing' multiple, smaller inference workloads securely onto a single, partitioned GPU, maximizing the return on every accelerator.
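As a rough illustration of what this partitioning looks like at the driver level, the sketch below drives NVIDIA’s standard nvidia-smi MIG workflow from Python. This is the generic manual flow, not Ori’s scheduler, and profile IDs vary by GPU model.

```python
import subprocess

def run(cmd: str) -> None:
    """Run a shell command, raising on failure (sketch: no rollback)."""
    print(f"+ {cmd}")
    subprocess.run(cmd, shell=True, check=True)

# Enable MIG mode on GPU 0 (requires a GPU reset, and no processes
# may be using the device).
run("nvidia-smi -i 0 -mig 1")

# Carve GPU 0 into two hardware-isolated instances. Profile IDs are
# model-specific: on an A100-40GB, ID 9 is the 3g.20gb profile
# (3 compute slices and 20 GB of dedicated DRAM). The -C flag also
# creates a matching compute instance inside each GPU instance.
run("nvidia-smi mig -i 0 -cgi 9,9 -C")

# Each instance now enumerates as an independent device that can be
# handed to a different tenant's container or VM.
run("nvidia-smi mig -i 0 -lgi")
```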
AMD and SR-IOV for Accelerators
For other accelerators like AMD GPUs, we utilize SR-IOV (Single Root I/O Virtualization). SR-IOV is a PCIe specification that allows a single physical device to appear on the bus as multiple, independent, lightweight virtual devices known as Virtual Functions (VFs). Each VF has its own dedicated I/O path and can be directly assigned to a different VM or container. This bypasses the hypervisor entirely for I/O operations, providing a secure, direct-from-hardware path for tenants to their slice of the accelerator.
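The Linux kernel exposes SR-IOV through a standard sysfs interface, so instantiating VFs on a supported device can be sketched as below. The PCI address is a placeholder, and how many VFs a device exposes (and what each VF can do) depends on the specific accelerator and driver.

```python
from pathlib import Path

# PCI address of the physical function (PF); a placeholder, not a real device.
PF = Path("/sys/bus/pci/devices/0000:3b:00.0")

# The device advertises how many Virtual Functions its silicon supports.
total_vfs = int((PF / "sriov_totalvfs").read_text())
print(f"device supports up to {total_vfs} VFs")

# Writing to sriov_numvfs asks the driver to instantiate that many VFs.
# (If VFs already exist, the kernel requires writing 0 first.)
(PF / "sriov_numvfs").write_text("4")

# Each VF enumerates on the PCIe bus as an independent device, visible
# as a virtfnN symlink under the PF, ready for passthrough via VFIO.
for vf in sorted(PF.glob("virtfn*")):
    print(vf.name, "->", vf.resolve().name)
```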
Deep Dive: Isolating the Network Fabric
AI networking is a world of extremes, requiring both the massive east-west bandwidth of InfiniBand for training and the resilient, scalable connectivity of Ethernet for inference and storage. Our approach segments both.
High-Performance Virtual Fabrics with EVPN/VXLAN
We build our multi-tenant (North-South) Ethernet fabrics using a combination of VXLAN (Virtual Extensible LAN) and BGP EVPN (Ethernet VPN).
- VXLAN is the data plane protocol. It encapsulates a tenant's Layer 2 traffic into Layer 3 UDP packets, creating a virtual "tunnel" or overlay network that can span the entire physical data center fabric.
- BGP EVPN is the modern standard for the control plane. It uses the battle-tested Border Gateway Protocol—the protocol that runs the internet—to advertise and learn the location of tenant endpoints (their MAC and IP addresses) across the fabric.
This combination allows us to programmatically create thousands of fully isolated, Layer 2 virtual networks on demand. When a new tenant is created, our control plane instructs the network fabric to create a new VXLAN Network Identifier (VNI) and uses BGP to establish the reachability policies for that tenant's private network. All traffic is isolated within its tunnel, invisible to other tenants.
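For illustration, here is roughly what standing up one tenant’s segment looks like on a Linux VTEP using standard iproute2 commands; in production this is driven programmatically by the control plane against switch and DPU configuration rather than run by hand. The VNI and VTEP address are placeholders.

```python
import subprocess

def sh(cmd: str) -> None:
    print(f"+ {cmd}")
    subprocess.run(cmd, shell=True, check=True)

VNI = 10042            # illustrative tenant VNI
VTEP_IP = "10.0.0.1"   # this host's VTEP (loopback) address, assumed

# Data plane: a VXLAN device keyed to the tenant's VNI. 'nolearning'
# disables flood-and-learn; with EVPN, BGP distributes MAC/IP
# reachability instead.
sh(f"ip link add vxlan{VNI} type vxlan id {VNI} dstport 4789 "
   f"local {VTEP_IP} nolearning")

# A per-tenant bridge stitches local workload interfaces to the tunnel.
sh(f"ip link add br{VNI} type bridge")
sh(f"ip link set vxlan{VNI} master br{VNI}")
sh(f"ip link set vxlan{VNI} up && ip link set br{VNI} up")

# Control plane (not shown): an EVPN speaker such as FRR advertises
# this VNI's MAC/IP routes to the rest of the fabric over BGP.
```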
Preserving Performance with SR-IOV and RoCE v2
To ensure this virtual fabric doesn't come with a performance penalty, we again utilize SR-IOV, this time on the network interface cards (NICs). By assigning a VF of a high-speed NIC directly to a tenant's VM or container, we allow their traffic to completely bypass the host's software network stack and flow directly from the workload to the hardware. This is how we preserve the near bare-metal latency required for real-time inference serving.
Once this direct hardware path is established, we enable RoCE v2. This protocol brings the benefits of RDMA—Remote Direct Memory Access—to Ethernet. It allows data to move from the memory of one server directly into the memory of another without involving the CPU.
This allows our Ethernet fabric to achieve low-latency, high-throughput performance that rivals InfiniBand, which is critical for high-speed distributed storage and real-time inference serving, all while maintaining the routability and flexibility of standard Ethernet.
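As a sketch of how a VF is locked down before being handed to a tenant, the commands below use the standard iproute2 VF controls; the interface name, MAC, and VLAN ID are placeholders.

```python
import subprocess

def sh(cmd: str) -> None:
    print(f"+ {cmd}")
    subprocess.run(cmd, shell=True, check=True)

PF = "enp59s0f0"   # physical NIC name, illustrative
VF = 0             # the Virtual Function handed to this tenant

# Pin the VF's identity and confine it to the tenant's network from the
# host side; the tenant's VM cannot override these from inside.
sh(f"ip link set dev {PF} vf {VF} mac 02:00:00:00:10:42")
sh(f"ip link set dev {PF} vf {VF} vlan 1042")      # tenant VLAN, illustrative
sh(f"ip link set dev {PF} vf {VF} spoofchk on")    # drop forged source MACs
sh(f"ip link set dev {PF} vf {VF} trust off")      # no promiscuous mode

# The VF is then passed through to the tenant VM (e.g. via VFIO), so
# RoCE v2 traffic flows NIC-to-NIC without touching the host stack.
```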
SmartNIC/DPU
Moving the SDN overlay off the CPU and onto a SmartNIC/DPU (like NVIDIA BlueField) returns the CPU to the tenant and delivers bare-metal network speeds with cloud-like security. Because the VXLAN overlay and security rules run on the DPU, enforcement is physically separated from the tenant: even a tenant with root access on the bare-metal server cannot change the network rules, because those rules execute on the DPU card, not the host CPU. This makes it possible to offer bare metal without losing control of the network. Ori leverages these technologies to provide customers with the best performance and the highest security in multi-tenant environments.
InfiniBand Partitioning
For the highest-performance training clusters, we use InfiniBand partitioning with PKeys (Partition Keys). PKeys function similarly to Ethernet VLANs but with stricter enforcement, creating logically isolated communication zones within the fabric for East-West traffic. The Ori Control Plane automatically configures the Subnet Manager to assign unique PKeys to each tenant’s cluster, guaranteeing that the RDMA traffic from one massive training job cannot interfere with another’s.
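To give a feel for the mechanism, the sketch below generates entries in the style of an OpenSM partitions.conf, the file the Subnet Manager reads to enforce PKeys. The tenant names, PKey values, and port GUIDs are placeholders, and the exact policy syntax should be checked against the opensm documentation.

```python
# Illustrative generator for an OpenSM partitions.conf. All tenant
# names, PKeys, and port GUIDs below are placeholders.
TENANTS = {
    "tenant_a": (0x8001, ["0x0002c903004ef1b4", "0x0002c903004ef1b8"]),
    "tenant_b": (0x8002, ["0x0002c903004ef2c0", "0x0002c903004ef2c4"]),
}

lines = [
    # The default partition: limited membership so nodes can reach the
    # Subnet Manager but cannot exchange traffic with each other.
    "Default=0x7fff, ipoib : ALL=limited, SELF=full;",
]
for name, (pkey, guids) in TENANTS.items():
    members = ", ".join(f"{g}=full" for g in guids)
    # Full members of a partition can only exchange RDMA traffic with
    # ports that share their PKey; the fabric drops everything else.
    lines.append(f"{name}=0x{pkey:04x}, ipoib : {members};")

print("\n".join(lines))  # in practice, written to the SM's partitions.conf
```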
Deep Dive: Securing the Storage Layer
Secure storage in a multi-tenant environment requires isolation at multiple levels, from the logical volume down to the physical network path.
Our platform integrates with high-performance parallel file systems and object stores and enforces tenant separation through a combination of mechanisms. Logically, tenants are confined to their own volumes, preventing any possibility of unauthorized cross-tenant data access. All access is governed by fine-grained, policy-based access controls (PBAC), which define not just who can access data, but from where and under what conditions. To protect the data itself, we enforce encryption both at-rest on the storage media and in-transit as it moves across the network.
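As an illustration of the “who, from where, under what conditions” shape of PBAC, here is a hypothetical policy entry; it is not any real storage vendor’s schema, just a sketch of the conditions involved.

```python
# A hypothetical PBAC entry (illustrative names, not a real schema).
policy = {
    "tenant": "acme-ai",
    "volume": "vol-acme-datasets",
    "permissions": ["read", "write"],
    "conditions": {
        # 'Who': only workloads running under the tenant's identity.
        "principal": "tenant:acme-ai/*",
        # 'From where': only the tenant's own overlay network.
        "source_vni": 10042,
        # 'Under what conditions': encrypted transport is mandatory.
        "require_tls": True,
    },
}
```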
These mechanisms are not theoretical. In practice, they allow diverse workloads to coexist securely on the same platform. Giotto, for example, runs private tenancy bare-metal clusters for large-scale training while using strict tenancy VMs for inference. nCompass and SumerSports take advantage of shared tenancy serverless Kubernetes for inference at scale, benefitting from GPU availability across multiple regions. Locai required strict tenancy with guaranteed UK-based GPUs and control plane components to build a sovereign AI model. This multi-level approach is critical for the regulated sectors Ori serves, providing the technical underpinnings needed to meet the stringent compliance and data sovereignty requirements of finance, healthcare, and government.
Bringing It All Together: Tenancy Models in Practice
This unified, hardware-accelerated framework gives us the flexibility to offer a spectrum of tenancy models on a single, shared cluster, catering to diverse customer needs.
- Soft Tenancy: Ideal for development workloads and cost-sensitive startups, this model uses logical isolation (Kubernetes namespaces, VXLAN overlays) to share resources efficiently.
- Strict Tenancy: For customers who require stronger guarantees for compliance, this model dedicates hardware-level resources like MIG instances or entire physical nodes to a specific tenant, all while running under a shared control plane.
- Private Tenancy: The ultimate in security, this model provides our most security-conscious customers with a fully dedicated set of physical nodes and a private instance of the control plane, ensuring complete data and operational isolation.
These tenancy models also map cleanly to regulatory requirements. Soft tenancy is well-suited for internal development or non-regulated workloads. Strict tenancy aligns with industries such as finance and healthcare, where regulators demand hardware-level guarantees for isolation and auditability. Private tenancy is often required for government, defense, and sovereign AI initiatives, where workloads must operate on physically dedicated infrastructure with full control plane separation. By aligning tenancy options with compliance categories, operators can satisfy regulators without sacrificing efficiency.
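As a rough sketch of how these models can resolve into concrete isolation primitives (the names and mappings below are illustrative assumptions, not Ori’s API):

```python
# Hypothetical mapping from tenancy model to the primitives provisioned.
TENANCY_PROFILES = {
    "soft": {
        "compute": "k8s-namespace",                   # logical isolation only
        "network": "vxlan-overlay",
        "control_plane": "shared",
    },
    "strict": {
        "compute": "mig-instance-or-dedicated-node",  # hardware-level isolation
        "network": "vxlan-overlay + sriov-vf",
        "control_plane": "shared",
    },
    "private": {
        "compute": "dedicated-nodes",
        "network": "dedicated-fabric-partition",
        "control_plane": "dedicated",                 # private control plane
    },
}

def provision(tenant_id: str, model: str) -> dict:
    """Resolve a tenancy model into the set of primitives to provision."""
    return {"tenant": tenant_id, **TENANCY_PROFILES[model]}

print(provision("acme-ai", "strict"))
```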
The Ori platform can provision any of these environments programmatically in minutes, a capability that sets a new standard for flexible, secure AI cloud infrastructure. This is not just a theoretical architecture; it is the proven foundation upon which our global public cloud operates, serving hundreds of customers with diverse security and performance requirements every day.
Summary
Ultimately, secure multi-tenancy for AI is an end-to-end architectural challenge that cannot be solved by software overlays alone. By building on a foundation of hardware-enforced segmentation across compute, storage, and networking, the Ori platform provides the verifiable security of a private cloud without sacrificing the performance and flexibility modern AI workloads demand.

