Building a Compliant GPU Cloud: From Bare Metal Foundations to Auditable Operations

AI-driven industries such as telecoms, finance, and healthcare demand both uncompromising performance and provable compliance from their GPU infrastructure. Virtualization has proven its utility for flexibility, but its additional abstraction layers create extra surfaces to audit and defend, which is problematic in highly regulated environments. As a result, customers and regulators increasingly question whether shared, virtualized infrastructure can deliver deterministic performance and an evidence chain simple enough to satisfy frameworks like GDPR, ISO 27001, or NIST.
Bare-metal GPU clusters address these concerns by offering maximum performance and a shorter proof chain. Workloads can be tied directly to physical hardware, reducing the number of software layers that auditors must trust and enabling clearer traceability of where data is processed. This doesn't eliminate risk, however; it moves it. Firmware, drivers, and orchestration still require rigorous patching and continuous monitoring.
True compliance comes from pairing bare-metal performance with disciplined organizational controls: strong change management, key management, workload placement policies, and comprehensive logging and auditing. The goal is not to promote bare metal as a silver bullet, but to highlight that when high-stakes AI workloads meet strict regulatory obligations, designing for both performance and proof from day one gives enterprises the strongest foundation.
This guide, part of the Ori First Principles series, outlines the key technical and organizational pillars required to build a GPU cloud that is both powerful and provably compliant.
Pillar 1: Establish a Hardware Root of Trust
Before a workload ever runs, you must be able to prove that the underlying hardware is authentic and untampered with. This creates an unbroken evidence chain from the silicon up.
- Hardware Attestation and Traceability: Use platform features like Trusted Platform Modules (TPMs) and secure boot to cryptographically verify the integrity of the host. This process ensures that the firmware, bootloader, and operating system have not been compromised. Signed firmware for components like GPUs and NICs further extends this chain of trust.
- Audit Artifact: Logs from the attestation service proving that each node presented a valid cryptographic measurement before being admitted to the cluster. This is crucial evidence for auditors focused on supply chain security and system integrity (a minimal sketch of the verification step follows this list).
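As an illustration of this admission check, the sketch below replays a set of known-good boot-stage measurements into a simulated PCR (modeling the TPM extend operation, where each new measurement is hashed into the previous register value) and compares the result with the value a node reports. The node name, measurement values, and expected_boot_chain allowlist are hypothetical; a real attestation service would verify a signed TPM quote against a fresh nonce rather than a bare digest.

```python
import hashlib

def extend_pcr(pcr: bytes, measurement: bytes) -> bytes:
    """Model of the TPM PCR extend operation: new = SHA-256(old || measurement)."""
    return hashlib.sha256(pcr + measurement).digest()

def expected_pcr(boot_chain: list[bytes]) -> bytes:
    """Replay the known-good measurements for firmware, bootloader, and OS."""
    pcr = bytes(32)  # SHA-256 PCR banks start at all zeros
    for measurement in boot_chain:
        pcr = extend_pcr(pcr, measurement)
    return pcr

# Hypothetical allowlist of known-good component digests for this node class.
expected_boot_chain = [
    hashlib.sha256(b"gpu-node-firmware-v2.1").digest(),
    hashlib.sha256(b"bootloader-v1.9-signed").digest(),
    hashlib.sha256(b"os-image-2024.06-hardened").digest(),
]

def admit_node(node_name: str, quoted_pcr: bytes) -> bool:
    """Gate cluster admission on a matching measurement; log the decision either way."""
    ok = quoted_pcr == expected_pcr(expected_boot_chain)
    print(f"attestation node={node_name} result={'pass' if ok else 'fail'}")
    return ok

# Example: a node reporting the expected measurement is admitted.
admit_node("gpu-node-01", expected_pcr(expected_boot_chain))
```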
Pillar 2: Enforce Hardware-Level Isolation
Software-based isolation can be bypassed. For regulated workloads, enforcement must happen at the hardware level, creating deterministic, auditable boundaries between tenants and workloads.
- GPU Partitioning (NVIDIA MIG): Multi-Instance GPU (MIG) technology splits a physical GPU into up to seven independent, hardware-isolated instances. Each instance has its own dedicated memory, cache, and compute resources, preventing data leakage or performance interference. Even in single-tenant clusters, MIG is invaluable for:
- Right-sizing and efficiency: Running multiple isolated jobs on the same GPU without resource contention.
- Fault Isolation: Containing failures or performance issues within a single internal team or application.
- Compliance: Separating production, testing, and development environments or different data domains on the same physical hardware without requiring a hypervisor (a minimal slice-assignment sketch follows this list).
- Network Segmentation (SmartNICs/DPUs and Overlays): Isolate tenant traffic using a combination of technologies:
- EVPN/VXLAN: Network overlays that create logical L2/L3 networks on top of the physical fabric.
- InfiniBand Partitioning: Divides RDMA fabrics into isolated segments.
- SmartNICs/DPUs: These are critical for modern compliant infrastructure. SmartNICs not only accelerate packet processing and RoCE encryption for performance gains but also provide hardware-enforced segmentation, auditability, and fine-grained telemetry for compliance reporting. By offloading policy enforcement from the host CPU, they create a more secure and performant isolation boundary (a minimal segment-assignment sketch also follows below).
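To make the compliance use of MIG concrete, here is a minimal sketch of the slice-assignment idea referenced above: hardware-isolated MIG instances on one physical GPU are bound to separate environments, and each binding produces an audit record. The MIG UUIDs, profile names, and environment labels are illustrative placeholders; in practice they would come from the node's GPU inventory (for example, as reported by nvidia-smi) and from workload metadata.

```python
from dataclasses import dataclass
import json, time

@dataclass(frozen=True)
class MigSlice:
    gpu_index: int   # physical GPU the slice lives on
    profile: str     # e.g. "1g.10gb" (illustrative profile name)
    uuid: str        # MIG device UUID (placeholder values below)

# Hypothetical inventory for a single GPU split into isolated slices.
inventory = [
    MigSlice(0, "3g.40gb", "MIG-aaaa-0001"),
    MigSlice(0, "2g.20gb", "MIG-aaaa-0002"),
    MigSlice(0, "1g.10gb", "MIG-aaaa-0003"),
]

assignments: dict[str, str] = {}  # slice UUID -> environment

def assign_slice(mig: MigSlice, environment: str) -> dict:
    """Bind a MIG slice to exactly one environment and produce an audit record."""
    if mig.uuid in assignments:
        raise ValueError(f"{mig.uuid} already assigned to {assignments[mig.uuid]}")
    assignments[mig.uuid] = environment
    record = {
        "event": "mig_slice_assigned",
        "time": time.time(),
        "gpu_index": mig.gpu_index,
        "profile": mig.profile,
        "mig_uuid": mig.uuid,
        "environment": environment,
    }
    print(json.dumps(record))  # in practice, ship to the audit pipeline (Pillar 4)
    return record

# Production, test, and dev share the physical GPU but never a slice.
assign_slice(inventory[0], "production")
assign_slice(inventory[1], "testing")
assign_slice(inventory[2], "development")
```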
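Similarly, the sketch below models the tenant-to-segment mapping that a SmartNIC/DPU or fabric manager would ultimately enforce: each tenant receives a dedicated VXLAN VNI and an InfiniBand partition key, and the intended policy is emitted as a reviewable artifact. The zone names, VNI ranges, and PKey values are assumptions for illustration; this is not the output of any real DPU API.

```python
import json

# Illustrative, non-overlapping identifier pools per compliance zone.
VNI_BASE = {"eu-frankfurt": 20000, "us-general": 30000}
PKEY_BASE = {"eu-frankfurt": 0x1000, "us-general": 0x2000}

tenants: dict[str, dict] = {}

def provision_segment(tenant: str, zone: str) -> dict:
    """Allocate an isolated overlay (VXLAN VNI) and RDMA partition (IB PKey) for a tenant."""
    if zone not in VNI_BASE:
        raise ValueError(f"unknown compliance zone: {zone}")
    index = sum(1 for t in tenants.values() if t["zone"] == zone)
    policy = {
        "tenant": tenant,
        "zone": zone,
        "vxlan_vni": VNI_BASE[zone] + index,
        "ib_pkey": hex(PKEY_BASE[zone] + index),
        "enforced_by": "smartnic/dpu",  # policy pushed to the NIC, not the host CPU
    }
    tenants[tenant] = policy
    print(json.dumps(policy))  # audit artifact: the intended segmentation policy
    return policy

provision_segment("healthcare-team", "eu-frankfurt")
provision_segment("analytics-team", "us-general")
```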
Pillar 3: Automate Placement with Compliance-Aware Scheduling
Hardware primitives are only effective if workloads are correctly placed on them. The scheduler is the brain of the operation, turning policy into automated enforcement.
- Policy-Driven Placement: The scheduler must be able to tag workloads with regulatory requirements (e.g., gdpr-zone=frankfurt, hipaa=true, pci-dss=isolated).
- Enforced Scheduling: The scheduler uses these tags to place workloads only on nodes that have been certified and configured to meet those specific requirements.
- Resilient Compliance: On a node failure, the scheduler must intelligently reschedule workloads to another compliant node, preventing a HIPAA workload from accidentally landing in a general-purpose pool (a minimal filtering sketch follows this list).
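The sketch below illustrates the filtering step described in this pillar: a workload's compliance tags must all be satisfied by a candidate node's labels, and if no compliant node exists the workload stays pending rather than falling back to a general-purpose pool. The node names, labels, and tags are hypothetical; in a Kubernetes-based stack the same idea is typically expressed with node labels, nodeSelector/affinity rules, and taints.

```python
# Hypothetical node inventory with compliance labels applied at provisioning time.
nodes = {
    "gpu-node-01": {"gdpr-zone": "frankfurt", "hipaa": "true"},
    "gpu-node-02": {"gdpr-zone": "frankfurt"},
    "gpu-node-03": {"pci-dss": "isolated"},
}

def compliant_nodes(workload_tags: dict[str, str]) -> list[str]:
    """Return nodes whose labels satisfy every compliance tag on the workload."""
    return [
        name for name, labels in nodes.items()
        if all(labels.get(key) == value for key, value in workload_tags.items())
    ]

def place(workload: str, tags: dict[str, str]) -> str | None:
    """Place a workload only on a compliant node; never fall back to a general pool."""
    candidates = compliant_nodes(tags)
    if not candidates:
        print(f"{workload}: no compliant node available, leaving Pending")
        return None
    chosen = candidates[0]  # a real scheduler would also score on utilization, topology, etc.
    print(f"{workload}: placed on {chosen} (tags={tags})")
    return chosen

place("phi-inference", {"gdpr-zone": "frankfurt", "hipaa": "true"})  # -> gpu-node-01
place("card-tokenizer", {"pci-dss": "isolated", "hipaa": "true"})    # -> Pending
```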
Pillar 4: Ensure Continuous Auditability
Compliance requires not just prevention but also proof. Every significant action in the cluster must be logged and made available for auditors and security teams.
- Comprehensive Logging: Capture a complete record of events, including:
- GPU allocations and MIG slice assignments.
- Hardware attestation results.
- Workload placement decisions and the compliance tags that drove them.
- Network policy changes, especially those configured on SmartNICs/DPUs.
- SIEM Export: All logs must be streamed to a central Security Information and Event Management (SIEM) system. This integrates the GPU cluster into the organization's broader security and compliance monitoring framework, allowing for correlation and alerting (a minimal export sketch follows below).
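As a sketch of the export step, the snippet below wraps cluster events in a structured JSON envelope and forwards them over syslog, a transport most SIEM platforms can ingest. The collector address and event fields are assumptions; in practice you would match the transport and schema (for example CEF or ECS) that your SIEM expects.

```python
import json
import logging
import logging.handlers
import time

# Assumed syslog collector for the SIEM; replace localhost with your real endpoint.
handler = logging.handlers.SysLogHandler(address=("localhost", 514))
logger = logging.getLogger("gpu-cluster-audit")
logger.setLevel(logging.INFO)
logger.addHandler(handler)

def emit(event_type: str, **fields) -> None:
    """Send one structured audit event; the SIEM correlates these across the estate."""
    event = {"event": event_type, "time": time.time(), **fields}
    logger.info(json.dumps(event))

# Examples covering the event classes listed above.
emit("attestation_result", node="gpu-node-02", result="pass")
emit("mig_slice_assigned", node="gpu-node-01", mig_uuid="MIG-aaaa-0001", environment="production")
emit("placement_decision", workload="phi-inference", node="gpu-node-01",
     tags={"gdpr-zone": "frankfurt", "hipaa": "true"})
emit("network_policy_change", device="dpu-07", change="vxlan vni 20001 added")
```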
Pillar 5: Implement Robust Organizational Governance
Technology alone is not enough. The strongest technical controls can be undermined by weak operational processes.
- Rigorous Change Control: All changes to the cluster configuration—from firmware updates to network policies—must go through a formal, auditable approval process.
- Strict Key Management: Securely manage the lifecycle of all cryptographic keys used for hardware attestation, data encryption, and secure boot.
- Documented Operational Runbooks: Maintain clear procedures for everything from node provisioning and decommissioning to incident response, ensuring that compliant processes are followed consistently.
Conclusion: Designing for Performance and Proof
The demand for high-performance, compliant AI infrastructure is undeniable. While bare metal provides a powerful foundation by simplifying the audit trail and maximizing performance, we want to be transparent that it is not a complete solution: it offers a shorter proof chain than virtualization, but it is not a panacea.
A truly robust and auditable GPU cloud is built on a holistic strategy that combines a hardware root of trust, hardware-enforced isolation via MIG and SmartNICs, policy-driven scheduling, and comprehensive logging. Critically, these technical pillars must be supported by disciplined organizational governance. By designing for both performance and proof from day one, enterprises can build the strong foundation needed to unlock the full potential of AI in regulated industries.
Appendix: Mapping Technical Controls to Compliance Frameworks
| Framework / Obligation | Relevant Control Requirement | Technical Measures in a GPU Cluster | Evidence / Audit Artifact |
|---|---|---|---|
| ISO/IEC 27001 | A.9 Access Control, A.12 Operations Security, A.13 Communications Security | Hardware-level isolation with MIG and SR-IOV; network segmentation with EVPN/VXLAN and SmartNICs; encrypted RoCE traffic. | Logs of MIG slice assignments; SmartNIC policy configs; network policy configs; encryption status reports. |
| SOC 2 (Security & Availability) | Logical & physical access controls, system monitoring, change management | Bare-metal provisioning for traceability; compliance-aware scheduler; comprehensive audit logging of GPU/NIC allocation. | Scheduler placement logs; provisioning events; change management records; SIEM exports. |
| NIST 800-53 (Rev. 5) | SC-7 Boundary Protection, AU-2 Audit Events, SI-2 System Integrity | InfiniBand partitioning; SmartNIC/DPU policy offload; automated failover with compliance-aware policies; TPM-based attestation. | Boundary policy configs from SmartNICs; workload placement logs; secure boot and attestation logs. |
| GDPR | Data localization, data isolation, lawful processing | Tenancy models mapped to workload categories; policy-aware placement ensuring data remains in specific datacenters. | Geotagged placement logs; scheduler policies enforcing cluster residency; data-in-transit encryption reports. |
| HIPAA | PHI isolation, auditability of access, secure transmission | Dedicated tenancy for healthcare workloads; encrypted data in transit (RoCE via SmartNIC); workload-to-GPU traceability. | Encryption configs; GPU allocation records tied to specific workload and user IDs; access logs. |
| Telco Lawful Intercept (LI) | Dedicated infrastructure, chain-of-custody for sensitive workloads | Exclusive tenancy with node-level isolation; workload tagging and enforced placement on physically segregated hardware. | Node allocation logs showing physical segregation; scheduler enforcement reports; SIEM alerts on LI workload movement. |
| PCI-DSS | Segmentation of cardholder data environments, logging, access control | Private tenancy for CDE workloads on isolated network segments enforced by SmartNICs; encrypted storage namespaces. | SmartNIC firewall rule sets; namespace allocation logs; access logs; compliance dashboard exports. |
