NVIDIA Rubin CPX and the future of LLM inference

Inference is already the dominant and growing component of AI compute and spend. Tirias Research projects that annual inference token generation will skyrocket from 677 trillion tokens in 2024 to 77,000 trillion (77 quadrillion) by the end of 2030. A key driver of this explosion is the rise of multi‑step reasoning models, following the launch of the DeepSeek R1 model in January 2025. Barclays anticipates that agentic tasks can generate up to 25x more tokens per query than simple chat, due to multi‑step reasoning and tool use. Over the next few years, the success of AI infrastructure will hinge on how efficiently we serve inference tokens: fast, predictable, and at scale.

[Chart: projected annual inference token generation, 2024–2030. Source: Tirias Research]
Demystifying LLM inference: two distinct phases with opposite needs
LLM (large language model) inference has been the most popular use of generative AI so far, spanning coding assistants, chatbots, customer support, language translation, scientific research, and much more. However, LLM serving is not one workload; it’s two.
LLMs typically produce text step by step, with each token generated only after the model has taken into account every token that came before. This means the first predicted token is based solely on the prompt, the second depends on the prompt plus the first token, the third depends on the prompt plus the first two tokens, and so forth.
Because of the attention mechanism, generating the next token requires query, key, and value vectors for all previous tokens. To avoid recomputing these repeatedly, models use a key–value (KV) cache: each new token requires only one additional set of key and value vectors, which is appended to the cache. For the very first generated token, however, the cache is empty, so the model must compute key and value vectors for the entire input prompt; since all prompt tokens are available up front, these calculations can run in parallel.
This contrast gives rise to two distinct phases: the prefill phase, where computations for all prompt tokens can be parallelized to produce the first output token, and the decode phase, where subsequent tokens must be generated sequentially, one at a time.
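To make the contrast concrete, here is a minimal single-head attention sketch in NumPy, with toy dimensions and random weights (an illustration of the mechanics, not any particular model): prefill projects every prompt token in one batched operation and fills the KV cache, while decode appends a single key/value pair per step and re-reads the entire cache.

```python
import numpy as np

d = 64                      # toy head dimension (illustrative, not a real model config)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def attend(q, K, V):
    """Single-head scaled dot-product attention of one query over the cached K/V."""
    scores = (K @ q) / np.sqrt(d)          # (seq_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return V.T @ weights                   # (d,)

# --- Prefill: all prompt tokens are known, so projections run as one batched matmul.
prompt = rng.standard_normal((512, d))     # 512 prompt-token embeddings
K_cache = prompt @ Wk                      # (512, d) keys, computed in parallel
V_cache = prompt @ Wv                      # (512, d) values, computed in parallel
hidden = attend(prompt[-1] @ Wq, K_cache, V_cache)  # drives the first output token (TTFT)

# --- Decode: each new token adds ONE K/V row but re-reads the WHOLE cache.
for _ in range(16):                        # generate 16 tokens
    x = hidden                             # stand-in for the new token's embedding
                                           # (a real model would sample and re-embed here)
    K_cache = np.vstack([K_cache, x @ Wk]) # append one key row
    V_cache = np.vstack([V_cache, x @ Wv]) # append one value row
    hidden = attend(x @ Wq, K_cache, V_cache)  # streams the full cache every step
```

The prefill section is dominated by large matrix multiplications (compute-bound), while each decode step performs little arithmetic but streams the whole cache from memory (bandwidth-bound); that asymmetry is what the rest of this post builds on.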

- Prefill (context build). The model ingests the user’s entire prompt, constructs the key–value (KV) cache, and emits the first token. Because all input tokens are known up‑front, kernels run in parallel, so the work is compute‑bound; this phase largely sets time‑to‑first‑token (TTFT).
- Decode (generation). The model then generates tokens one‑by‑one, re‑reading that KV cache at every step. This phase is memory‑bandwidth‑ and capacity‑bound; it governs inter‑token latency (ITL) and overall inference throughput.
Why this matters: On traditional GPUs, compute‑hungry prefills and bandwidth‑hungry decodes can slow each other down, even with continuous batching. Batching techniques like chunked prefill (splitting prompts into slices) improve “goodput” under mixed traffic by letting decodes “breathe” between prefill chunks, yet they don’t remove the fundamental tension.
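To illustrate the scheduling idea only (this is not the implementation of any particular serving engine, and `step_prefill`/`step_decode` are hypothetical engine callbacks), a chunked-prefill loop can be sketched as follows: long prompts are sliced into fixed-size chunks, and every in-flight decode gets one step between chunks so generation never stalls behind a whole prompt.

```python
from collections import deque

CHUNK = 2048  # prefill chunk size in tokens (illustrative value)

def serve(requests, step_prefill, step_decode):
    """Toy continuous-batching loop with chunked prefill.

    `requests` holds dicts like {"prompt_tokens": 9000, "done": False};
    `step_prefill(req, n)` ingests n prompt tokens, and `step_decode(batch)`
    emits one token per in-flight request, setting "done" when it finishes.
    """
    prefilling = deque(r for r in requests if r["prompt_tokens"] > 0)
    decoding = [r for r in requests if r["prompt_tokens"] == 0]

    while prefilling or decoding:
        # 1) Advance at most one prefill chunk per iteration...
        if prefilling:
            req = prefilling[0]
            n = min(CHUNK, req["prompt_tokens"])
            step_prefill(req, n)
            req["prompt_tokens"] -= n
            if req["prompt_tokens"] == 0:      # prompt fully ingested
                prefilling.popleft()
                decoding.append(req)

        # 2) ...then let every in-flight decode "breathe" for one token.
        if decoding:
            step_decode(decoding)
            decoding = [r for r in decoding if not r["done"]]
```

Even with this interleaving, prefill chunks and decode steps still compete for the same GPU's compute and bandwidth, which is the tension the next section addresses.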
Compute disaggregation: why separating prefill and decode is the future of serving
GPUs for AI typically feature large quantities of HBM (high‑bandwidth memory), which stacks advanced DRAM in 3D packaging for extreme bandwidth and low latency. HBM is expensive and accounts for a significant portion of the unit cost of advanced datacenter‑grade NVIDIA Hopper, Blackwell, and Rubin GPUs. It is extremely useful in the decode phase, especially to accommodate larger context sizes and sequence lengths, which need larger KV caches. Prefill, however, is compute‑intensive and places far lighter demands on memory, so much of the expensive HBM goes underused during this phase of inference.
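A back-of-the-envelope KV-cache sizing shows why long contexts pull decode toward HBM. The model shape below is an assumption chosen for illustration (a 70B-class configuration with grouped-query attention), not a published specification:

```python
# Back-of-the-envelope KV-cache size for ONE request (illustrative model shape).
layers      = 80         # assumed transformer layers
kv_heads    = 8          # assumed KV heads (grouped-query attention)
head_dim    = 128        # assumed head dimension
bytes_per   = 2          # FP16/BF16 cache entries
context_len = 1_000_000  # million-token context

# Two tensors (K and V) per layer, one row per token.
kv_bytes = 2 * layers * kv_heads * head_dim * bytes_per * context_len
print(f"KV cache: {kv_bytes / 1e9:.0f} GB per request")   # ~328 GB
```

At roughly 330 GB for a single million-token request, before counting model weights or concurrent users, the capacity and bandwidth of the decode GPUs' memory become the binding constraint.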
Disaggregated inference is NVIDIA’s answer to this problem: it splits inference across two GPU pools, running each phase on its own pool and passing the KV cache between them (a control‑flow sketch follows the list below). Benefits:
- Responsiveness: Compute‑heavy prefills stop interrupting latency‑sensitive decodes which enables lower TTFT and smoother ITL under load.
- Right‑sizing: Provision prefill‑optimized Rubin CPX GPUs for context ingest and blend in HBM‑heavy Rubin GPUs where decode needs them, enabling granular performance control.
- Better tokenomics: Disaggregated compute enhances throughput per rack while reducing compute spend and energy usage.
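A highly simplified view of the control flow is sketched below. The pool objects and their `acquire`/`prefill`/`load_kv`/`decode` methods are hypothetical stand-ins, not a real API; in practice the KV hand-off rides on NVLink or RDMA transports managed by the serving framework.

```python
async def prefill_worker(request, cpx_pool):
    """Run the compute-bound context phase on a prefill-optimized (CPX-style) pool."""
    gpu = await cpx_pool.acquire()                  # hypothetical pool API
    try:
        kv_cache, first_token = await gpu.prefill(request.prompt)
        return kv_cache, first_token
    finally:
        cpx_pool.release(gpu)

async def decode_worker(request, kv_cache, hbm_pool):
    """Stream tokens from an HBM-heavy pool that re-reads the transferred cache."""
    gpu = await hbm_pool.acquire()
    try:
        await gpu.load_kv(kv_cache)                 # KV hand-off (NVLink / RDMA in practice)
        async for token in gpu.decode(request.sampling_params):
            yield token
    finally:
        hbm_pool.release(gpu)

async def serve(request, cpx_pool, hbm_pool):
    kv_cache, first_token = await prefill_worker(request, cpx_pool)
    yield first_token                               # TTFT is set by the prefill pool
    async for token in decode_worker(request, kv_cache, hbm_pool):
        yield token                                 # ITL is governed by the decode pool
```

Because the two pools scale independently, an operator can add prefill capacity for long-prompt traffic without over-provisioning HBM, and vice versa.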
NVIDIA Rubin CPX for optimized inference
Rubin CPX is a purpose‑built, context‑optimized GPU that accelerates the compute‑bound phase of inference. Built on the Rubin architecture, CPX delivers up to 30 petaFLOPS (NVFP4), pairs that throughput with 128 GB of cost‑efficient GDDR7, and adds hardware attention acceleration (≈3× vs. GB300 NVL72) to keep performance high as context windows stretch toward a million tokens. SemiAnalysis estimates that Rubin CPX’s GDDR7 memory is five times more cost‑effective than HBM.
| Pipeline role | Hardware | Memory | Optimized for | Typical benefits |
|---|---|---|---|---|
| Prefill (context) | Rubin CPX | 128 GB GDDR7 | Low‑precision compute throughput and accelerated attention for long‑context ingest (code, video) | Lower TTFT; heavy prefills isolated from decodes |
| Decode (generation) | Standard Rubin GPU | HBM4 (NVLink‑connected) | High bandwidth and large capacity to re‑read the KV cache for every token | Higher tokens per second (TPS); steadier tail latency (p95/p99) under load |
Rubin CPX is designed to operate alongside NVIDIA Vera CPUs and Rubin GPUs for long‑context inference. The NVIDIA Vera Rubin NVL144 CPX rack integrates 144 Rubin CPX GPUs, 144 Rubin GPUs, and 36 Vera CPUs to deliver 8 exaFLOPS of NVFP4 compute, 7.5x more than the GB300 NVL72.
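As a rough sanity check on the headline figure (a back-of-the-envelope split, not an official breakdown), the CPX parts alone account for a little over half of the rack’s NVFP4 compute:

```python
cpx_pflops  = 30    # NVFP4 petaFLOPS per Rubin CPX (from the spec above)
cpx_count   = 144   # CPX GPUs per Vera Rubin NVL144 CPX rack
rack_eflops = 8     # headline NVFP4 exaFLOPS for the rack

cpx_total = cpx_count * cpx_pflops / 1000   # 4.32 exaFLOPS from CPX alone
rubin_share = rack_eflops - cpx_total       # remainder (~3.7 exaFLOPS) from the 144 Rubin GPUs
print(f"CPX contribution: {cpx_total:.2f} EF; Rubin GPU contribution: ~{rubin_share:.1f} EF")
```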

Rubin CPX has the potential to reshape inference across multiple dimensions. By separating compute-intensive prefills from decodes, it reduces contention between the two, lowering TTFT and stabilizing ITL and tokens per second (TPS) even with long prompts and many concurrent requests.
On the economics side, throughput per dollar improves because high-bandwidth HBM is concentrated where it matters most, during decode, while compute-dense, GDDR-based CPX handles the prefill phase. Perhaps most transformative, Rubin CPX makes million-token contexts practical at scale: with attention acceleration and built-in media engines, workloads such as repository-scale coding or generative video become feasible, while HBM-rich Rubin GPUs ensure smooth, uninterrupted decode performance.
Leverage the best of NVIDIA on Ori Inference
Rubin CPX is expected to be available toward the end of 2026, and NVIDIA estimates that the Rubin CPX platform at scale can deliver a 30x to 50x return on investment, translating to as much as $5B in revenue from a $100M capex investment. Ori AI Fabric enables you to build your own global inference cloud, making the most of top-tier NVIDIA GPUs to deliver the most flexible inference platform:
- Serve any model, foundation or custom, in a multi-framework runtime environment.
- Autoscaling with demand (including scale-to-zero) so users pay only for what they use while keeping up with incoming requests.
- Built-in authentication, DNS management, and native integration with the Model Registry and Fine-tuning Studio for simple, secure, and instant deployments.
- Supports both serverless for token-based usage and dedicated GPUs for strict performance or security needs.
- Automatic routing to regions with the lowest latency or sovereign locations.
Ori’s Inference Delivery Network delivers cold starts under 5 seconds, compared to the dozens of seconds, or even minutes, on other platforms. That speed, combined with localized deployments, is why customers across the globe trust Ori to deliver low-latency inference. Because Inference Endpoints are seamlessly integrated with Ori Fine-tuning Studio and Model Registry, deploying a model is just one click away.
Power your AI ambitions with the Ori AI Cloud
Looking to harness the performance of NVIDIA Blackwell and Hopper GPUs for your AI-driven business? Ori’s end-to-end cloud is built specifically for AI/ML workloads, giving you direct access to the world’s most advanced GPUs, high-performance storage, and AI-optimized networking so you can:
- Launch Supercomputers in minutes, with NVIDIA GPUDirect and shared storage.
- Spin up GPU Instances as on-demand virtual machines.
- Deploy purpose-built GPU Clusters for the highest level of performance.
- Manage AI workloads on Serverless Kubernetes without infrastructure overhead.

