
A deep dive into NVIDIA’s Blackwell platform

Deepak Manoor
Posted: July 22, 2024

NVIDIA unveiled its latest GPU platform, Blackwell, at GTC earlier this year. The new platform, named after the pioneering mathematician and statistician David Blackwell, includes two powerful GPUs - the B100 and B200 - as well as the GB200 supercomputer series. In this blog post, we explore what makes Blackwell GPUs unique and how they can unleash the next wave of AI computing.

    What’s new in NVIDIA’s next generation of GPUs

AI Superchip: Each Blackwell superchip consists of two dies connected by a 10TB/s chip-to-chip (C2C) interconnect, coming together as a single GPU with full cache coherence. These dies, built with TSMC's custom 4NP fabrication process, feature a whopping 208 billion transistors, compared to the 80 billion transistors in Hopper.

The new NVIDIA Blackwell chips offer larger memory capacity for bigger models and more than double the memory bandwidth. This is crucial because a memory wall can prevent large AI models from taking full advantage of GPU processing power. Another key feature of Blackwell is its incredibly fast GPU-to-GPU connection, enabling multiple GPUs to work together as unified compute blocks. Here's the NVIDIA B200 vs H100 feature comparison:

Attribute | HGX H100 8-GPU | HGX B200 8-GPU
Form factor | 8x NVIDIA H100 SXM | 8x NVIDIA Blackwell SXM
Aggregate Memory Capacity | Up to 640GB | Up to 1.4TB
Aggregate Memory Bandwidth | Up to 27TB/s | Up to 62TB/s
Aggregate NVLink Interconnect | 7.2TB/s | 14.4TB/s
NVLink | 4th Gen | 5th Gen
NVSwitch | 3rd Gen | 4th Gen

2nd Gen Transformer Engine: features 5th Gen Tensor Cores that support new quantization formats and precisions. This engine greatly speeds up inference for Mixture of Experts (MoE) models by using dynamic range management and advanced microscaling formats. Dynamic range management allows the engine to adjust and refine numerical formats down to lower precisions, continuously optimizing models for better performance.
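
To make microscaling concrete, here is a minimal NumPy sketch of MX-style block quantization (our illustration, not NVIDIA's implementation): each block of 32 values shares a single power-of-two scale, and the elements are rounded to a small low-precision grid. For simplicity it uses a signed integer grid rather than the exact FP4 (E2M1) element format; `block` and `elem_bits` are illustrative parameters.

```python
import numpy as np

def mx_quantize(x, block=32, elem_bits=4):
    """Simulate MX-style microscaling: each block of `block` values shares one
    power-of-two scale (E8M0-like); elements round to a small signed grid."""
    pad = (-x.size) % block
    v = np.pad(x.astype(np.float32), (0, pad)).reshape(-1, block)
    qmax = 2 ** (elem_bits - 1) - 1                     # e.g. +/-7 levels for 4 bits
    amax = np.abs(v).max(axis=1, keepdims=True)         # per-block dynamic range
    scale = 2.0 ** np.ceil(np.log2(np.maximum(amax, 1e-38) / qmax))
    q = np.clip(np.rint(v / scale), -qmax, qmax)        # low-precision elements
    return (q * scale).reshape(-1)[: x.size]            # dequantized for comparison

x = np.random.randn(1 << 16).astype(np.float32)
print("mean abs error:", np.mean(np.abs(x - mx_quantize(x))))
```

The shared per-block scale is what preserves dynamic range: an outlier only degrades the values in its own block rather than the whole tensor.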

Precision type | HGX H100 8-GPU | HGX B200 8-GPU
Tensor Core precisions | FP64, TF32, BF16, FP16, FP8, INT8 | FP64, TF32, BF16, FP16, FP8, INT8, FP6, FP4
CUDA® Core precisions | FP64, FP32, FP16, BF16, INT8 | FP64, FP32, FP16, BF16
    B200 vs H100 Performance

    Source: NVIDIA - Blackwell HGX performance data

MoE models run inference significantly faster than equivalent dense models thanks to the efficiency of conditional computation and the sparsity that comes with expert parallelism. However, these models require more VRAM because the system must load all experts and their parameters into memory. In addition to the substantially higher memory capacity and bandwidth, Blackwell's lower-precision formats and microscaling help alleviate this problem by enabling larger models with more parameters to fit into GPUs.
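
To illustrate the conditional computation described above, here is a small NumPy sketch of top-k MoE routing (a toy with made-up sizes, not a specific NVIDIA model): all eight experts' weights stay resident in memory, yet each token is processed by only its two highest-scoring experts.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_model, top_k = 8, 64, 2
# Every expert must be loaded, even though each token uses only top_k of them
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_forward(tokens):
    logits = tokens @ router                           # router scores per token
    top = np.argsort(logits, axis=1)[:, -top_k:]       # indices of top-k experts
    w = np.take_along_axis(logits, top, axis=1)
    w = np.exp(w) / np.exp(w).sum(axis=1, keepdims=True)  # softmax over chosen experts
    out = np.zeros_like(tokens)
    for e in range(n_experts):                         # only routed tokens touch expert e
        mask = (top == e)
        rows = mask.any(axis=1)
        if rows.any():
            out[rows] += (tokens[rows] @ experts[e]) * w[rows][mask[rows]][:, None]
    return out

print(moe_forward(rng.standard_normal((16, d_model))).shape)  # (16, 64)
```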

    B200 Inference Performance

    Source: NVIDIA - Blackwell HGX performance data

    This paper on Microscaling (MX) formats for generative AI discusses benchmark findings, showcasing the impressive potential of smaller precision formats for both training and inference, with only minor accuracy losses. As smaller precision formats advance, more ML developers are likely to embrace these innovations for model development.

    B200 Training Performance

    Source: NVIDIA - Blackwell HGX performance data

The new Transformer Engine also speeds up LLM training by enhancing the NVIDIA NeMo Framework and integrating expert parallelism techniques from Megatron-Core. We expect these advancements to pave the way for the first 10-trillion-parameter models.

5th Generation NVLink: At 1.8TB/s bidirectional throughput per GPU, this new generation of GPU-to-GPU interconnect is twice as fast as the previous generation and enables seamless high-speed communication among up to 576 GPUs. Accelerated in-network computation makes NCCL collective operations more efficient and helps GPUs reach synchronization faster. The latest generation of NVLink Switch (NVSwitch) enables multi-GPU clusters such as the GB200 NVL72 to reach an aggregate bandwidth of 130TB/s for large models.
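
To ground what an NCCL collective looks like from the developer's seat, here is a minimal PyTorch sketch of the all-reduce at the heart of data-parallel training (our example, assuming a multi-GPU node and a `torchrun --nproc_per_node=<gpus>` launch); NVLink and NVSwitch accelerate exactly this kind of exchange.

```python
import torch
import torch.distributed as dist

def main():
    # The NCCL backend routes collectives over NVLink where available
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Each GPU contributes a gradient-like tensor; all_reduce sums them in place
    # so every rank ends up with the identical result
    grad = torch.full((1024, 1024), float(rank), device="cuda")
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)

    print(f"rank {rank}: sum of ranks = {grad[0, 0].item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```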

    Confidential computing enhancements: The latest Blackwell GPUs now feature Trusted Execution Environment (TEE) technology. While CPUs have supported TEE for a long time to ensure data confidentiality and integrity in applications like content authentication and secure financial transactions, NVIDIA GPUs now also offer TEE-I/O capabilities. This means enhanced data protection through inline protection on NVLink connections. Additionally, Blackwell GPUs provide data encryption at rest, in motion, and during computation.

Superfast decompression for data analytics: Blackwell can decompress data at a blistering 800GB/s with formats such as LZ4, Snappy, and Deflate. Paired with the 8TB/s of HBM3e (High Bandwidth Memory) bandwidth and the lightning-fast NVLink-C2C interconnect to the Grace CPU, the GB200 makes the data pipeline extremely fast. NVIDIA's benchmarks on a GB200 cluster reveal 18x faster queries/sec than a traditional CPU and 6x faster than an H100 GPU, making GPUs increasingly suitable for data analytics and database workloads.
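
As a taste of what GPU-accelerated analytics looks like in code, here is a minimal sketch using RAPIDS cuDF (our example, not NVIDIA's benchmark suite); the file and column names are placeholders. Reading Snappy-compressed Parquet decompresses and parses entirely on the GPU, and the query never leaves GPU memory.

```python
import cudf  # RAPIDS cuDF: pandas-like DataFrames on the GPU

# Decompression and parsing both happen on the GPU, so the data never
# bounces through the CPU path (events.parquet is a hypothetical file,
# Snappy-compressed by default when written by most Parquet writers)
df = cudf.read_parquet("events.parquet")

# A typical analytics query, executed entirely in GPU memory
result = (
    df.groupby("user_id")["latency_ms"]
      .mean()
      .sort_values(ascending=False)
      .head(10)
)
print(result)
```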

    Reliability, availability and serviceability (RAS) engine: performs automatic, built-in tests on computational cores and memory in the Blackwell chip. This is especially important for large supercomputer clusters, as it allows teams to replace underperforming GPU boards and keep performance high while protecting their GPU investments. 


    Understanding the Blackwell GPU lineup: B200 vs B300 vs GB200

The NVIDIA Blackwell family of GPU-based systems comprises the HGX B100, HGX B200, HGX B300, DGX B200, and GB200 supercomputers such as the GB200 NVL36 and GB200 NVL72. The table below lists their specs and performance figures as provided by NVIDIA:

Attribute | HGX B200 | HGX B300 | GB200 NVL72
Form factor | 8x NVIDIA B200 SXM | 8x NVIDIA Blackwell Ultra SXM | GB200 boards (1 Grace CPU : 2 Blackwell GPUs each; 36 CPUs : 72 GPUs total)
CPU Platform | x86 | x86 | NVIDIA Grace, powered by 2,592 Arm® Neoverse V2 cores
Aggregate Memory | 1.4 TB | Up to 2.3 TB | Up to 30 TB
Total NVLink Bandwidth | 14.4 TB/s | 14.4 TB/s | 130 TB/s
FP4 Tensor Core | 144 PFLOPS | 144 PFLOPS | 1,440 PFLOPS
FP8/FP6 Tensor Core | 72 PFLOPS | 72 PFLOPS | 720 PFLOPS
INT8 Tensor Core | 72 POPS | 2 POPS | 720 POPS
FP16/BF16 Tensor Core | 36 PFLOPS | 36 PFLOPS | 360 PFLOPS
TF32 Tensor Core | 18 PFLOPS | 18 PFLOPS | 180 PFLOPS
FP32 | 600 TFLOPS | 600 TFLOPS | 5,760 TFLOPS
FP64 / FP64 Tensor Core | 296 TFLOPS | 10 TFLOPS | 2,880 TFLOPS

    Scale models to multi-trillion parameters with NVIDIA GB200 supercomputers

    The GB200 superchip forms the core of GB200 supercomputers, combining 1 Grace CPU and 2 Blackwell GPUs in a memory-coherent, unified memory space. The GB200 system comes in different versions, such as the GB200 NVL36 and GB200 NVL72, depending on the number of GPUs. Each rack can hold 9 or 18 GB200 compute node trays, depending on the design. These racks include cold plates and connections for liquid cooling, PCIe Gen 6 for fast networking, and NVLink connectors for seamless NVLink cable integration.

• GB200 NVL36 is one rack of 9x dual-GB200 (4 GPUs, 2 CPUs) compute trays and 9x NVSwitch trays
• GB200 NVL72 can be two racks, each with 9x dual-GB200 compute trays and 9x NVSwitch trays
• GB200 NVL72 can also be one rack of 18x dual-GB200 compute trays and 9x NVSwitch trays; the sketch below tallies the counts
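
As a quick sanity check on the naming, a few lines of Python reproduce the CPU and GPU counts implied by those tray configurations:

```python
# Rack math behind the NVL36 / NVL72 names: each dual-GB200 compute tray
# holds 2 GB200 superchips = 2 Grace CPUs + 4 Blackwell GPUs
def rack_counts(trays: int) -> dict:
    superchips = trays * 2
    return {"grace_cpus": superchips, "blackwell_gpus": superchips * 2}

print("NVL36 (9 trays):", rack_counts(9))    # {'grace_cpus': 18, 'blackwell_gpus': 36}
print("NVL72 (18 trays):", rack_counts(18))  # {'grace_cpus': 36, 'blackwell_gpus': 72}
```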

    Here’s why GB200 supercomputers are perfect to handle the complexity of large models:

1. AI performance advantage from a massive compute block: Equipped with the new Transformer Engine, fifth-generation NVLink, and 1.8 TB/s of GPU-to-GPU interconnect, the GB200 superchip delivers 4x faster training performance for large language models like GPT-MoE-1.8T. The superchip also pairs with InfiniBand networking and NVIDIA Magnum IO™ software to ensure efficient scaling of extensive GPU computing clusters with up to 576 GPUs.
    2. Grace CPU with superior LPDDR5X memory: The Grace CPU is a powerhouse of 144 ARM v9 Neoverse cores delivering up to 7.1 TFLOPS of performance and can access 960GB of LPDDR5X RAM at 1TB/s memory bandwidth. This ultra-fast, low-power memory accelerates transactions while maintaining data integrity through error correction code (ECC), making it suitable for critical workloads.

3. Blazing-fast CPU interconnect with simplified NUMA: powered by the 900GB/s NVLink-C2C interconnect, which is several times faster than a traditional PCIe interconnect.

Grace CPU Architecture

    Source: NVIDIA - Grace CPU Whitepaper


    Explore use cases for GB200 AI supercomputers

Here are some examples of use cases for supercomputers such as the NVIDIA GB200 NVL72:

• Unlocking training for trillion-parameter models: With state-of-the-art (SOTA) models increasingly featuring more than a trillion parameters, training a 1.8-trillion-parameter model on a GB200 NVL72 cluster is 4 times faster than on an equivalent H100 GPU cluster.
    NVIDIA GB200 Training Performance
    • Scalable inference for powerful generative AI models: A Menlo Ventures study of business leaders on adopting generative AI revealed that 96% of computing spend on generative AI goes towards inference, highlighting the importance of optimizing performance for better ROI. The two main factors to consider when implementing inference are size and speed - businesses aim to offer instant experiences to their users as they transform their products and services with AI, regardless of the size of their customer base.
    GB200 Inference Performance

This is where the GB200 NVL72 cluster becomes crucial, providing up to 30 times better inference performance at real-time speeds and bringing the scalability benefits of Blackwell's architecture to practical inference use cases in business and consumer applications.

• Seamless execution of Mixture of Experts (MoE) models: The massive aggregate memory of up to 13.5 TB in GB200 systems and the incredibly fast GPU interconnect help AI teams realize the potential of MoE models better than ever before. The visualization below shows how experts in an MoE model communicate with each other and across the model's layers. Without Blackwell's NVLink interconnect, NVIDIA estimates that GPUs would spend half their time on communication instead of computation (see the communication sketch after this list).
    GB200 NVL72 Specifications


• Superlative vector database and retrieval-augmented generation (RAG) performance: The Grace CPU's 960GB of memory and 900GB/s C2C link are ideal for accelerating RAG pipelines via low-latency vector search.

• Sustainable AI computing: Thanks to the energy savings from liquid cooling and the efficiency of the GB200 supercomputing system, the GB200 NVL72 is 25x more energy efficient than an equivalent NVIDIA H100 cluster.
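
Referring back to the MoE bullet above, here is a minimal PyTorch sketch of the expert-parallel exchange that makes the interconnect so critical (our illustration, not NVIDIA's stack): with experts sharded one per GPU, every rank ships its tokens to the ranks hosting their routed experts and receives results back, one all-to-all each way. Sizes are made up; launch with `torchrun --nproc_per_node=<gpus>`.

```python
import torch
import torch.distributed as dist

def dispatch(tokens_per_rank: torch.Tensor) -> torch.Tensor:
    """tokens_per_rank: (world, capacity, d_model); row i holds the tokens this
    rank routed to the expert living on rank i. Returns the tokens routed here."""
    received = torch.empty_like(tokens_per_rank)
    # all_to_all_single scatters chunk i to rank i and gathers one chunk from
    # every rank; on NVLink-connected GPUs this is where the interconnect matters
    dist.all_to_all_single(received.flatten(0, 1), tokens_per_rank.flatten(0, 1))
    return received

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    world, capacity, d_model = dist.get_world_size(), 16, 64

    outbound = torch.randn(world, capacity, d_model, device="cuda")
    local_batch = dispatch(outbound)       # tokens for this rank's expert
    expert_out = local_batch * 2           # stand-in for the local expert FFN
    results = dispatch(expert_out)         # second all-to-all sends results home
    print(f"rank {rank}: received {tuple(results.shape)}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```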

    Power your AI with Blackwell GPUs on Ori

Want to leverage the power of NVIDIA Blackwell GPUs for your AI-focused business? Ori's AI-native cloud is purpose-built for AI/ML workloads, featuring top-notch GPUs, performant storage, and AI-ready networking.

    Train and serve world-changing AI models on Ori! Reserve your Blackwell GPUs today!
