
A deep dive into NVIDIA’s Blackwell platform

Deepak Manoor
Posted: July 22, 2024

NVIDIA unveiled its latest GPU platform, Blackwell, at GTC earlier this year. The new platform, named after the pioneering mathematician and statistician David Blackwell, includes two powerful GPUs - the B100 and B200 - as well as the GB200 supercomputer series. In this blog post, we explore what makes Blackwell GPUs unique and how they can unleash the next wave of AI computing.

    What’s new in NVIDIA’s next generation of GPUs

AI Superchip: Each Blackwell superchip consists of two dies connected by a 10TB/s chip-to-chip (C2C) interconnect, coming together as a single GPU with full cache coherence. These dies, built with TSMC's custom 4NP fabrication process, feature a whopping 208 billion transistors, compared to the 80 billion transistors in Hopper.

The new NVIDIA Blackwell chips offer larger memory capacity for bigger models and more than double the memory bandwidth. This is crucial because a memory wall can prevent large AI models from taking full advantage of GPU processing power. Another key feature of Blackwell is its incredibly fast GPU-to-GPU connection, enabling multiple GPUs to work together as unified compute blocks. Here's the NVIDIA B200 vs H100 feature comparison:

Attribute | HGX H100 8-GPU | HGX B200 8-GPU
Form factor | 8x NVIDIA H100 SXM | 8x NVIDIA Blackwell SXM
Aggregate Memory Capacity | Up to 640GB | Up to 1.4TB
Aggregate Memory Bandwidth | Up to 27TB/s | Up to 62TB/s
Aggregate NVLink Interconnect | 7.2TB/s | 14.4TB/s
NVLink | 4th Gen | 5th Gen
NVSwitch | 3rd Gen | 4th Gen

2nd Gen Transformer Engine: features 5th Gen Tensor Cores that support new quantization formats and precisions. This engine greatly speeds up inference for Mixture of Experts (MoE) models by using dynamic range management and advanced microscaling formats. Dynamic range management allows the engine to adjust and refine numerical formats down to lower precisions, continuously optimizing models for better performance.
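
To make microscaling concrete, here is a minimal NumPy sketch of MX-style block quantization (our illustration, not NVIDIA's implementation): each block of 32 values shares a single power-of-two scale, and the elements are rounded to a small low-precision grid. For simplicity it uses a signed integer grid rather than the exact FP4 (E2M1) element format; `block` and `elem_bits` are illustrative parameters.

```python
import numpy as np

def mx_quantize(x, block=32, elem_bits=4):
    """Simulate MX-style microscaling: each block of `block` values shares one
    power-of-two scale (E8M0-like); elements round to a small signed grid."""
    pad = (-x.size) % block
    v = np.pad(x.astype(np.float32), (0, pad)).reshape(-1, block)
    qmax = 2 ** (elem_bits - 1) - 1                     # e.g. +/-7 levels for 4 bits
    amax = np.abs(v).max(axis=1, keepdims=True)         # per-block dynamic range
    scale = 2.0 ** np.ceil(np.log2(np.maximum(amax, 1e-38) / qmax))
    q = np.clip(np.rint(v / scale), -qmax, qmax)        # low-precision elements
    return (q * scale).reshape(-1)[: x.size]            # dequantized for comparison

x = np.random.randn(1 << 16).astype(np.float32)
print("mean abs error:", np.mean(np.abs(x - mx_quantize(x))))
```

The shared per-block scale is what preserves dynamic range: an outlier only degrades the values in its own block rather than the whole tensor.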

Precision type | HGX H100 8-GPU | HGX B200 8-GPU
Tensor Core precisions | FP64, TF32, BF16, FP16, FP8, INT8 | FP64, TF32, BF16, FP16, FP8, INT8, FP6, FP4
CUDA® Core precisions | FP64, FP32, FP16, BF16, INT8 | FP64, FP32, FP16, BF16
    B200 vs H100 Performance

    Source: NVIDIA - Blackwell HGX performance data

MoE models run inference significantly faster than equivalent dense models thanks to the efficiency of conditional computation and the sparsity that comes with expert parallelism. However, these models require more VRAM because the system must load all experts and their parameters into memory. In addition to the substantially higher memory capacity and bandwidth, Blackwell's lower-precision formats and microscaling help alleviate this problem by enabling larger models with more parameters to fit into GPUs.
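
To illustrate the conditional computation described above, here is a small NumPy sketch of top-k MoE routing (a toy with made-up sizes, not a specific NVIDIA model): all eight experts' weights stay resident in memory, yet each token is processed by only its two highest-scoring experts.

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts, d_model, top_k = 8, 64, 2
# Every expert must be loaded, even though each token uses only top_k of them
experts = [rng.standard_normal((d_model, d_model)) * 0.02 for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) * 0.02

def moe_forward(tokens):
    logits = tokens @ router                           # router scores per token
    top = np.argsort(logits, axis=1)[:, -top_k:]       # indices of top-k experts
    w = np.take_along_axis(logits, top, axis=1)
    w = np.exp(w) / np.exp(w).sum(axis=1, keepdims=True)  # softmax over chosen experts
    out = np.zeros_like(tokens)
    for e in range(n_experts):                         # only routed tokens touch expert e
        mask = (top == e)
        rows = mask.any(axis=1)
        if rows.any():
            out[rows] += (tokens[rows] @ experts[e]) * w[rows][mask[rows]][:, None]
    return out

print(moe_forward(rng.standard_normal((16, d_model))).shape)  # (16, 64)
```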

    B200 Inference Performance

    Source: NVIDIA - Blackwell HGX performance data

    This paper on Microscaling (MX) formats for generative AI discusses benchmark findings, showcasing the impressive potential of smaller precision formats for both training and inference, with only minor accuracy losses. As smaller precision formats advance, more ML developers are likely to embrace these innovations for model development.

    B200 Training Performance

    Source: NVIDIA - Blackwell HGX performance data

The new Transformer Engine also speeds up LLM training by enhancing the NVIDIA NeMo Framework and integrating expert parallelism techniques from Megatron-Core. We expect these advancements to pave the way for the first 10-trillion-parameter models.

5th Generation NVLink: At 1.8TB/s bidirectional throughput per GPU, this new generation of GPU-to-GPU interconnect is twice as fast as the previous generation and enables seamless high-speed communication among up to 576 GPUs. Accelerated in-network computation makes NCCL collective operations more efficient and helps GPUs reach synchronization faster. The latest generation of NVLink Switch (NVSwitch) enables multi-GPU clusters such as the GB200 NVL72 to reach an aggregate bandwidth of 130TB/s for large models.
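
To ground what an NCCL collective looks like from the developer's seat, here is a minimal PyTorch sketch of the all-reduce at the heart of data-parallel training (our example, assuming a multi-GPU node and a `torchrun --nproc_per_node=<gpus>` launch); NVLink and NVSwitch accelerate exactly this kind of exchange.

```python
import torch
import torch.distributed as dist

def main():
    # The NCCL backend routes collectives over NVLink where available
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    # Each GPU contributes a gradient-like tensor; all_reduce sums them in place
    # so every rank ends up with the identical result
    grad = torch.full((1024, 1024), float(rank), device="cuda")
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)

    print(f"rank {rank}: sum of ranks = {grad[0, 0].item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```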

    Confidential computing enhancements: The latest Blackwell GPUs now feature Trusted Execution Environment (TEE) technology. While CPUs have supported TEE for a long time to ensure data confidentiality and integrity in applications like content authentication and secure financial transactions, NVIDIA GPUs now also offer TEE-I/O capabilities. This means enhanced data protection through inline protection on NVLink connections. Additionally, Blackwell GPUs provide data encryption at rest, in motion, and during computation.

Superfast decompression for data analytics: Blackwell can decompress data at a blistering 800GB/s with formats such as LZ4, Snappy, and Deflate. Paired with the 8TB/s of HBM3e (High Bandwidth Memory) bandwidth and the lightning-fast NVLink-C2C interconnect to the Grace CPU, the GB200 makes the data pipeline extremely fast. NVIDIA's benchmarks on a GB200 cluster reveal 18x faster queries/sec than a traditional CPU and 6x faster than an H100 GPU, making GPUs increasingly suitable for data analytics and database workloads.
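
As a taste of what GPU-accelerated analytics looks like in code, here is a minimal sketch using RAPIDS cuDF (our example, not NVIDIA's benchmark suite); the file and column names are placeholders. Reading Snappy-compressed Parquet decompresses and parses entirely on the GPU, and the query never leaves GPU memory.

```python
import cudf  # RAPIDS cuDF: pandas-like DataFrames on the GPU

# Decompression and parsing both happen on the GPU, so the data never
# bounces through the CPU path (events.parquet is a hypothetical file,
# Snappy-compressed by default when written by most Parquet writers)
df = cudf.read_parquet("events.parquet")

# A typical analytics query, executed entirely in GPU memory
result = (
    df.groupby("user_id")["latency_ms"]
      .mean()
      .sort_values(ascending=False)
      .head(10)
)
print(result)
```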

    Reliability, availability and serviceability (RAS) engine: performs automatic, built-in tests on computational cores and memory in the Blackwell chip. This is especially important for large supercomputer clusters, as it allows teams to replace underperforming GPU boards and keep performance high while protecting their GPU investments. 


    Understanding the Blackwell GPU lineup: B200 vs B300 vs GB200

The NVIDIA Blackwell family of GPU-based systems comprises the HGX B100, HGX B200, HGX B300, DGX B200, and GB200 supercomputers such as the GB200 NVL36 and GB200 NVL72. The table below lists their specs and performance figures as provided by NVIDIA:

Attribute | HGX B200 | HGX B300 | GB200 NVL72
Form factor | 8x NVIDIA B200 SXM | 8x NVIDIA Blackwell Ultra SXM | GB200 boards (1 Grace CPU : 2 Blackwell GPUs each; 36 CPUs : 72 GPUs total)
CPU Platform | x86 | x86 | NVIDIA Grace, powered by 2,592 Arm® Neoverse V2 cores
Aggregate Memory | 1.4 TB | Up to 2.3 TB | Up to 30 TB
Total NVLink Bandwidth | 14.4 TB/s | 14.4 TB/s | 130 TB/s
FP4 Tensor Core | 144 PFLOPS | 144 PFLOPS | 1,440 PFLOPS
FP8/FP6 Tensor Core | 72 PFLOPS | 72 PFLOPS | 720 PFLOPS
INT8 Tensor Core | 72 POPS | 2 POPS | 720 POPS
FP16/BF16 Tensor Core | 36 PFLOPS | 36 PFLOPS | 360 PFLOPS
TF32 Tensor Core | 18 PFLOPS | 18 PFLOPS | 180 PFLOPS
FP32 | 600 TFLOPS | 600 TFLOPS | 5,760 TFLOPS
FP64 / FP64 Tensor Core | 296 TFLOPS | 10 TFLOPS | 2,880 TFLOPS

    Scale models to multi-trillion parameters with NVIDIA GB200 supercomputers

    The GB200 superchip forms the core of GB200 supercomputers, combining 1 Grace CPU and 2 Blackwell GPUs in a memory-coherent, unified memory space. The GB200 system comes in different versions, such as the GB200 NVL36 and GB200 NVL72, depending on the number of GPUs. Each rack can hold 9 or 18 GB200 compute node trays, depending on the design. These racks include cold plates and connections for liquid cooling, PCIe Gen 6 for fast networking, and NVLink connectors for seamless NVLink cable integration.

• GB200 NVL36 is one rack of 9x dual-GB200 (4 GPUs, 2 CPUs) compute trays and 9x NVSwitch trays
• GB200 NVL72 can be two racks, each with 9x dual-GB200 compute trays and 9x NVSwitch trays
• GB200 NVL72 can also be one rack of 18x dual-GB200 compute trays and 9x NVSwitch trays; the sketch below tallies the counts
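
As a quick sanity check on the naming, a few lines of Python reproduce the CPU and GPU counts implied by those tray configurations:

```python
# Rack math behind the NVL36 / NVL72 names: each dual-GB200 compute tray
# holds 2 GB200 superchips = 2 Grace CPUs + 4 Blackwell GPUs
def rack_counts(trays: int) -> dict:
    superchips = trays * 2
    return {"grace_cpus": superchips, "blackwell_gpus": superchips * 2}

print("NVL36 (9 trays):", rack_counts(9))    # {'grace_cpus': 18, 'blackwell_gpus': 36}
print("NVL72 (18 trays):", rack_counts(18))  # {'grace_cpus': 36, 'blackwell_gpus': 72}
```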

    Here’s why GB200 supercomputers are perfect to handle the complexity of large models:

1. AI performance advantage from a massive compute block: Equipped with the new Transformer Engine, fifth-generation NVLink, and 1.8 TB/s of GPU-to-GPU interconnect, the GB200 superchip delivers 4x faster training performance for large language models like GPT-MoE-1.8T. The superchip also pairs with InfiniBand networking and NVIDIA Magnum IO™ software to ensure efficient scaling of extensive GPU computing clusters with up to 576 GPUs.
    2. Grace CPU with superior LPDDR5X memory: The Grace CPU is a powerhouse of 144 ARM v9 Neoverse cores delivering up to 7.1 TFLOPS of performance and can access 960GB of LPDDR5X RAM at 1TB/s memory bandwidth. This ultra-fast, low-power memory accelerates transactions while maintaining data integrity through error correction code (ECC), making it suitable for critical workloads.

3. Blazing-fast CPU interconnect with simplified NUMA: powered by the 900GB/s NVLink-C2C interconnect, which is several times faster than a traditional PCIe interconnect.

Grace CPU Architecture

    Source: NVIDIA - Grace CPU Whitepaper


    Explore use cases for GB200 AI supercomputers

Here are some examples of use cases for supercomputers such as the NVIDIA GB200 NVL72:

• Unlocking training for trillion-parameter models: With state-of-the-art (SOTA) models increasingly featuring more than a trillion parameters, training a 1.8-trillion-parameter model on a GB200 NVL72 cluster is 4 times faster than on an equivalent H100 GPU cluster.
    NVIDIA GB200 Training Performance
    • Scalable inference for powerful generative AI models: A Menlo Ventures study of business leaders on adopting generative AI revealed that 96% of computing spend on generative AI goes towards inference, highlighting the importance of optimizing performance for better ROI. The two main factors to consider when implementing inference are size and speed - businesses aim to offer instant experiences to their users as they transform their products and services with AI, regardless of the size of their customer base.
    GB200 Inference Performance

This is where the GB200 NVL72 cluster becomes crucial, providing up to 30 times better inference performance at real-time speeds and bringing the scalability benefits of Blackwell's architecture to practical inference use cases in business and consumer applications.

• Seamless execution of Mixture of Experts (MoE) models: The massive aggregate memory of up to 13.5 TB in GB200 systems and the incredibly fast GPU interconnect help AI teams realize the potential of MoE models better than ever before. The visualization below shows how experts in an MoE model communicate with each other and across the model's layers. Without Blackwell's NVLink interconnect, NVIDIA estimates that GPUs would spend half their time on communication instead of computation (see the communication sketch after this list).
    GB200 NVL72 Specifications


• Superlative vector database and retrieval-augmented generation (RAG) performance: The Grace CPU's 960GB of memory and 900GB/s C2C link are ideal for accelerating RAG pipelines via low-latency vector search.

• Sustainable AI computing: Thanks to the energy savings from liquid cooling and the efficiency of the GB200 supercomputing system, the GB200 NVL72 is 25x more energy efficient than an equivalent NVIDIA H100 cluster.
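
Referring back to the MoE bullet above, here is a minimal PyTorch sketch of the expert-parallel exchange that makes the interconnect so critical (our illustration, not NVIDIA's stack): with experts sharded one per GPU, every rank ships its tokens to the ranks hosting their routed experts and receives results back, one all-to-all each way. Sizes are made up; launch with `torchrun --nproc_per_node=<gpus>`.

```python
import torch
import torch.distributed as dist

def dispatch(tokens_per_rank: torch.Tensor) -> torch.Tensor:
    """tokens_per_rank: (world, capacity, d_model); row i holds the tokens this
    rank routed to the expert living on rank i. Returns the tokens routed here."""
    received = torch.empty_like(tokens_per_rank)
    # all_to_all_single scatters chunk i to rank i and gathers one chunk from
    # every rank; on NVLink-connected GPUs this is where the interconnect matters
    dist.all_to_all_single(received.flatten(0, 1), tokens_per_rank.flatten(0, 1))
    return received

def main():
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank)
    world, capacity, d_model = dist.get_world_size(), 16, 64

    outbound = torch.randn(world, capacity, d_model, device="cuda")
    local_batch = dispatch(outbound)       # tokens for this rank's expert
    expert_out = local_batch * 2           # stand-in for the local expert FFN
    results = dispatch(expert_out)         # second all-to-all sends results home
    print(f"rank {rank}: received {tuple(results.shape)}")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```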

    Power your AI with Blackwell GPUs on Ori

Want to leverage the power of NVIDIA Blackwell GPUs for your AI-focused business? Ori's AI-native cloud is purpose-built for AI/ML workloads, featuring top-notch GPUs, performant storage, and AI-ready networking.

    Train and serve world-changing AI models on Ori! Reserve your Blackwell GPUs today!
