DDP vs DeepSpeed ZeRO-3: Understanding GPU utilization patterns for multi-GPU training with Slurm

Multi-GPU training is essential for fine-tuning large language models, but not all parallelization strategies deliver the same efficiency. While PyTorch's Distributed Data Parallel (DDP) is the default approach for many teams, our experiments revealed it can leave compute significantly underused, with GPUs sitting idle nearly half the time during synchronization.
We compared GPU utilization patterns between standard DDP and DeepSpeed ZeRO Stage 3 while fine-tuning a 20B parameter model on two NVIDIA H100 GPUs. The results were striking: DDP showed alternating GPU usage peaking at 60%, while DeepSpeed ZeRO-3 achieved sustained 100% utilization on both GPUs simultaneously. However, as we'll discuss, higher utilization doesn't always mean faster training—the relationship between GPU efficiency and actual runtime is more nuanced than it appears.
Here's a quick overview of our setup:
| Configuration | Details |
|---|---|
| GPUs | 2x NVIDIA H100 80GB |
| Model | openai/gpt-oss-20b (MXFP4-quantized) |
| Fine-tuning method | LoRA (rank=64, alpha=16) |
| Orchestration | Slurm workload manager |
| Monitoring | Prometheus + Grafana + DCGM Exporter |
The full implementation is available in our GitHub repository.
The DDP pattern: alternating GPU usage
With standard PyTorch DDP, we observed a distinctive alternating pattern in GPU utilization:
[Figure: GPU utilization with standard DDP. GPU 0 (green) and GPU 1 (yellow) show alternating peaks between 30% and 60%.]
| Metric | GPU 0 | GPU 1 |
|---|---|---|
| Peak Utilization | ~58% | ~38% |
| Memory Used | ~50 GB | ~50 GB |
| Pattern | Alternating | Alternating |
In DDP, each GPU holds a complete copy of the model and processes different batches. The alternating pattern emerges because the GPUs compute independently, then wait for each other during gradient synchronization, which creates significant idle time.
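For reference, this is roughly what the DDP side looks like in code. A minimal sketch, assuming the usual torchrun launch environment; `build_model()` and `dataloader` are hypothetical stand-ins, not the exact training script from our repository:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU; torchrun sets LOCAL_RANK for each of them.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = build_model().cuda(local_rank)           # hypothetical helper: builds the full model replica
ddp_model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=2e-4)

for batch in dataloader:                         # hypothetical DataLoader with a DistributedSampler
    loss = ddp_model(**batch).loss
    loss.backward()    # gradient all-reduce happens here; both ranks block until it completes
    optimizer.step()
    optimizer.zero_grad()
```

The blocking all-reduce at the end of every backward pass is exactly where the idle time in the alternating pattern comes from.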
The DeepSpeed ZeRO-3 pattern: simultaneous usage
Switching to DeepSpeed ZeRO Stage 3 produced a dramatically different result:
[Figure: GPU utilization with DeepSpeed ZeRO-3. Both GPUs show sustained ~100% utilization simultaneously.]
| Metric | GPU 0 | GPU 1 |
|---|---|---|
| Peak Utilization | ~100% | ~100% |
| Memory Used | ~50 GB | ~50 GB |
| Pattern | Simultaneous | Simultaneous |
DeepSpeed ZeRO-3 shards model parameters across GPUs instead of replicating them. For each layer, the owning GPU broadcasts the weights to all GPUs, both GPUs compute, and the non-owners then discard the weights. The key is the combination of overlap_comm: true and stage3_prefetch_bucket_size (see the configuration below): while a GPU is computing layer N, DeepSpeed prefetches the parameters for layer N+1 in the background, hiding the communication and eliminating idle time.
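How the config reaches the training loop depends on the framework; with the Hugging Face Trainer, pointing TrainingArguments at the JSON file is enough. A minimal, hypothetical sketch (the filename, `model`, and `train_dataset` are placeholders, not taken from our repository):

```python
from transformers import Trainer, TrainingArguments

# The Trainer initializes DeepSpeed from the JSON file; sharding, prefetching and
# communication overlap are all driven by the ZeRO-3 settings shown in the next section.
args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    deepspeed="ds_zero3.json",   # hypothetical path to the ZeRO-3 config file
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```

Launching the script with one process per GPU (for example via torchrun or a Slurm job like the sbatch files below) is all DeepSpeed needs to start sharding.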
DeepSpeed ZeRO-3 configuration
Here's the configuration file that enables this behavior:
```json
{
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "none"
    },
    "offload_param": {
      "device": "none"
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": 8,
  "gradient_clipping": 0.3,
  "train_micro_batch_size_per_gpu": 1
}
```
How to reproduce these results on H100 GPUs
Pre-requisites
Create a GPU virtual machine on Ori Global Cloud with 2x NVIDIA H100SXM 80GB GPUs. We recommend using the init script so NVIDIA CUDA drivers and frameworks are preinstalled.
Use SSH tunneling to access Grafana dashboards when firewall rules block direct port access.
Step 1: SSH into your VM and clone the repository
```bash
git clone https://github.com/ori-edge/slurm-ml-pipelines.git
cd slurm-ml-pipelines
```
Step 2: Start the monitoring stack
```bash
cd monitoring
docker-compose up -d
```
This deploys Prometheus, Grafana, and DCGM Exporter for GPU metrics collection.
Step 3: Run DDP training and observe the pattern
```bash
sbatch ml-pipeline/jobs/03_training.sbatch
```
Step 4: Run DeepSpeed ZeRO-3 training and compare
```bash
sbatch ml-pipeline/jobs/03_training_deepspeed.sbatch
```
Step 5: Access Grafana via SSH tunnel
```bash
ssh -L 3000:localhost:3000 ubuntu@<VM_IP> -i ~/.ssh/id_rsa -N
```
Open http://localhost:3000 in your browser to view real-time GPU utilization.
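Grafana is the easiest way to see the patterns, but you can also pull the raw numbers straight from Prometheus (forward port 9090 the same way if you go this route). A small sketch, assuming Prometheus is reachable on localhost:9090 and DCGM Exporter is publishing its standard DCGM_FI_DEV_GPU_UTIL metric:

```python
import requests

# Average GPU utilization per GPU over the last 5 minutes, as reported by DCGM Exporter.
query = "avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m])"
resp = requests.get("http://localhost:9090/api/v1/query", params={"query": query})
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    gpu = series["metric"].get("gpu", "?")   # DCGM Exporter labels each series with the GPU index
    value = float(series["value"][1])
    print(f"GPU {gpu}: {value:.1f}% average utilization")
```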
The critical caveat: utilization ≠ runtime when you run small training jobs
Here's where it gets interesting. Despite the dramatic difference in GPU utilization, actual training time was nearly identical:
| Metric | DDP | DeepSpeed ZeRO-3 |
|---|---|---|
| Training Start | 16:30:23 | 09:10:32 |
| Training End | 16:49:13 | 09:28:31 |
| Actual Training Time | ~19 minutes | ~18 minutes |
| GPU Utilization | ~50% | ~100% |
How can 100% utilization produce similar runtime to 50%? Three factors explain this:
1. Communication overhead
ZeRO-3's parameter sharding requires gathering and scattering weights for every layer. Even with overlap_comm: true, this continuous communication adds latency that offsets the utilization gains.
2. Small dataset
Our training used only 321 samples over 60 steps. This isn't enough compute to amortize DeepSpeed's initialization and per-layer communication costs. The overhead is fixed, but the benefit scales with workload size.
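For a sense of scale, here is the back-of-the-envelope arithmetic, assuming the batch settings from the DeepSpeed config above apply (micro-batch 1 per GPU, 2 GPUs, 8 gradient accumulation steps):

```python
samples = 321
micro_batch_per_gpu, gpus, grad_accum = 1, 2, 8

effective_batch = micro_batch_per_gpu * gpus * grad_accum  # 16 samples per optimizer step
steps_per_epoch = samples // effective_batch               # ~20 optimizer steps per epoch
epochs_for_run = 60 / steps_per_epoch                      # 60 steps is roughly 3 epochs

print(effective_batch, steps_per_epoch, round(epochs_for_run, 1))  # 16 20 3.0
```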
3. LoRA fine-tuning
With LoRA, only ~1% of parameters are trainable. ZeRO-3's parameter sharding benefits are minimal because there's little to shard. The frozen base model weights don't participate in gradient computation.
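You can check the trainable fraction directly with a quick parameter count. A sketch assuming a PEFT LoRA setup matching the table above (rank=64, alpha=16); `base_model` is the already-loaded gpt-oss-20b and is assumed to be in scope:

```python
from peft import LoraConfig, get_peft_model

# Attach LoRA adapters with the experiment's settings; the base weights stay frozen.
lora_config = LoraConfig(r=64, lora_alpha=16, task_type="CAUSAL_LM")
model = get_peft_model(base_model, lora_config)  # base_model: assumed to be loaded already

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} of {total:,} parameters ({100 * trainable / total:.2f}%)")
```

PEFT's model.print_trainable_parameters() reports the same breakdown if you prefer the built-in helper.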
When does ZeRO-3 actually speed up training?
ZeRO-3's utilization advantage translates to actual speedup when:
| Scenario | Why ZeRO-3 Wins |
|---|---|
| Full fine-tuning | All parameters trainable = significant sharding benefit |
| Large datasets | Thousands of steps amortize initialization overhead |
| Memory-constrained | ZeRO-3 enables training that won't fit with DDP |
| Longer sequences | More compute per communication round |
For our LoRA experiment, the higher utilization represents better hardware efficiency but not faster training. Think of it like a car engine running at higher RPM but with more air resistance—more power output, but similar speed.
Why the utilization difference still matters
Even without runtime gains in our specific experiment, understanding utilization patterns is valuable:
| Aspect | DDP | DeepSpeed ZeRO-3 |
|---|---|---|
| GPU 0 Utilization | ~58% peak | ~100% sustained |
| GPU 1 Utilization | ~38% peak | ~100% sustained |
| Idle Time | Significant | Minimal |
| Effective Compute | ~50% | ~100% |
Higher utilization means your GPUs are doing useful work instead of waiting. For production workloads with full fine-tuning and large datasets, this efficiency gap directly translates to faster training and lower costs. We'll cover full fine-tuning on a large dataset in the next blog post.
Why is memory usage similar?
You might expect ZeRO-3 to use less memory since it shards parameters. In our case, activations (intermediate outputs for backpropagation) dominate memory usage at ~25-30 GB, and these are not sharded by ZeRO-3. Since we're using a pre-quantized model with LoRA, the memory savings from parameter sharding are offset by activation memory.
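If you want to verify this on your own runs, PyTorch's per-device memory counters make the picture visible around a single training step. A minimal sketch; `model` and `batch` are assumed to be in scope, and attributing the bulk of the difference to activations is our interpretation rather than something PyTorch reports directly:

```python
import torch

torch.cuda.reset_peak_memory_stats()

loss = model(**batch).loss   # forward pass: activations for backpropagation are materialized here
after_forward = torch.cuda.memory_allocated() / 1e9

loss.backward()              # backward consumes activations and produces gradients
peak = torch.cuda.max_memory_allocated() / 1e9

print(f"after forward: {after_forward:.1f} GB, peak during the step: {peak:.1f} GB")
```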
When to use each approach
Use DDP when:
- Using LoRA or other parameter-efficient fine-tuning methods
- Training on small datasets (hundreds to low thousands of samples)
- Network bandwidth between GPUs is limited
- You prioritize simplicity and don't need memory optimization
Use DeepSpeed ZeRO-3 when:
- Doing full fine-tuning with all parameters trainable
- Training on large datasets (tens of thousands+ samples)
- Model doesn't fit in GPU memory with DDP
- Running long training jobs where initialization overhead is amortized
Run multi-GPU training on Ori GPU Instances and Supercomputers
Utilization patterns only matter when they’re visible and repeatable, not obscured by infrastructure noise. Ori GPU Instances and Supercomputers are designed to keep those signals clear. You get direct access to the latest NVIDIA GPUs and predictable performance, so training behavior reflects your parallelization strategy rather than hidden abstraction layers.
GPU Instances provide a fast, clean way to run controlled experiments, while Supercomputers let you extend the same workflows to larger, production-scale runs. In both cases, you can run multi-GPU jobs consistently and observe real utilization with standard tools like DCGM, Prometheus, and Grafana.
If your goal is to train models with less waste, more control, and visibility you can trust, Ori offers a straightforward foundation: modern GPUs and an execution environment that scales cleanly from first experiments to large-scale training.
