DDP vs DeepSpeed ZeRO-3: Understanding GPU utilization patterns for multi-GPU training with Slurm

Multi-GPU training is essential for fine-tuning large language models, but not all parallelization strategies deliver the same efficiency. While PyTorch's Distributed Data Parallel (DDP) is the default approach for many teams, our experiments revealed it can leave compute significantly underused, with GPUs sitting idle nearly half the time during synchronization.
We compared GPU utilization patterns between standard DDP and DeepSpeed ZeRO Stage 3 while fine-tuning a 20B parameter model on two NVIDIA H100 GPUs. The results were striking: DDP showed alternating GPU usage peaking at 60%, while DeepSpeed ZeRO-3 achieved sustained 100% utilization on both GPUs simultaneously. However, as we'll discuss, higher utilization doesn't always mean faster training—the relationship between GPU efficiency and actual runtime is more nuanced than it appears.
Here's a quick overview of our setup:
| Configuration | Details |
|---|---|
| GPUs | 2x NVIDIA H100 80GB |
| Model | openai/gpt-oss-20b (MXFP4-quantized) |
| Fine-tuning method | LoRA (rank=64, alpha=16) |
| Orchestration | Slurm workload manager |
| Monitoring | Prometheus + Grafana + DCGM Exporter |
The full implementation is available in our GitHub repository.
The DDP pattern: alternating GPU usage
With standard PyTorch DDP, we observed a distinctive alternating pattern in GPU utilization:
[Figure: GPU utilization with standard DDP. GPU 0 (green) and GPU 1 (yellow) show alternating peaks between 30% and 60%.]
| Metric | GPU 0 | GPU 1 |
|---|---|---|
| Peak Utilization | ~58% | ~38% |
| Memory Used | ~50 GB | ~50 GB |
| Pattern | Alternating | Alternating |
In DDP, each GPU holds a complete copy of the model and processes different batches. The alternating pattern emerges because the GPUs compute independently, then wait for each other during gradient synchronization, which creates significant idle time.
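For reference, this is roughly what the DDP side looks like in code. A minimal sketch, assuming the usual torchrun launch environment; `build_model()` and `dataloader` are hypothetical stand-ins, not the exact training script from our repository:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# One process per GPU; torchrun sets LOCAL_RANK for each of them.
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = build_model().cuda(local_rank)           # hypothetical helper: builds the full model replica
ddp_model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=2e-4)

for batch in dataloader:                         # hypothetical DataLoader with a DistributedSampler
    loss = ddp_model(**batch).loss
    loss.backward()    # gradient all-reduce happens here; both ranks block until it completes
    optimizer.step()
    optimizer.zero_grad()
```

The blocking all-reduce at the end of every backward pass is exactly where the idle time in the alternating pattern comes from.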
The DeepSpeed ZeRO-3 pattern: simultaneous usage
Switching to DeepSpeed ZeRO Stage 3 produced a dramatically different result:
[Figure: GPU utilization with DeepSpeed ZeRO-3. Both GPUs show sustained ~100% utilization simultaneously.]
| Metric | GPU 0 | GPU 1 |
|---|---|---|
| Peak Utilization | ~100% | ~100% |
| Memory Used | ~50 GB | ~50 GB |
| Pattern | Simultaneous | Simultaneous |
DeepSpeed ZeRO-3 shards model parameters across GPUs instead of replicating them. For each layer, the owning GPU broadcasts the weights to all GPUs, both GPUs compute, and the non-owners then discard the weights. The key is the combination of overlap_comm: true and stage3_prefetch_bucket_size (see the configuration below): while a GPU is computing layer N, DeepSpeed prefetches the parameters for layer N+1 in the background, hiding the communication and eliminating idle time.
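How the config reaches the training loop depends on the framework; with the Hugging Face Trainer, pointing TrainingArguments at the JSON file is enough. A minimal, hypothetical sketch (the filename, `model`, and `train_dataset` are placeholders, not taken from our repository):

```python
from transformers import Trainer, TrainingArguments

# The Trainer initializes DeepSpeed from the JSON file; sharding, prefetching and
# communication overlap are all driven by the ZeRO-3 settings shown in the next section.
args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    bf16=True,
    deepspeed="ds_zero3.json",   # hypothetical path to the ZeRO-3 config file
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```

Launching the script with one process per GPU (for example via torchrun or a Slurm job like the sbatch files below) is all DeepSpeed needs to start sharding.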
DeepSpeed ZeRO-3 configuration
Here's the configuration file that enables this behavior:
```json
{
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "none"
    },
    "offload_param": {
      "device": "none"
    },
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto",
    "stage3_prefetch_bucket_size": "auto",
    "stage3_param_persistence_threshold": "auto",
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": 8,
  "gradient_clipping": 0.3,
  "train_micro_batch_size_per_gpu": 1
}
```
How to reproduce these results on H100 GPUs
Pre-requisites
Create a GPU virtual machine on Ori Global Cloud with 2x NVIDIA H100SXM 80GB GPUs. We recommend using the init script so NVIDIA CUDA drivers and frameworks are preinstalled.
Use SSH tunneling to access Grafana dashboards when firewall rules block direct port access.
Step 1: SSH into your VM and clone the repository
```bash
git clone https://github.com/ori-edge/slurm-ml-pipelines.git
cd slurm-ml-pipelines
```
Step 2: Start the monitoring stack
```bash
cd monitoring
docker-compose up -d
```
This deploys Prometheus, Grafana, and DCGM Exporter for GPU metrics collection.
Step 3: Run DDP training and observe the pattern
```bash
sbatch ml-pipeline/jobs/03_training.sbatch
```
Step 4: Run DeepSpeed ZeRO-3 training and compare
```bash
sbatch ml-pipeline/jobs/03_training_deepspeed.sbatch
```
Step 5: Access Grafana via SSH tunnel
```bash
ssh -L 3000:localhost:3000 ubuntu@<VM_IP> -i ~/.ssh/id_rsa -N
```
Open http://localhost:3000 in your browser to view real-time GPU utilization.
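Grafana is the easiest way to see the patterns, but you can also pull the raw numbers straight from Prometheus (forward port 9090 the same way if you go this route). A small sketch, assuming Prometheus is reachable on localhost:9090 and DCGM Exporter is publishing its standard DCGM_FI_DEV_GPU_UTIL metric:

```python
import requests

# Average GPU utilization per GPU over the last 5 minutes, as reported by DCGM Exporter.
query = "avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m])"
resp = requests.get("http://localhost:9090/api/v1/query", params={"query": query})
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    gpu = series["metric"].get("gpu", "?")   # DCGM Exporter labels each series with the GPU index
    value = float(series["value"][1])
    print(f"GPU {gpu}: {value:.1f}% average utilization")
```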
The critical caveat: utilization ≠ runtime when you run small training jobs
Here's where it gets interesting. Despite the dramatic difference in GPU utilization, actual training time was nearly identical:
| Metric | DDP | DeepSpeed ZeRO-3 |
|---|---|---|
| Training Start | 16:30:23 | 09:10:32 |
| Training End | 16:49:13 | 09:28:31 |
| Actual Training Time | ~19 minutes | ~18 minutes |
| GPU Utilization | ~50% | ~100% |
How can 100% utilization produce similar runtime to 50%? Three factors explain this:
1. Communication overhead
ZeRO-3's parameter sharding requires gathering and scattering weights for every layer. Even with overlap_comm: true, this continuous communication adds latency that offsets the utilization gains.
2. Small dataset
Our training used only 321 samples over 60 steps. This isn't enough compute to amortize DeepSpeed's initialization and per-layer communication costs. The overhead is fixed, but the benefit scales with workload size.
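For a sense of scale, here is the back-of-the-envelope arithmetic, assuming the batch settings from the DeepSpeed config above apply (micro-batch 1 per GPU, 2 GPUs, 8 gradient accumulation steps):

```python
samples = 321
micro_batch_per_gpu, gpus, grad_accum = 1, 2, 8

effective_batch = micro_batch_per_gpu * gpus * grad_accum  # 16 samples per optimizer step
steps_per_epoch = samples // effective_batch               # ~20 optimizer steps per epoch
epochs_for_run = 60 / steps_per_epoch                      # 60 steps is roughly 3 epochs

print(effective_batch, steps_per_epoch, round(epochs_for_run, 1))  # 16 20 3.0
```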
3. LoRA fine-tuning
With LoRA, only ~1% of parameters are trainable. ZeRO-3's parameter sharding benefits are minimal because there's little to shard. The frozen base model weights don't participate in gradient computation.
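You can check the trainable fraction directly with a quick parameter count. A sketch assuming a PEFT LoRA setup matching the table above (rank=64, alpha=16); `base_model` is the already-loaded gpt-oss-20b and is assumed to be in scope:

```python
from peft import LoraConfig, get_peft_model

# Attach LoRA adapters with the experiment's settings; the base weights stay frozen.
lora_config = LoraConfig(r=64, lora_alpha=16, task_type="CAUSAL_LM")
model = get_peft_model(base_model, lora_config)  # base_model: assumed to be loaded already

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} of {total:,} parameters ({100 * trainable / total:.2f}%)")
```

PEFT's model.print_trainable_parameters() reports the same breakdown if you prefer the built-in helper.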
When does ZeRO-3 actually speed up training?
ZeRO-3's utilization advantage translates to actual speedup when:
| Scenario | Why ZeRO-3 Wins |
|---|---|
| Full fine-tuning | All parameters trainable = significant sharding benefit |
| Large datasets | Thousands of steps amortize initialization overhead |
| Memory-constrained | ZeRO-3 enables training that won't fit with DDP |
| Longer sequences | More compute per communication round |
For our LoRA experiment, the higher utilization represents better hardware efficiency but not faster training. Think of it like a car engine running at higher RPM but with more air resistance—more power output, but similar speed.
Why the utilization difference still matters
Even without runtime gains in our specific experiment, understanding utilization patterns is valuable:
| Aspect | DDP | DeepSpeed ZeRO-3 |
|---|---|---|
| GPU 0 Utilization | ~58% peak | ~100% sustained |
| GPU 1 Utilization | ~38% peak | ~100% sustained |
| Idle Time | Significant | Minimal |
| Effective Compute | ~50% | ~100% |
Higher utilization means your GPUs are doing useful work instead of waiting. For production workloads with full fine-tuning and large datasets, this efficiency gap directly translates to faster training and lower costs. We'll cover full fine-tuning on a large dataset in the next blog post.
Why is memory usage similar?
You might expect ZeRO-3 to use less memory since it shards parameters. In our case, activations (intermediate outputs for backpropagation) dominate memory usage at ~25-30 GB, and these are not sharded by ZeRO-3. Since we're using a pre-quantized model with LoRA, the memory savings from parameter sharding are offset by activation memory.
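If you want to verify this on your own runs, PyTorch's per-device memory counters make the picture visible around a single training step. A minimal sketch; `model` and `batch` are assumed to be in scope, and attributing the bulk of the difference to activations is our interpretation rather than something PyTorch reports directly:

```python
import torch

torch.cuda.reset_peak_memory_stats()

loss = model(**batch).loss   # forward pass: activations for backpropagation are materialized here
after_forward = torch.cuda.memory_allocated() / 1e9

loss.backward()              # backward consumes activations and produces gradients
peak = torch.cuda.max_memory_allocated() / 1e9

print(f"after forward: {after_forward:.1f} GB, peak during the step: {peak:.1f} GB")
```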
When to use each approach
Use DDP when:
- Using LoRA or other parameter-efficient fine-tuning methods
- Training on small datasets (hundreds to low thousands of samples)
- Network bandwidth between GPUs is limited
- You prioritize simplicity and don't need memory optimization
Use DeepSpeed ZeRO-3 when:
- Doing full fine-tuning with all parameters trainable
- Training on large datasets (tens of thousands+ samples)
- Model doesn't fit in GPU memory with DDP
- Running long training jobs where initialization overhead is amortized
Run multi-GPU training on Ori GPU Instances and Supercomputers
Utilization patterns only matter when they’re visible and repeatable, not obscured by infrastructure noise. Ori GPU Instances and Supercomputers are designed to keep those signals clear. You get direct access to the latest NVIDIA GPUs and predictable performance, so training behavior reflects your parallelization strategy rather than hidden abstraction layers.
GPU Instances provide a fast, clean way to run controlled experiments, while Supercomputers let you extend the same workflows to larger, production-scale runs. In both cases, you can run multi-GPU jobs consistently and observe real utilization with standard tools like DCGM, Prometheus, and Grafana.
If your goal is to train models with less waste, more control, and visibility you can trust, Ori offers a straightforward foundation: modern GPUs and an execution environment that scales cleanly from first experiments to large-scale training.
