A Comprehensive Guide to Running Slurm on Ori GPU Instances

Modern AI teams don’t just need GPUs; they need a reliable way to allocate, isolate, schedule, and audit them across users and workloads. That’s exactly why Slurm remains the workload manager of choice for everything from research labs to production AI platforms: it gives you deterministic resource control, GPU-aware scheduling, dependency-driven pipelines, and rich job accounting.
Pair Slurm with the right infrastructure, and you get a practical foundation for repeatable AI execution at scale. Ori’s Virtual Machines, known as GPU Instances, are purpose-built for AI workloads: fast to provision, designed for high utilization, and ideal for reproducible environments where you want the operational familiarity of VMs with the performance characteristics AI jobs demand. Running Slurm on an Ori GPU Instance lets you stand up a clean scheduling layer quickly, perfect for proofs of concept, team sandboxes, and even production patterns that start small and scale out.
This guide provides a complete, step-by-step walkthrough for installing and configuring single-node Slurm on an Ori VM, turning it into a robust, self-managed job scheduling environment. It covers installation, configuration, job submission, monitoring, management, and troubleshooting, with all commands and configs included end to end.
Prerequisites
VM Specifications:
- OS: Ubuntu 24.04 LTS
- CPUs: 48 cores
- RAM: 483 GB
- GPUs: 2x NVIDIA H100 SXM 80GB
- Storage: Sufficient space for Slurm state and logs
Network Requirements:
- SSH access to the VM
- Hostname resolution configured (a quick check is shown below)
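Slurm is picky about hostname resolution, and the configuration later in this guide assumes the VM is named virtual-machine. A quick sanity check (adjust the hostname to your own VM) looks like this:
# Show the VM's hostname (this guide assumes "virtual-machine")
hostnamectl --static

# Confirm the hostname resolves locally; if this returns nothing,
# add a line such as "127.0.1.1 virtual-machine" to /etc/hosts
getent hosts "$(hostname)"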
Installation Steps
1. System Update and Dependencies
Update the system and install required packages:
sudo apt update && sudo apt upgrade -y

sudo apt install -y \
  build-essential \
  git \
  wget \
  curl \
  munge \
  libmunge-dev \
  mariadb-client \
  libmariadb-dev \
  libhwloc-dev \
  libjson-c-dev \
  libhttp-parser-dev \
  libyaml-dev \
  libjwt-dev \
  libdbus-1-dev \
  python3 \
  python3-pip
2. MUNGE Authentication
MUNGE provides authentication between Slurm components. All nodes in the cluster must have the same MUNGE key.
Generate MUNGE Key
Check if MUNGE key already exists:
ls -la /etc/munge/munge.key
If the key doesn't exist, generate it manually:
# Create MUNGE key directory if it doesn't exist
sudo mkdir -p /etc/munge

# Generate a new MUNGE key (write it via dd's of= so the file is created as root)
sudo dd if=/dev/urandom of=/etc/munge/munge.key bs=1 count=1024

# Alternative: Use create-munge-key if available
sudo create-munge-key -f
Set Correct Permissions
MUNGE is very strict about file permissions. Set the correct ownership and permissions:
# Set ownership to munge user
sudo chown munge:munge /etc/munge/munge.key

# Set restrictive permissions (only munge user can read)
sudo chmod 400 /etc/munge/munge.key

# Verify permissions
ls -la /etc/munge/munge.key
# Expected output: -r-------- 1 munge munge 1024 <date> /etc/munge/munge.key
Verify MUNGE Key
Check that the key is valid and has correct format:
# Verify key is readable by munge
sudo -u munge cat /etc/munge/munge.key > /dev/null && echo "Key is readable" || echo "Key is NOT readable"
Start and Enable MUNGE
# Enable MUNGE to start on boot
sudo systemctl enable munge

# Start MUNGE service
sudo systemctl start munge

# Verify MUNGE is running
sudo systemctl status munge
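Beyond checking the service status, the standard way to confirm MUNGE can actually encode and decode credentials is a quick round-trip test; it fails immediately if the key or its permissions are wrong:
# Encode a credential and decode it again; look for "Success (0)" in the output
munge -n | unmunge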
Create Slurm user:
sudo groupadd -g 64030 slurm
sudo useradd -u 64030 -g slurm -s /bin/bash -d /var/lib/slurm slurm
3. Slurm Installation
Download and compile Slurm from source:
cd /tmp
wget https://download.schedmd.com/slurm/slurm-24.05.3.tar.bz2
tar -xjf slurm-24.05.3.tar.bz2
cd slurm-24.05.3

./configure \
  --prefix=/usr \
  --sysconfdir=/etc/slurm \
  --with-munge \
  --with-hwloc \
  --with-json \
  --with-http-parser \
  --with-yaml \
  --with-jwt \
  --enable-pam

make -j$(nproc)
sudo make install
Verify installation:
slurmctld --version
# Output: slurm 24.05.3
4. Configuration
Create Slurm Directories
sudo mkdir -p /etc/slurm /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm
sudo chown -R slurm:slurm /etc/slurm /var/spool/slurm /var/log/slurm
Configuration Files
/etc/slurm/slurm.conf - Main Slurm configuration:
# Slurm Configuration File for Single-Node Setup
ClusterName=ori-slurm-poc
SlurmctldHost=virtual-machine

# Authentication
AuthType=auth/munge
CryptoType=crypto/munge

# Scheduling
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE

# Logging
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log

# Process Tracking
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity

# GRES (GPU) support
GresTypes=gpu

# State preservation
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurm/d

# Timeouts
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0

# Job Defaults
DefMemPerCPU=2048

# Accounting
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=localhost
AccountingStoragePort=6819
AccountingStorageEnforce=associations
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30

# Node Definitions (adjust CPUs and RealMemory based on your VM)
NodeName=virtual-machine CPUs=48 RealMemory=483000 Gres=gpu:h100:2 State=UNKNOWN

# Partition Definitions
PartitionName=gpu Nodes=virtual-machine Default=YES MaxTime=INFINITE State=UP OverSubscribe=NO
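The CPUs and RealMemory values in the node definition must match what slurmd detects on the VM, or the node can end up in an invalid state. A convenient way to get the right numbers is to ask slurmd itself and copy them into the NodeName line (the GPU count still comes from gres.conf below):
# Print the hardware configuration slurmd detects (CPUs, sockets, RealMemory, ...)
slurmd -C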
/etc/slurm/gres.conf - GPU resource configuration:
# GPU Resource Configuration
AutoDetect=nvml
NodeName=virtual-machine Name=gpu Type=h100 File=/dev/nvidia0
NodeName=virtual-machine Name=gpu Type=h100 File=/dev/nvidia1
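The File= entries above must correspond to real device nodes, and the GPU count must match the Gres= value in slurm.conf. A quick check on the instance before starting slurmd:
# List the GPUs the driver sees (expect two H100s here)
nvidia-smi -L

# Confirm the device files referenced in gres.conf exist
ls -l /dev/nvidia0 /dev/nvidia1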
/etc/slurm/cgroup.conf - Resource isolation to prevent jobs from using resources allocated to other jobs:
# Cgroup Configuration for Resource Isolation
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
Systemd Service Files
/etc/systemd/system/slurmctld.service - Controller daemon:
[Unit]
Description=Slurm controller daemon
After=network.target munge.service
Requires=munge.service

[Service]
Type=simple
User=root
Group=root
ExecStart=/usr/sbin/slurmctld -D
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Restart=on-failure

[Install]
WantedBy=multi-user.target
/etc/systemd/system/slurmd.service - Compute node daemon:
[Unit]
Description=Slurm node daemon
After=network.target munge.service
Requires=munge.service

[Service]
Type=simple
User=root
Group=root
ExecStart=/usr/sbin/slurmd -D
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Restart=on-failure

[Install]
WantedBy=multi-user.target
5. Database Setup for Job Accounting
Install MariaDB and configure Slurm Database Daemon (slurmdbd) for persistent job tracking.
Install MariaDB
sudo apt install -y mariadb-server mariadb-client
sudo systemctl enable mariadb
sudo systemctl start mariadb
Create Slurm Accounting Database
sudo mysql -e "CREATE DATABASE slurm_acct_db;"
sudo mysql -e "CREATE USER 'slurm'@'localhost' IDENTIFIED BY 'slurmdbpass';"
sudo mysql -e "GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost';"
sudo mysql -e "FLUSH PRIVILEGES;"
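Before pointing slurmdbd at the database, it's worth confirming the slurm account can actually log in with the credentials you just created (and, outside of a throwaway PoC, pick a stronger password than slurmdbpass):
# Verify the slurm DB user can connect and see its database
mysql -u slurm -pslurmdbpass -e "SHOW DATABASES;"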
Configure slurmdbd
/etc/slurm/slurmdbd.conf:
# Slurm Database Daemon Configuration
AuthType=auth/munge
DbdHost=localhost
DebugLevel=info
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
SlurmUser=slurm

# Database connection
StorageType=accounting_storage/mysql
StorageHost=localhost
StoragePort=3306
StorageUser=slurm
StoragePass=slurmdbpass
StorageLoc=slurm_acct_db
Set permissions
sudo chown slurm:slurm /etc/slurm/slurmdbd.conf
sudo chmod 600 /etc/slurm/slurmdbd.conf
/etc/systemd/system/slurmdbd.service:
[Unit]
Description=Slurm Database Daemon
After=network.target munge.service mariadb.service
Requires=munge.service mariadb.service

[Service]
Type=simple
User=slurm
Group=slurm
ExecStart=/usr/sbin/slurmdbd -D
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
LimitNOFILE=131072
Restart=on-failure

[Install]
WantedBy=multi-user.target
Starting Services
Start all Slurm services in the correct order:
# Reload systemd
sudo systemctl daemon-reload

# Start slurmdbd (database daemon)
sudo systemctl enable slurmdbd
sudo systemctl start slurmdbd

# Start slurmctld (controller)
sudo systemctl enable slurmctld
sudo systemctl start slurmctld

# Start slurmd (compute node)
sudo systemctl enable slurmd
sudo systemctl start slurmd

# Verify services
sudo systemctl status slurmdbd
sudo systemctl status slurmctld
sudo systemctl status slurmd
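With all three daemons running, a quick end-to-end check is to ask the controller whether it is responding and whether it can see the node; both commands are read-only:
# Confirm slurmctld is reachable
scontrol ping

# Confirm the controller sees the node and the gpu partition
sinfo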
Setup Accounting
# Add cluster to accounting
sudo sacctmgr -i add cluster ori-slurm-poc

# Create default account
sudo sacctmgr -i add account default Description='Default Account' Organization='Ori'

# Add user to account
sudo sacctmgr -i add user ubuntu Account=default

# Verify
sacctmgr list associations
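Because slurm.conf sets AccountingStorageEnforce=associations, only users with an association (like the one created above) can submit jobs. If you later want per-user limits as well, sacctmgr can set them; a hypothetical example is below, noting that enforcement additionally requires adding limits to AccountingStorageEnforce:
# Example only: cap the ubuntu user at 4 concurrent jobs
# (enforced only if AccountingStorageEnforce includes "limits")
sudo sacctmgr -i modify user where name=ubuntu set MaxJobs=4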
Activate the Node
# Update node state
sudo scontrol update nodename=virtual-machine state=resume

# Clear any error messages
sudo scontrol update nodename=virtual-machine reason="Node operational"

# Verify cluster status
sinfo
scontrol show node virtual-machine
Expected output:
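The node should come back as idle; sinfo output will look roughly like this (exact values depend on your VM):
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
gpu*         up   infinite      1   idle virtual-machine
As a final smoke test, you can run a short command through the scheduler and confirm it lands on a GPU; a minimal example using the partition and GRES names configured above:
# Run nvidia-smi through Slurm on one allocated GPU
srun --partition=gpu --gres=gpu:h100:1 --time=00:02:00 nvidia-smi -L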

Job Submission
Single Job Submission
Script: ~/scripts/submit-3min-job.sh
This script creates and submits a 3-minute test job that uses both GPUs.
#!/bin/bash
# Script to create and submit a 3-minute test job

TIMESTAMP=$(date +%s)
JOB_NAME="test-job-${TIMESTAMP}"
JOB_FILE="/tmp/${JOB_NAME}.sbatch"

echo "Creating job script: ${JOB_NAME}"

cat > ${JOB_FILE} << "JOBEND"
#!/bin/bash
#SBATCH --job-name=test-3min
#SBATCH --output=/tmp/test-3min-%j.out
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:h100:2
#SBATCH --time=00:10:00
#SBATCH --account=default

echo "=== Job Information ==="
echo "Job ID: $SLURM_JOB_ID"
echo "Job Name: $SLURM_JOB_NAME"
echo "Node: $SLURMD_NODENAME"
echo "GPUs allocated: $CUDA_VISIBLE_DEVICES"
echo "CPUs allocated: $SLURM_CPUS_PER_TASK"
echo "Start time: $(date)"
echo ""

echo "=== GPU Information ==="
nvidia-smi --query-gpu=index,name,memory.total --format=csv
echo ""

echo "=== Running for 3 minutes ==="
for i in {1..180}; do
  if [ $((i % 30)) -eq 0 ]; then
    echo "Progress: $i/180 seconds"
  fi
  sleep 1
done

echo ""
echo "=== Job Complete ==="
echo "End time: $(date)"
JOBEND

echo "Submitting job to Slurm..."
sbatch ${JOB_FILE}
Usage:
chmod +x ~/scripts/submit-3min-job.sh
~/scripts/submit-3min-job.sh
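Batch scripts are the usual workflow, but for debugging it's often handy to get an interactive shell on the allocated resources instead; a minimal sketch using the same partition and GRES names as above:
# Request 1 GPU and 8 CPUs interactively; exiting the shell releases the allocation
srun --partition=gpu --gres=gpu:h100:1 --cpus-per-task=8 --time=01:00:00 --pty bash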
Dependent Job Submission
Script: ~/scripts/submit-dependent-jobs.sh
This script demonstrates job chaining where Job B waits for Job A to complete successfully.
#!/bin/bash
# Script to submit two jobs with dependency: Job B starts after Job A completes

echo "=== Creating Job A (runs for 1 minute) ==="

cat > /tmp/job-a.sbatch << "JOBA"
#!/bin/bash
#SBATCH --job-name=job-A
#SBATCH --output=/tmp/job-a-%j.out
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:h100:1
#SBATCH --time=00:10:00
#SBATCH --account=default

echo "=== Job A Started ==="
echo "Job ID: $SLURM_JOB_ID"
echo "Start time: $(date)"
echo "Running for 1 minute..."

for i in {1..60}; do
  if [ $((i % 15)) -eq 0 ]; then
    echo "Job A progress: $i/60 seconds"
  fi
  sleep 1
done

echo "Job A completed at: $(date)"
JOBA

echo "=== Creating Job B (runs for 2 minutes, depends on Job A) ==="

cat > /tmp/job-b.sbatch << "JOBB"
#!/bin/bash
#SBATCH --job-name=job-B
#SBATCH --output=/tmp/job-b-%j.out
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:h100:1
#SBATCH --time=00:10:00
#SBATCH --account=default

echo "=== Job B Started ==="
echo "Job ID: $SLURM_JOB_ID"
echo "Start time: $(date)"
echo "Running for 2 minutes..."

for i in {1..120}; do
  if [ $((i % 30)) -eq 0 ]; then
    echo "Job B progress: $i/120 seconds"
  fi
  sleep 1
done

echo "Job B completed at: $(date)"
JOBB

echo ""
echo "=== Submitting Job A ==="
JOB_A_OUTPUT=$(sbatch /tmp/job-a.sbatch)
JOB_A_ID=$(echo $JOB_A_OUTPUT | awk '{print $4}')
echo "Job A submitted: $JOB_A_OUTPUT"
echo "Job A ID: $JOB_A_ID"

echo ""
echo "=== Submitting Job B (depends on Job A completing) ==="
JOB_B_OUTPUT=$(sbatch --dependency=afterok:$JOB_A_ID /tmp/job-b.sbatch)
JOB_B_ID=$(echo $JOB_B_OUTPUT | awk '{print $4}')
echo "Job B submitted: $JOB_B_OUTPUT"
echo "Job B ID: $JOB_B_ID"

echo ""
echo "=== Job Dependency Summary ==="
echo "Job A (ID: $JOB_A_ID) - Will run immediately"
echo "Job B (ID: $JOB_B_ID) - Will wait for Job A to complete successfully"
echo ""
echo "To monitor: squeue"
echo "Job A output: /tmp/job-a-$JOB_A_ID.out"
echo "Job B output: /tmp/job-b-$JOB_B_ID.out"
Usage:
chmod +x ~/scripts/submit-dependent-jobs.sh
~/scripts/submit-dependent-jobs.sh
Dependency Types:
- --dependency=afterok:JOBID - Start after the job completes successfully (exit code 0)
- --dependency=after:JOBID - Start once the job has begun execution (or was cancelled)
- --dependency=afternotok:JOBID - Start only if the job fails
- --dependency=afterany:JOBID - Start after the job ends (any state)
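These flags make it straightforward to chain multi-stage pipelines. Below is a sketch of a hypothetical three-stage chain (the .sbatch file names are placeholders), using sbatch --parsable so each submission returns just the job ID:
# Hypothetical pipeline: each stage starts only if the previous one succeeded
PREP_ID=$(sbatch --parsable preprocess.sbatch)
TRAIN_ID=$(sbatch --parsable --dependency=afterok:$PREP_ID train.sbatch)
EVAL_ID=$(sbatch --parsable --dependency=afterok:$TRAIN_ID evaluate.sbatch)
echo "Pipeline submitted: $PREP_ID -> $TRAIN_ID -> $EVAL_ID"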
Monitoring and Management
Check Job Queue
# View running and pending jobs
squeue

# Detailed queue information
squeue -l

# Custom format
squeue -o '%.8i %.12j %.10u %.8T %.10M %.6D %.20b %R'
Check Job History
# View completed jobs (today)
sacct

# View all jobs from specific date
sacct --starttime=2026-01-05

# Show only main jobs (no steps)
sacct -X --format=JobID,JobName,State,AllocCPUS,Elapsed,Start,End

# Specific job details
sacct -j <job_id> --format=JobID,JobName,State,AllocCPUS,AllocTRES,Elapsed
Cluster Status
# View partition status
sinfo

# Detailed node information
scontrol show node virtual-machine

# Show partition details
scontrol show partition gpu
Job Control
# Cancel a job
scancel <job_id>

# Hold a pending job (prevent from starting)
scontrol hold <job_id>

# Release a held job
scontrol release <job_id>

# Suspend a running job (pause)
scontrol suspend <job_id>

# Resume a suspended job
scontrol resume <job_id>
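You can also adjust a pending job in place instead of cancelling and resubmitting it; for example, scontrol update can change its time limit (the value here is illustrative):
# Shorten a pending job's time limit
scontrol update JobId=<job_id> TimeLimit=00:30:00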
View Job Output
# Tail output file while job is running
tail -f /tmp/test-3min-<job_id>.out

# View completed job output
cat /tmp/test-3min-<job_id>.out
If you run into any issues during setup or operation, this troubleshooting guide walks you through the most common Slurm pitfalls and how to resolve them.
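When something does go wrong, the log locations configured earlier are usually the fastest place to look, along with the node's Reason field:
# Daemon logs (paths set in the configs above)
sudo tail -n 50 /var/log/slurm/slurmctld.log
sudo tail -n 50 /var/log/slurm/slurmd.log
sudo tail -n 50 /var/log/slurm/slurmdbd.log

# If the node is drained or down, the Reason field explains why
scontrol show node virtual-machine | grep -i reason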
Summary
You now have a fully functional Slurm cluster on a single Ori VM with:
- Controller and compute functionality
- GPU scheduling (2x H100)
- Job accounting with MariaDB
- Job dependency support
- Historical job tracking
Quick Command Reference
# Submit job
sbatch myjob.sbatch

# Monitor jobs
squeue

# View history
sacct --starttime=today

# Check cluster
sinfo

# Cancel job
scancel <job_id>

# Node status
scontrol show node
Leverage the power of Slurm on Ori
Slurm gives you the operational backbone for GPU compute: consistent scheduling semantics, enforceable isolation, and the ability to build dependable pipelines with accounting that stands up to audits and cost attribution. Ori GPU Instances make that experience even more practical: you get a clean VM-based environment for Slurm that’s easy to provision, easy to reproduce, and powerful enough to run serious AI workloads (including multi-GPU jobs).
If your goal is to run AI jobs with less waste, more control, and clearer visibility, Slurm on Ori is a straightforward path: proven scheduling, modern GPU infrastructure, and a setup you can take from “first node” to cluster-scale patterns with minimal friction.
