
A Comprehensive Guide to Running Slurm on Ori GPU Instances

Run Slurm on Ori GPU Instances with this step-by-step guide covering installation, configuration, GPU scheduling, job accounting, monitoring, and production-ready patterns for AI workloads
Adrian Matei
Posted: January 23, 2026

    Modern AI teams don’t just need GPUs; they need a reliable way to allocate, isolate, schedule, and audit them across users and workloads. That’s exactly why Slurm remains the workload manager of choice for everything from research labs to production AI platforms: it gives you deterministic resource control, GPU-aware scheduling, dependency-driven pipelines, and rich job accounting.

    Pair Slurm with the right infrastructure, and you get a practical foundation for repeatable AI execution at scale. Ori’s virtual machines, known as GPU Instances, are purpose-built for AI workloads: fast to provision, designed for high utilization, and ideal for reproducible environments where you want the operational familiarity of VMs with the performance characteristics AI jobs demand. Running Slurm on an Ori GPU Instance lets you stand up a clean scheduling layer quickly: perfect for proofs of concept, team sandboxes, and even production patterns that start small and scale out.

    This guide provides a complete, step-by-step walkthrough for installing and configuring single-node Slurm on an Ori VM, turning it into a robust, self-managed job scheduling environment. It covers installation, configuration, job submission, monitoring, management, and troubleshooting, with all commands and configs included end to end.

    Prerequisites

    VM Specifications:

    • OS: Ubuntu 24.04 LTS
    • CPUs: 48 cores
    • RAM: 483 GB
    • GPUs: 2x NVIDIA H100 SXM 80GB
    • Storage: Sufficient space for Slurm state and logs

    Network Requirements:

    • SSH access to the VM
    • Hostname resolution configured

    Installation Steps

    1. System Update and Dependencies

    Update the system and install required packages:

    sudo apt update && sudo apt upgrade -y

    sudo apt install -y \
      build-essential \
      git \
      wget \
      curl \
      munge \
      libmunge-dev \
      mariadb-client \
      libmariadb-dev \
      libhwloc-dev \
      libjson-c-dev \
      libhttp-parser-dev \
      libyaml-dev \
      libjwt-dev \
      libdbus-1-dev \
      python3 \
      python3-pip

    2. MUNGE Authentication

    MUNGE provides authentication between Slurm components. All nodes in the cluster must have the same MUNGE key.

    Generate MUNGE Key

    Check whether a MUNGE key already exists:

    ls -la /etc/munge/munge.key

    If the key doesn't exist, generate it manually:

    # Create MUNGE key directory if it doesn't exist
    sudo mkdir -p /etc/munge

    # Generate a new MUNGE key (use dd's of= so the file is written as root;
    # a shell redirect after sudo would run as your unprivileged user)
    sudo dd if=/dev/urandom of=/etc/munge/munge.key bs=1 count=1024

    # Alternative: use create-munge-key if available
    sudo create-munge-key -f

    Set Correct Permissions

    MUNGE is very strict about file permissions. Set the correct ownership and permissions:

    # Set ownership to munge user
    sudo chown munge:munge /etc/munge/munge.key

    # Set restrictive permissions (only munge user can read)
    sudo chmod 400 /etc/munge/munge.key

    # Verify permissions
    ls -la /etc/munge/munge.key
    # Expected output: -r-------- 1 munge munge 1024 <date> /etc/munge/munge.key

    Verify MUNGE Key

    Check that the key is valid and has the correct format:

    # Verify key is readable by munge
    sudo -u munge cat /etc/munge/munge.key > /dev/null && echo "Key is readable" || echo "Key is NOT readable"

    Start and Enable MUNGE

    # Enable MUNGE to start on boot
    sudo systemctl enable munge

    # Start MUNGE service
    sudo systemctl start munge

    # Verify MUNGE is running
    sudo systemctl status munge
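
    To confirm authentication works end to end, you can round-trip a credential: munge encodes one and unmunge decodes and validates it. Both utilities ship with the munge package installed earlier.

    # Encode a credential and immediately decode it;
    # "STATUS: Success (0)" in the output means MUNGE is working
    munge -n | unmunge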

    Create Slurm user:

    sudo groupadd -g 64030 slurm
    sudo useradd -u 64030 -g slurm -s /bin/bash -d /var/lib/slurm slurm
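
    A quick sanity check that the account was created with the intended IDs:

    # Should report uid=64030(slurm) gid=64030(slurm)
    id slurm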

    3. Slurm Installation

    Download and compile Slurm from source:

    cd /tmp
    wget https://download.schedmd.com/slurm/slurm-24.05.3.tar.bz2
    tar -xjf slurm-24.05.3.tar.bz2
    cd slurm-24.05.3

    ./configure \
      --prefix=/usr \
      --sysconfdir=/etc/slurm \
      --with-munge \
      --with-hwloc \
      --with-json \
      --with-http-parser \
      --with-yaml \
      --with-jwt \
      --enable-pam

    make -j$(nproc)
    sudo make install

    Verify installation:

    slurmctld --version
    # Output: slurm 24.05.3

    4. Configuration

    Create Slurm Directories

    sudo mkdir -p /etc/slurm /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm
    sudo chown -R slurm:slurm /etc/slurm /var/spool/slurm /var/log/slurm

    Configuration Files

    /etc/slurm/slurm.conf - Main Slurm configuration:

    # Slurm Configuration File for Single-Node Setup
    ClusterName=ori-slurm-poc
    SlurmctldHost=virtual-machine

    # Authentication
    AuthType=auth/munge
    CryptoType=crypto/munge

    # Scheduling
    SchedulerType=sched/backfill
    SelectType=select/cons_tres
    SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE

    # Logging
    SlurmctldDebug=info
    SlurmctldLogFile=/var/log/slurm/slurmctld.log
    SlurmdDebug=info
    SlurmdLogFile=/var/log/slurm/slurmd.log

    # Process Tracking
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/cgroup,task/affinity

    # GRES (GPU) support
    GresTypes=gpu

    # State preservation
    StateSaveLocation=/var/spool/slurm/ctld
    SlurmdSpoolDir=/var/spool/slurm/d

    # Timeouts
    SlurmctldTimeout=300
    SlurmdTimeout=300
    InactiveLimit=0
    MinJobAge=300
    KillWait=30
    Waittime=0

    # Job Defaults
    DefMemPerCPU=2048

    # Accounting
    AccountingStorageType=accounting_storage/slurmdbd
    AccountingStorageHost=localhost
    AccountingStoragePort=6819
    AccountingStorageEnforce=associations
    JobAcctGatherType=jobacct_gather/linux
    JobAcctGatherFrequency=30

    # Node Definitions (adjust CPUs and RealMemory based on your VM)
    NodeName=virtual-machine CPUs=48 RealMemory=483000 Gres=gpu:h100:2 State=UNKNOWN

    # Partition Definitions
    PartitionName=gpu Nodes=virtual-machine Default=YES MaxTime=INFINITE State=UP OverSubscribe=NO
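
    The CPUs, RealMemory, and Gres values in the NodeName line must match the hardware Slurm detects, or the node will be flagged with an invalid configuration. Rather than guessing, you can ask slurmd to print the detected hardware in slurm.conf syntax and adapt the NodeName line from its output:

    # Print this machine's hardware as a slurm.conf-style node definition
    sudo slurmd -C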

    /etc/slurm/gres.conf - GPU resource configuration:

    # GPU Resource Configuration
    AutoDetect=nvml
    NodeName=virtual-machine Name=gpu Type=h100 File=/dev/nvidia0
    NodeName=virtual-machine Name=gpu Type=h100 File=/dev/nvidia1
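
    Before moving on, it’s worth confirming that the GPU count and device paths in gres.conf match what the driver reports:

    # List the GPUs the NVIDIA driver sees
    nvidia-smi -L

    # Confirm the device files referenced above exist
    ls -l /dev/nvidia0 /dev/nvidia1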

    /etc/slurm/cgroup.conf - Resource isolation to prevent jobs using resources allocated to other jobs:

    # Cgroup Configuration for Resource Isolation
    ConstrainCores=yes
    ConstrainDevices=yes
    ConstrainRAMSpace=yes
    ConstrainSwapSpace=yes
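
    Ubuntu 24.04 boots with cgroup v2 by default, which Slurm 24.05 supports. If you want to confirm which cgroup version the VM is actually running:

    # Prints "cgroup2fs" on a cgroup v2 system
    stat -fc %T /sys/fs/cgroup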

    Systemd Service Files

    /etc/systemd/system/slurmctld.service - Controller daemon:

    [Unit]
    Description=Slurm controller daemon
    After=network.target munge.service
    Requires=munge.service

    [Service]
    Type=simple
    User=root
    Group=root
    ExecStart=/usr/sbin/slurmctld -D
    ExecReload=/bin/kill -HUP $MAINPID
    KillMode=process
    LimitNOFILE=131072
    LimitMEMLOCK=infinity
    LimitSTACK=infinity
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target

    /etc/systemd/system/slurmd.service - Compute node daemon:

    [Unit]
    Description=Slurm node daemon
    After=network.target munge.service
    Requires=munge.service

    [Service]
    Type=simple
    User=root
    Group=root
    ExecStart=/usr/sbin/slurmd -D
    ExecReload=/bin/kill -HUP $MAINPID
    KillMode=process
    LimitNOFILE=131072
    LimitMEMLOCK=infinity
    LimitSTACK=infinity
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target

    5. Database Setup for Job Accounting

    Install MariaDB and configure the Slurm Database Daemon (slurmdbd) for persistent job tracking.

    Install MariaDB

    sudo apt install -y mariadb-server mariadb-client
    sudo systemctl enable mariadb
    sudo systemctl start mariadb

    Create Slurm Accounting Database

    sudo mysql -e "CREATE DATABASE slurm_acct_db;"
    sudo mysql -e "CREATE USER 'slurm'@'localhost' IDENTIFIED BY 'slurmdbpass';"
    sudo mysql -e "GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost';"
    sudo mysql -e "FLUSH PRIVILEGES;"
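
    For anything beyond light use, SchedMD's accounting documentation recommends raising a few InnoDB limits before slurmdbd first populates the database. A minimal sketch, assuming the stock Ubuntu MariaDB layout (the filename is an arbitrary choice; any file under that directory is read):

    # /etc/mysql/mariadb.conf.d/99-slurm.cnf  (hypothetical filename)
    [mysqld]
    innodb_buffer_pool_size=4096M
    innodb_log_file_size=64M
    innodb_lock_wait_timeout=900

    Restart MariaDB afterwards (sudo systemctl restart mariadb) so the settings take effect.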

    Configure slurmdbd

    /etc/slurm/slurmdbd.conf:

    # Slurm Database Daemon Configuration
    AuthType=auth/munge
    DbdHost=localhost
    DebugLevel=info
    LogFile=/var/log/slurm/slurmdbd.log
    PidFile=/var/run/slurmdbd.pid
    SlurmUser=slurm

    # Database connection
    StorageType=accounting_storage/mysql
    StorageHost=localhost
    StoragePort=3306
    StorageUser=slurm
    StoragePass=slurmdbpass
    StorageLoc=slurm_acct_db

    Set permissions

    sudo chown slurm:slurm /etc/slurm/slurmdbd.conf
    sudo chmod 600 /etc/slurm/slurmdbd.conf

    /etc/systemd/system/slurmdbd.service:

    [Unit]
    Description=Slurm Database Daemon
    After=network.target munge.service mariadb.service
    Requires=munge.service mariadb.service

    [Service]
    Type=simple
    User=slurm
    Group=slurm
    ExecStart=/usr/sbin/slurmdbd -D
    ExecReload=/bin/kill -HUP $MAINPID
    KillMode=process
    LimitNOFILE=131072
    Restart=on-failure

    [Install]
    WantedBy=multi-user.target

    Starting Services

    Start all Slurm services in the correct order:

    # Reload systemd
    sudo systemctl daemon-reload

    # Start slurmdbd (database daemon)
    sudo systemctl enable slurmdbd
    sudo systemctl start slurmdbd

    # Start slurmctld (controller)
    sudo systemctl enable slurmctld
    sudo systemctl start slurmctld

    # Start slurmd (compute node)
    sudo systemctl enable slurmd
    sudo systemctl start slurmd

    # Verify services
    sudo systemctl status slurmdbd
    sudo systemctl status slurmctld
    sudo systemctl status slurmd
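
    Before configuring accounting, a quick check that the controller is actually answering requests:

    # Reports the up/down status of slurmctld
    scontrol ping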

    Set Up Accounting

    # Add cluster to accounting
    sudo sacctmgr -i add cluster ori-slurm-poc

    # Create default account
    sudo sacctmgr -i add account default Description='Default Account' Organization='Ori'

    # Add user to account
    sudo sacctmgr -i add user ubuntu Account=default

    # Verify
    sacctmgr list associations
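
    If registration succeeded, the cluster also appears in slurmdbd's records:

    # Confirm the cluster is registered with the accounting database
    sacctmgr show cluster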

    Activate the Node

    # Update node state
    sudo scontrol update nodename=virtual-machine state=resume

    # Clear any error messages
    sudo scontrol update nodename=virtual-machine reason="Node operational"

    # Verify cluster status
    sinfo
    scontrol show node virtual-machine

    Expected output: sinfo should show the gpu partition up with virtual-machine in the idle state, and scontrol show node should report State=IDLE.

    Job Submission

    Single Job Submission

    Script: ~/scripts/submit-3min-job.sh

    This script creates and submits a 3-minute test job that uses both GPUs.

    #!/bin/bash
    # Script to create and submit a 3-minute test job

    TIMESTAMP=$(date +%s)
    JOB_NAME="test-job-${TIMESTAMP}"
    JOB_FILE="/tmp/${JOB_NAME}.sbatch"

    echo "Creating job script: ${JOB_NAME}"

    cat > ${JOB_FILE} << "JOBEND"
    #!/bin/bash
    #SBATCH --job-name=test-3min
    #SBATCH --output=/tmp/test-3min-%j.out
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=8
    #SBATCH --gres=gpu:h100:2
    #SBATCH --time=00:10:00
    #SBATCH --account=default

    echo "=== Job Information ==="
    echo "Job ID: $SLURM_JOB_ID"
    echo "Job Name: $SLURM_JOB_NAME"
    echo "Node: $SLURMD_NODENAME"
    echo "GPUs allocated: $CUDA_VISIBLE_DEVICES"
    echo "CPUs allocated: $SLURM_CPUS_PER_TASK"
    echo "Start time: $(date)"
    echo ""

    echo "=== GPU Information ==="
    nvidia-smi --query-gpu=index,name,memory.total --format=csv
    echo ""

    echo "=== Running for 3 minutes ==="
    for i in {1..180}; do
      if [ $((i % 30)) -eq 0 ]; then
        echo "Progress: $i/180 seconds"
      fi
      sleep 1
    done

    echo ""
    echo "=== Job Complete ==="
    echo "End time: $(date)"
    JOBEND

    echo "Submitting job to Slurm..."
    sbatch ${JOB_FILE}

    Usage:

    chmod +x ~/scripts/submit-3min-job.sh
    ~/scripts/submit-3min-job.sh

    Dependent Job Submission

    Script: ~/scripts/submit-dependent-jobs.sh

    This script demonstrates job chaining where Job B waits for Job A to complete successfully.

    #!/bin/bash
    # Script to submit two jobs with dependency: Job B starts after Job A completes

    echo "=== Creating Job A (runs for 1 minute) ==="

    cat > /tmp/job-a.sbatch << "JOBA"
    #!/bin/bash
    #SBATCH --job-name=job-A
    #SBATCH --output=/tmp/job-a-%j.out
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=4
    #SBATCH --gres=gpu:h100:1
    #SBATCH --time=00:10:00
    #SBATCH --account=default

    echo "=== Job A Started ==="
    echo "Job ID: $SLURM_JOB_ID"
    echo "Start time: $(date)"
    echo "Running for 1 minute..."

    for i in {1..60}; do
      if [ $((i % 15)) -eq 0 ]; then
        echo "Job A progress: $i/60 seconds"
      fi
      sleep 1
    done

    echo "Job A completed at: $(date)"
    JOBA

    echo "=== Creating Job B (runs for 2 minutes, depends on Job A) ==="

    cat > /tmp/job-b.sbatch << "JOBB"
    #!/bin/bash
    #SBATCH --job-name=job-B
    #SBATCH --output=/tmp/job-b-%j.out
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=4
    #SBATCH --gres=gpu:h100:1
    #SBATCH --time=00:10:00
    #SBATCH --account=default

    echo "=== Job B Started ==="
    echo "Job ID: $SLURM_JOB_ID"
    echo "Start time: $(date)"
    echo "Running for 2 minutes..."

    for i in {1..120}; do
      if [ $((i % 30)) -eq 0 ]; then
        echo "Job B progress: $i/120 seconds"
      fi
      sleep 1
    done

    echo "Job B completed at: $(date)"
    JOBB

    echo ""
    echo "=== Submitting Job A ==="
    JOB_A_OUTPUT=$(sbatch /tmp/job-a.sbatch)
    JOB_A_ID=$(echo $JOB_A_OUTPUT | awk '{print $4}')
    echo "Job A submitted: $JOB_A_OUTPUT"
    echo "Job A ID: $JOB_A_ID"

    echo ""
    echo "=== Submitting Job B (depends on Job A completing) ==="
    JOB_B_OUTPUT=$(sbatch --dependency=afterok:$JOB_A_ID /tmp/job-b.sbatch)
    JOB_B_ID=$(echo $JOB_B_OUTPUT | awk '{print $4}')
    echo "Job B submitted: $JOB_B_OUTPUT"
    echo "Job B ID: $JOB_B_ID"

    echo ""
    echo "=== Job Dependency Summary ==="
    echo "Job A (ID: $JOB_A_ID) - Will run immediately"
    echo "Job B (ID: $JOB_B_ID) - Will wait for Job A to complete successfully"
    echo ""
    echo "To monitor: squeue"
    echo "Job A output: /tmp/job-a-$JOB_A_ID.out"
    echo "Job B output: /tmp/job-b-$JOB_B_ID.out"

    Usage:

    chmod +x ~/scripts/submit-dependent-jobs.sh
    ~/scripts/submit-dependent-jobs.sh

    Dependency Types:

    • --dependency=afterok:JOBID - Start after the job completes successfully (exit code 0)
    • --dependency=after:JOBID - Start once the job has begun execution
    • --dependency=afternotok:JOBID - Start only if the job fails
    • --dependency=afterany:JOBID - Start after the job ends, regardless of state
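
    The dependent-jobs script above extracts the job ID by piping sbatch's "Submitted batch job N" message through awk. A sturdier pattern is sbatch's --parsable flag, which prints just the job ID (plus the cluster name, semicolon-separated, on multi-cluster setups). A minimal sketch using the same job files:

    # Submit a two-step chain without parsing human-readable output
    JOB_A_ID=$(sbatch --parsable /tmp/job-a.sbatch)
    sbatch --dependency=afterok:${JOB_A_ID} /tmp/job-b.sbatch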

    Monitoring and Management

    Check Job Queue

    # View running and pending jobs
    squeue

    # Detailed queue information
    squeue -l

    # Custom format
    squeue -o '%.8i %.12j %.10u %.8T %.10M %.6D %.20b %R'

    Check Job History

    # View completed jobs (today)
    sacct

    # View all jobs from specific date
    sacct --starttime=2026-01-05

    # Show only main jobs (no steps)
    sacct -X --format=JobID,JobName,State,AllocCPUS,Elapsed,Start,End

    # Specific job details
    sacct -j <job_id> --format=JobID,JobName,State,AllocCPUS,AllocTRES,Elapsed
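
    sacct covers jobs that have finished; for a job that is still running, sstat reports live usage collected by the JobAcctGather plugin enabled earlier. Note that sstat addresses job steps, so a batch job is queried as <job_id>.batch:

    # Live CPU and memory usage for a running batch job
    sstat -j <job_id>.batch --format=JobID,AveCPU,AveRSS,MaxRSS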

    Cluster Status

    # View partition status
    sinfo

    # Detailed node information
    scontrol show node virtual-machine

    # Show partition details
    scontrol show partition gpu

    Job Control

    # Cancel a job
    scancel <job_id>

    # Hold a pending job (prevent from starting)
    scontrol hold <job_id>

    # Release a held job
    scontrol release <job_id>

    # Suspend a running job (pause)
    scontrol suspend <job_id>

    # Resume a suspended job
    scontrol resume <job_id>

    View Job Output

    # Tail output file while job is running
    tail -f /tmp/test-3min-<job_id>.out

    # View completed job output
    cat /tmp/test-3min-<job_id>.out
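
    For debugging, it is often faster to work inside an allocation interactively than to iterate through batch scripts. srun accepts the same resource flags as sbatch and can drop you into a shell on the node:

    # Open an interactive shell with one GPU for up to 30 minutes
    srun --gres=gpu:h100:1 --cpus-per-task=4 --time=00:30:00 --account=default --pty bash

    Inside that shell, nvidia-smi shows only the allocated GPU, because ConstrainDevices=yes in cgroup.conf hides the rest.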


    If you run into any issues during setup or operation, this troubleshooting guide walks you through the most common Slurm pitfalls and how to resolve them.

    Summary

    You now have a fully functional Slurm cluster on a single Ori VM with:

    • Controller and compute functionality
    • GPU scheduling (2x H100)
    • Job accounting with MariaDB
    • Job dependency support
    • Historical job tracking

    Quick Command Reference

    # Submit job
    sbatch myjob.sbatch

    # Monitor jobs
    squeue

    # View history
    sacct --starttime=today

    # Check cluster
    sinfo

    # Cancel job
    scancel <job_id>

    # Node status
    scontrol show node

    Leverage the power of Slurm on Ori

    Slurm gives you the operational backbone for GPU compute: consistent scheduling semantics, enforceable isolation, and the ability to build dependable pipelines with accounting that stands up to audits and cost attribution. Ori GPU Instances make that experience even more practical: you get a clean VM-based environment for Slurm that’s easy to provision, easy to reproduce, and powerful enough to run serious AI workloads (including multi-GPU jobs).

    If your goal is to run AI jobs with less waste, more control, and clearer visibility, Slurm on Ori is a straightforward path: proven scheduling, modern GPU infrastructure, and a setup you can take from “first node” to cluster-scale patterns with minimal friction.

