
OpenAI catalyzed the generative AI movement with the release of ChatGPT in 2022. In the years since, ChatGPT models have achieved remarkable scale, now being used by over 700 million users weekly. At the same time, open-source models like Llama, DeepSeek, and Qwen have risen as compelling alternatives to proprietary systems. Marking a major shift, OpenAI has now returned to open source for the first time since GPT-2 in 2019, unveiling two new open weight models: gpt-oss-120b and gpt-oss-20b.
Here’s a brief overview of gpt-oss’ key specifications:
| OpenAI gpt-oss-120b and gpt-oss-20b large language models | |
|---|---|
| Architecture | Mixture-of-experts (MoE) with Rotary Positional Embedding (RoPE) for positional encoding; native MXFP4 quantization for memory efficiency |
| Tool Calling & Agents | Tool (function) calling works via both the Responses and Chat Completions APIs; supports the Agents SDK and the Harmony response format |
| Model Variants | gpt-oss-120b: 117B parameters, of which 5.1B are active; gpt-oss-20b: 21B parameters, of which 3.6B are active |
| Context Length | 128k tokens |
Performance benchmarks indicate that gpt-oss-120b surpasses popular open-source models such as Qwen3 235B and Llama 4 Maverick in intelligence, and it ranks among the fastest models for output speed.

Source: Artificial Analysis
How to run gpt-oss-120b with vLLM on H100 GPUs
Prerequisites to self-host gpt-oss
Create a GPU virtual machine (VM) on Ori Global Cloud. We recommend 2x NVIDIA H100 GPUs so you can use the maximum context window while keeping GPU utilization healthy. Unfortunately, to run gpt-oss-120b on a single H100 without further quantization, you'll need to raise GPU memory utilization to 95% and shrink the context window to 1,024 tokens (a single-GPU launch sketch follows Step 4 below).
We chose vLLM to run the models because it provides flexible parallelism across 2, 4, or 8 GPUs to scale throughput efficiently. It also features high-performance attention and MoE kernels tailored for attention sinks and sliding window patterns, and employs asynchronous scheduling to maximize hardware utilization by overlapping CPU and GPU operations.
Use the init script when creating the VM so that NVIDIA CUDA drivers, frameworks such as PyTorch or TensorFlow, and Jupyter notebooks are preinstalled for you.
Step 1: SSH into your VM and install the uv package manager, since vLLM recommends it
```bash
pip install uv
```
Step 2: Create a virtual environment
```bash
uv venv --python 3.12 --seed
source .venv/bin/activate
```
Step 3: Install the latest version of vLLM that supports gpt-oss
```bash
uv pip install --pre vllm==0.10.1+gptoss \
  --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
  --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
  --index-strategy unsafe-best-match
```
Step 4: Run the vLLM server. Loading the model weights takes a while the first time, especially for gpt-oss-120b
```bash
vllm serve openai/gpt-oss-120b --tensor-parallel-size 2
```
Note: Since we used 2x H100s, we set --tensor-parallel-size 2 to leverage tensor parallelism across both GPUs. If you're running on a single GPU, drop that option.
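If you do fall back to a single H100, a launch command reflecting the constraints mentioned in the prerequisites might look like the sketch below (the flag values mirror the 95% utilization and 1,024-token limits described earlier; tune them for your workload):

```bash
# Single-H100 sketch: raise GPU memory utilization and shrink the context window
# so the unquantized 120B weights fit (values mirror the prerequisites above)
vllm serve openai/gpt-oss-120b \
  --gpu-memory-utilization 0.95 \
  --max-model-len 1024
```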
Step 5: We installed Jupyter to make it easy to run the prompts; however, you can also run them from the terminal, either as Python files or via cURL commands.
```bash
uv pip install notebook
jupyter notebook --allow-root --no-browser --ip=0.0.0.0
```
To control the degree of chain-of-thought (CoT) reasoning, set the reasoning level in the system prompt to high, medium, or low, e.g., "Reasoning: high".
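As an illustration, here is a minimal cURL request against vLLM's OpenAI-compatible endpoint that sets the reasoning level via the system prompt (the port and model name assume the default `vllm serve` invocation above):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [
      {"role": "system", "content": "Reasoning: high"},
      {"role": "user", "content": "Explain the Collatz Conjecture in detail."}
    ]
  }'
```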
The vLLM team has created a demo for tool calling and also recommends an MCP client approach for using tools in production workloads. Check out OpenAI's cookbook on GitHub to see how to integrate these models with the Agents SDK.
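To get a feel for what a tool-calling request looks like, here is a sketch using the standard Chat Completions `tools` field; `get_weather` is a hypothetical function, and depending on your vLLM launch flags you may need to enable tool parsing as shown in the vLLM demo:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [{"role": "user", "content": "What is the weather in London right now?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```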
The model card on Hugging Face also lists several other ways to self-host gpt-oss, with tools such as Ollama, Transformers, and LM Studio.
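For example, running the smaller variant locally with Ollama should be a one-liner (the model tag is assumed from the Ollama library listing; check the model card for the exact name):

```bash
# Pull and chat with gpt-oss-20b via Ollama (tag assumed; verify on the Ollama library)
ollama run gpt-oss:20b
```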
How good is gpt-oss?
Verbal Analysis:
We ran a few verbal-analysis prompts to test gpt-oss-120b, and it performed very well.


Math Problems:
gpt-oss-120b got the right answers to all the math problems we asked. In our opinion, its math performance was better than Magistral and Mistral Small, and similar to Qwen3 235B.
Prompt: Find all saddle points of the function f(x, y) = x³ + y³ - 3x - 12y + 20.
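For reference, the expected answer can be worked out by hand: the critical points satisfy \(f_x = 3x^2 - 3 = 0\) and \(f_y = 3y^2 - 12 = 0\), giving \(x = \pm 1\) and \(y = \pm 2\). With \(f_{xx} = 6x\), \(f_{yy} = 6y\) and \(f_{xy} = 0\), the Hessian determinant is \(D = 36xy\), which is negative exactly when \(x\) and \(y\) have opposite signs, so the saddle points are \((1, -2)\) and \((-1, 2)\).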

Prompt: Compute the area of the region enclosed by the graphs of the given equations “y=x, y=2x, and y=6-x”. Use vertical cross-sections
Response Summary:
So the region enclosed by \(y=x\), \(y=2x\) and \(y=6-x\) has an area of **3 square units**.
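That value checks out with the vertical-slice setup: the three lines intersect at \((0,0)\), \((2,4)\) and \((3,3)\), so
\[
A = \int_0^2 \big(2x - x\big)\,dx + \int_2^3 \big((6 - x) - x\big)\,dx = 2 + 1 = 3.
\]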
Prompt: What is larger: 134.59 or 134.6?

Coding Prompts:
Although gpt-oss showed strong coding performance and was more capable than models such as Llama 4 and Magistral Small, we felt it fell a little short of Qwen3 235B's one-shot accuracy.
Prompt: Write the Snake game in Python
The code from gpt-oss compiled easily, but the output didn't entirely follow the rules of the classic Snake game.
Prompt: Write a program in Python to create the tetris game
The model returned code that created an excellent version of the Tetris game.
Prompt: Create an SVG of a smiling dog

gpt-oss got this SVG nearly right. The xmlns attribute was “http://www.w3.org/2000/svg\”; once the trailing ‘\’ character was removed, the SVG loaded perfectly.
How fast is gpt-oss?
gpt-oss-120b is extraordinarily fast compared to most open-source models.
Here are some example prompts and the throughput numbers from the vLLM log:
Prompt: Explain what MXFP4 quantization is.
Avg generation throughput: 31.4 tokens/second

Prompt: Explain the Collatz Conjecture in detail.
Avg generation throughput: 158.4 tokens/second

Build limitless AI on Ori
Ori Global Cloud provides flexible infrastructure for any team, model, and scale. Backed by top-tier GPUs, performant storage, and AI-ready networking, Ori enables growing AI businesses and enterprises to deploy their AI models and applications:
- Deploy Private Clouds for flexible and secure enterprise AI.
- Leverage GPU Instances as on-demand virtual machines.
- Operate Inference Endpoints effortlessly at any scale.
- Scale GPU Clusters for training and inference.
- Manage AI workloads on Serverless Kubernetes without infrastructure overhead.