
OpenAI catalyzed the generative AI movement with the release of ChatGPT in 2022. In the years since, ChatGPT models have achieved remarkable scale, now being used by over 700 million users weekly. At the same time, open-source models like Llama, DeepSeek, and Qwen have risen as compelling alternatives to proprietary systems. Marking a major shift, OpenAI has now returned to open source for the first time since GPT-2 in 2019, unveiling two new open weight models: gpt-oss-120b and gpt-oss-20b.
Here’s a brief overview of gpt-oss’ key specifications:
| OpenAI gpt-oss-120b and gpt-oss-20b large language models | |
|---|---|
| Architecture | Mixture-of-experts (MoE) with Rotary Positional Embedding (RoPE) for positional encoding; native MXFP4 quantization for memory efficiency |
| Tool Calling & Agents | Tool (function) calling works via both the Responses and Chat Completions APIs; supports the Agents SDK and the Harmony response format |
| Model Variants | gpt-oss-120b: 117B parameters, of which 5.1B are active; gpt-oss-20b: 21B parameters, of which 3.6B are active |
| Context Length | 128k tokens |
Performance benchmarks indicate that gpt-oss-120b surpasses popular open-source models such as Qwen3 235B and Llama 4 Maverick in intelligence, and it ranks among the fastest models for output speed.

Source: Artificial Analysis
How to run gpt-oss-120b with vLLM on H100 GPUs
Prerequisites to self-host gpt-oss
Create a GPU virtual machine (VM) on Ori Global Cloud. We recommend 2x NVIDIA H100 GPUs so you can use the maximum context window while keeping GPU utilization healthy. Unfortunately, to run gpt-oss-120b on a single H100 without further quantization, you'll need to raise GPU memory utilization to 95% and shrink the context window to 1,024 tokens (a single-GPU launch sketch follows Step 4 below).
We chose vLLM to run the models because it provides flexible parallelism across 2, 4, or 8 GPUs to scale throughput efficiently. It also features high-performance attention and MoE kernels tailored for attention sinks and sliding window patterns, and employs asynchronous scheduling to maximize hardware utilization by overlapping CPU and GPU operations.
Use the init script when creating the VM so that NVIDIA CUDA drivers, frameworks such as PyTorch or TensorFlow, and Jupyter notebooks are preinstalled for you.
Step 1: SSH into your VM and install the uv package manager, since vLLM recommends it
```bash
pip install uv
```
Step 2: Create a virtual environment
```bash
uv venv --python 3.12 --seed
source .venv/bin/activate
```
Step 3: Install the latest version of vLLM that supports gpt-oss
```bash
uv pip install --pre vllm==0.10.1+gptoss \
  --extra-index-url https://wheels.vllm.ai/gpt-oss/ \
  --extra-index-url https://download.pytorch.org/whl/nightly/cu128 \
  --index-strategy unsafe-best-match
```
Step 4: Run the vLLM server. Loading the model weights takes a while the first time, especially for gpt-oss-120b
```bash
vllm serve openai/gpt-oss-120b --tensor-parallel-size 2
```
Note: Since we used 2x H100s, we set --tensor-parallel-size 2 to leverage tensor parallelism across both GPUs. If you're running on a single GPU, drop that option.
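If you do fall back to a single H100, a launch command reflecting the constraints mentioned in the prerequisites might look like the sketch below (the flag values mirror the 95% utilization and 1,024-token limits described earlier; tune them for your workload):

```bash
# Single-H100 sketch: raise GPU memory utilization and shrink the context window
# so the unquantized 120B weights fit (values mirror the prerequisites above)
vllm serve openai/gpt-oss-120b \
  --gpu-memory-utilization 0.95 \
  --max-model-len 1024
```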
Step 5: We installed Jupyter to make it easy to run the prompts; however, you can also run them from the terminal, either as Python files or via cURL commands.
```bash
uv pip install notebook
jupyter notebook --allow-root --no-browser --ip=0.0.0.0
```
To control the degree of chain-of-thought (CoT) reasoning, set the reasoning level in the system prompt to high, medium, or low, e.g., "Reasoning: high".
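As an illustration, here is a minimal cURL request against vLLM's OpenAI-compatible endpoint that sets the reasoning level via the system prompt (the port and model name assume the default `vllm serve` invocation above):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [
      {"role": "system", "content": "Reasoning: high"},
      {"role": "user", "content": "Explain the Collatz Conjecture in detail."}
    ]
  }'
```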
The vLLM team has created a demo for tool calling and also recommends an MCP client approach for using tools in production workloads. Check out OpenAI's cookbook on GitHub to see how to integrate these models with the Agents SDK.
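To get a feel for what a tool-calling request looks like, here is a sketch using the standard Chat Completions `tools` field; `get_weather` is a hypothetical function, and depending on your vLLM launch flags you may need to enable tool parsing as shown in the vLLM demo:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [{"role": "user", "content": "What is the weather in London right now?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }]
  }'
```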
The model card on Hugging Face also lists several other ways to self-host gpt-oss, with tools such as Ollama, Transformers, and LM Studio.
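For example, running the smaller variant locally with Ollama should be a one-liner (the model tag is assumed from the Ollama library listing; check the model card for the exact name):

```bash
# Pull and chat with gpt-oss-20b via Ollama (tag assumed; verify on the Ollama library)
ollama run gpt-oss:20b
```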
How good is gpt-oss?
Verbal Analysis:
We ran a few verbal-analysis prompts to test gpt-oss-120b, and it performed very well.


Math Problems:
gpt-oss-120b got the right answers to all the math problems we asked. In our opinion, its math performance was better than Magistral and Mistral Small, and similar to Qwen3 235B.
Prompt: Find all saddle points of the function f(x, y) = x³ + y³ - 3x - 12y + 20.
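For reference, the expected answer can be worked out by hand: the critical points satisfy \(f_x = 3x^2 - 3 = 0\) and \(f_y = 3y^2 - 12 = 0\), giving \(x = \pm 1\) and \(y = \pm 2\). With \(f_{xx} = 6x\), \(f_{yy} = 6y\) and \(f_{xy} = 0\), the Hessian determinant is \(D = 36xy\), which is negative exactly when \(x\) and \(y\) have opposite signs, so the saddle points are \((1, -2)\) and \((-1, 2)\).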

Prompt: Compute the area of the region enclosed by the graphs of the given equations “y=x, y=2x, and y=6-x”. Use vertical cross-sections
Response Summary:
So the region enclosed by \(y=x\), \(y=2x\) and \(y=6-x\) has an area of **3 square units**.
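That value checks out with the vertical-slice setup: the three lines intersect at \((0,0)\), \((2,4)\) and \((3,3)\), so
\[
A = \int_0^2 \big(2x - x\big)\,dx + \int_2^3 \big((6 - x) - x\big)\,dx = 2 + 1 = 3.
\]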
Prompt: What is larger: 134.59 or 134.6?

Coding Prompts:
Although gpt-oss showed strong coding performance and was more capable than models such as Llama 4 and Magistral Small, we felt it fell a little short of Qwen3 235B's one-shot accuracy.
Prompt: Write the Snake game in Python
The code from gpt-oss compiled easily, but the output didn't entirely follow the rules of the classic Snake game.
Prompt: Write a program in Python to create the tetris game
The model returned code that created an excellent version of the Tetris game.
Prompt: Create an SVG of a smiling dog

gpt-oss got this SVG nearly right. The xmlns attribute was “http://www.w3.org/2000/svg\”; once the trailing ‘\’ character was removed, the SVG loaded perfectly.
How fast is gpt-oss?
gpt-oss-120b is extraordinarily fast compared to most open-source models.
Here are some example prompts and the throughput numbers from the vLLM log:
Prompt: Explain what MXFP4 quantization is.
Avg generation throughput: 31.4 tokens/second

Prompt: Explain the Collatz Conjecture in detail.
Avg generation throughput: 158.4 tokens/second

Build limitless AI on Ori
Ori Global Cloud provides flexible infrastructure for any team, model, and scale. Backed by top-tier GPUs, performant storage, and AI-ready networking, Ori enables growing AI businesses and enterprises to deploy their AI models and applications:
- Deploy Private Clouds for flexible and secure enterprise AI.
- Leverage GPU Instances as on-demand virtual machines.
- Operate Inference Endpoints effortlessly at any scale.
- Scale GPU Clusters for training and inference.
- Manage AI workloads on Serverless Kubernetes without infrastructure overhead.