Tutorials

How to run Qwen 3 235B on a cloud GPU

Deepak Manoor
Posted: May 6, 2025

    Alibaba’s Qwen series of AI models has rapidly emerged as a strong open-source alternative to state-of-the-art (SOTA) models, often rivaling, and in some benchmarks exceeding, their performance. The latest generation, Qwen 3, offers a versatile family of generative AI models that blend high performance with broad accessibility. These models are designed with hybrid reasoning capabilities, allowing them to handle simple tasks efficiently while dynamically shifting to tackle more complex problems. The Qwen 3 lineup includes both dense and Mixture-of-Experts (MoE) architectures, ranging from 0.6 billion to 235 billion parameters, all available under the permissive Apache 2.0 license. Here’s a brief overview of Qwen 3’s key specifications:

    Qwen 3
    Architecture: Dense and Mixture-of-Experts (MoE) Transformers; Hybrid Reasoning Modes (Thinking & Non-Thinking)
    Parameters: Dense: 0.6B, 1.7B, 4B, 8B, 14B, 32B; MoE: 30B (3B active), 235B (22B active)
    Model Variants: Dense, MoE
    Context length / Generation length: Dense (0.6B-4B): 32K tokens; Dense (8B-32B) & MoE: 128K tokens
    Licensing: Apache 2.0: Commercial and research

    Performance benchmarks from Artificial Analysis indicate that Qwen 3 235B A22B compares well with other top-of-the-line models from OpenAI, Google, and DeepSeek.

    Qwen 3 Performance

    Source: Artificial Analysis

    How to run Qwen 3 with Ollama

    Pre-requisites

    Create a GPU virtual machine (VM) on Ori Global Cloud. We chose a setup with 4x NVIDIA H100 SXM GPUs and Ubuntu 22.04 as our OS; however, 2x H100s are enough, since Ollama needs about 143 GB of VRAM to run Qwen 3 235B.
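    As a quick sanity check on GPU count, here is a back-of-the-envelope sketch (the 80 GB per H100 SXM and the ~143 GB VRAM figure are the assumptions):

```python
import math

MODEL_VRAM_GB = 143  # approximate VRAM Ollama needs for qwen3:235b
H100_VRAM_GB = 80    # memory per NVIDIA H100 SXM GPU

# Minimum number of GPUs whose combined VRAM fits the model
gpus_needed = math.ceil(MODEL_VRAM_GB / H100_VRAM_GB)
print(gpus_needed)  # 2
```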

    Quick tip

    Use the init script when creating the VM so that NVIDIA CUDA drivers, frameworks such as PyTorch and TensorFlow, and Jupyter notebooks are preinstalled for you.

    Step 1: SSH into your VM, install Python and create a virtual environment

    Bash/Shell
    apt install python3.12-venv
    python3.12 -m venv qwen-env

    Step 2: Activate the virtual environment

    Bash/Shell
    source qwen-env/bin/activate

    Step 3: Install Ollama and specify the number of GPUs to be used

    Bash/Shell
    curl -fsSL https://ollama.com/install.sh | sh
    export OLLAMA_GPU_COUNT=4

    Step 4: Run Qwen 3 235B with Ollama

    Bash/Shell
    ollama run qwen3:235b --verbose

    Here’s what our setup looks like with Ollama running:

    Qwen 3 GPU Setup

    Step 5: In another terminal window, install Open WebUI on the VM and run it

    Bash/Shell
    pip install open-webui
    open-webui serve

    Step 6: Access Open WebUI in your browser through the default 8080 port:

    http://<VM-IP>:8080/

    Click “Get Started” to create an Open WebUI account if this is your first time running it on the virtual machine.

    Qwen 3 OpenwebUI

    Step 7: Choose qwen3:235b from the Models drop-down and chat away!

    Comparing Thinking and Non-Thinking modes

    Being a hybrid model, Qwen 3 235B A22B can switch between thinking and non-thinking modes. Append the “/think” or “/no_think” tag to your prompts to choose the mode you want to use.
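    Besides the chat UI, Ollama also exposes an HTTP API on port 11434, so you can toggle the mode programmatically. Here is a minimal sketch (the helper names are ours; the endpoint and payload fields follow Ollama's /api/generate API):

```python
import json
import urllib.request

def build_payload(prompt: str, thinking: bool) -> dict:
    """Append Qwen 3's soft-switch tag to the prompt (hypothetical helper)."""
    tag = "/think" if thinking else "/no_think"
    return {"model": "qwen3:235b", "prompt": f"{prompt} {tag}", "stream": False}

def ask(prompt: str, thinking: bool, host: str = "http://localhost:11434") -> str:
    """Send a single non-streaming generation request to the local Ollama server."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_payload(prompt, thinking)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires Ollama running on the VM):
# print(ask("What is larger: 134.59 or 134.6?", thinking=False))
```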

    Here is a comparison of thinking and non-thinking responses to our prompt.

    Prompt: Compute the area of the region enclosed by the graphs of the equations y = x, y = 2x, and y = 6 - x. Use vertical cross-sections.

    Qwen 3 got the answer (3) right in both modes. However, thinking mode took far longer (4m 16s vs. 15s), with the model second-guessing itself repeatedly.
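    The answer is easy to verify with vertical cross-sections: the three lines intersect at (0, 0), (2, 4), and (3, 3), so the upper boundary is y = 2x on [0, 2] and y = 6 - x on [2, 3], with y = x below throughout. A quick numerical check:

```python
def integrate(f, a, b, n=10_000):
    """Midpoint-rule numerical integration (exact for linear integrands)."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

# Vertical cross-sections: integrate (top - bottom) over each x-interval
area = integrate(lambda x: 2 * x - x, 0, 2) + integrate(lambda x: (6 - x) - x, 2, 3)
print(round(area, 6))  # 3.0
```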

    Thinking Mode:

    Qwen 3 Thinking Mode

    Non-thinking Mode:

    Qwen 3 Non thinking

    Prompt: What is larger: 134.59 or 134.6?

    Although both modes returned the correct answer that 134.6 is larger, the thinking mode took about 12 times as long as the non-thinking mode.
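    The comparison itself takes one line of code to verify (using Decimal to sidestep any binary floating-point quibbles):

```python
from decimal import Decimal

a, b = Decimal("134.59"), Decimal("134.6")
print(b > a)  # True: 134.6 = 134.60 > 134.59
```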

    Thinking Mode:

    Qwen 3 Thinking Performance

    Non-thinking Mode:

    Qwen 3 Performance Non-thinking

    Our thoughts on Qwen 3

    Speed

    We tried a few coding and math prompts on Qwen 3 with Ollama’s verbose mode. In terms of speed, we saw strong performance of 23-25 tokens per second on our 4x NVIDIA H100 SXM setup.
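    At that rate, generation time scales roughly linearly with response length; a rough estimate (the ~24 tok/s midpoint and the 1,000-token response length are assumptions for illustration):

```python
TOKENS_PER_SECOND = 24   # midpoint of the 23-25 tok/s we observed
response_tokens = 1_000  # hypothetical response length

seconds = response_tokens / TOKENS_PER_SECOND
print(round(seconds, 1))  # 41.7
```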

    Accuracy

    Qwen 3 got most of our prompts right, such as generating Python code for Snake and Tetris games.

    However, it did struggle with the prompt below.

    Prompt: "Write a Python program that shows a ball bouncing inside a spinning hexagon. The ball should be affected by gravity and friction, and it must bounce off the rotating walls realistically."

    The generated Python code rendered the ball bouncing outside the hexagon.

    Qwen 3 - bouncing ball in a spinning hexagon

    Reasoning

    Qwen 3’s hybrid operation (thinking and non-thinking modes) lets users turn on thinking mode only for very hard problems. However, Qwen 3 is prone to “overthinking”: it tends to reason for too long even on fairly straightforward prompts. For example, on the math problem below, Qwen 3 reasoned for several minutes longer than DeepSeek R1 70B Distill.

    Qwen Math Reasoning
    Qwen 3 Math Good
    Qwen 3 Problem Solving
    Qwen 3 math solving

    Qwen 3 is an impressive step forward for open-source AI. It’s fast, flexible, and capable of handling everything from simple queries to complex reasoning, thanks to its hybrid architecture. Running the 235B model on Ori’s H100 GPU instances with Ollama was smooth and efficient, even with its hefty requirements. The ability to toggle between "thinking" and "non-thinking" modes gives users control over speed and depth, though it’s clear the model can sometimes overthink when it doesn’t need to. For teams looking to experiment, build, or deploy powerful AI models on secure infrastructure, Qwen 3 on Ori is a solid combination.

    Chart your own AI reality with Ori

    Ori Global Cloud provides flexible infrastructure for any team, model, and scale. Backed by top-tier GPUs, performant storage, and AI-ready networking, Ori enables growing AI businesses and enterprises to deploy their AI models and applications in a variety of ways:
