Tutorials

How to run Qwen 3 235B on a cloud GPU

Deepak Manoor
Posted: May 6, 2025

    Alibaba’s Qwen series of AI models has rapidly emerged as a strong open-source alternative to state-of-the-art (SOTA) models, often rivaling, and in some benchmarks exceeding, their performance. The latest generation, Qwen 3, offers a versatile family of generative AI models that blend high performance with broad accessibility. These models are designed with hybrid reasoning capabilities, allowing them to handle simple tasks efficiently while dynamically shifting to tackle more complex problems. The Qwen 3 lineup includes both dense and Mixture-of-Experts (MoE) architectures, ranging from 0.6 billion to 235 billion parameters, all available under the permissive Apache 2.0 license. Here’s a brief overview of Qwen 3’s key specifications:

    Qwen 3
    Architecture: Dense and Mixture-of-Experts (MoE) Transformers; Hybrid Reasoning Modes (Thinking & Non-Thinking)
    Parameters: Dense: 0.6B, 1.7B, 4B, 8B, 14B, 32B; MoE: 30B (3B active), 235B (22B active)
    Model Variants: Dense, MoE
    Context length / Generation length: Dense (0.6B-4B): 32K tokens; Dense (8B-32B) & MoE: 128K tokens
    Licensing: Apache 2.0: Commercial and research

    Performance benchmarks from Artificial Analysis indicate that Qwen 3 235B A22B compares well with other top-of-the-line models from OpenAI, Google, and DeepSeek.

    Qwen 3 Performance

    Source: Artificial Analysis

    How to run Qwen 3 with Ollama

    Pre-requisites

    Create a GPU virtual machine (VM) on Ori Global Cloud. We chose a setup with 4x NVIDIA H100 SXM GPUs and Ubuntu 22.04 as our OS; however, 2x H100s are enough, since Ollama needs about 143 GB of VRAM to run Qwen 3 235B.
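    As a quick sanity check on GPU count, here is a back-of-the-envelope sketch (the 80 GB per H100 SXM and the ~143 GB VRAM figure are the assumptions):

```python
import math

MODEL_VRAM_GB = 143  # approximate VRAM Ollama needs for qwen3:235b
H100_VRAM_GB = 80    # memory per NVIDIA H100 SXM GPU

# Minimum number of GPUs whose combined VRAM fits the model
gpus_needed = math.ceil(MODEL_VRAM_GB / H100_VRAM_GB)
print(gpus_needed)  # 2
```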

    Quick tip

    Use the init script when creating the VM so that NVIDIA CUDA drivers, frameworks such as PyTorch and TensorFlow, and Jupyter notebooks are preinstalled for you.

    Step 1: SSH into your VM, install Python and create a virtual environment

    Bash/Shell
    apt install python3.12-venv
    python3.12 -m venv qwen-env

    Step 2: Activate the virtual environment

    Bash/Shell
    source qwen-env/bin/activate

    Step 3: Install Ollama and specify the number of GPUs to be used

    Bash/Shell
    curl -fsSL https://ollama.com/install.sh | sh
    export OLLAMA_GPU_COUNT=4

    Step 4: Run Qwen 3 235B with Ollama

    Bash/Shell
    ollama run qwen3:235b --verbose

    Here’s what our setup looks like with Ollama running:

    Qwen 3 GPU Setup

    Step 5: In another terminal window, install Open WebUI on the VM and run it

    Bash/Shell
    pip install open-webui
    open-webui serve

    Step 6: Access Open WebUI in your browser through the default 8080 port:

    http://<VM-IP>:8080/

    Click “Get Started” to create an Open WebUI account if this is your first time running it on the virtual machine.

    Qwen 3 OpenwebUI

    Step 7: Choose qwen3:235b from the Models drop-down and chat away!

    Comparing Thinking and Non-Thinking modes

    Being a hybrid model, Qwen 3 235B A22B can switch between thinking and non-thinking modes. Append the “/think” or “/no_think” tag to your prompts to choose the mode you want to use.
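    Besides the chat UI, Ollama also exposes an HTTP API on port 11434, so you can toggle the mode programmatically. Here is a minimal sketch (the helper names are ours; the endpoint and payload fields follow Ollama's /api/generate API):

```python
import json
import urllib.request

def build_payload(prompt: str, thinking: bool) -> dict:
    """Append Qwen 3's soft-switch tag to the prompt (hypothetical helper)."""
    tag = "/think" if thinking else "/no_think"
    return {"model": "qwen3:235b", "prompt": f"{prompt} {tag}", "stream": False}

def ask(prompt: str, thinking: bool, host: str = "http://localhost:11434") -> str:
    """Send a single non-streaming generation request to the local Ollama server."""
    req = urllib.request.Request(
        f"{host}/api/generate",
        data=json.dumps(build_payload(prompt, thinking)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# Example (requires Ollama running on the VM):
# print(ask("What is larger: 134.59 or 134.6?", thinking=False))
```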

    Here is a comparison of thinking and non-thinking responses to our prompt.

    Prompt: Compute the area of the region enclosed by the graphs of the equations y = x, y = 2x, and y = 6 - x. Use vertical cross-sections.

    Qwen 3 got the answer (3) right in both modes. However, thinking mode took far longer (4m 16s vs. 15s), with the model second-guessing itself repeatedly.
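    The answer is easy to verify with vertical cross-sections: the three lines intersect at (0, 0), (2, 4), and (3, 3), so the upper boundary is y = 2x on [0, 2] and y = 6 - x on [2, 3], with y = x below throughout. A quick numerical check:

```python
def integrate(f, a, b, n=10_000):
    """Midpoint-rule numerical integration (exact for linear integrands)."""
    h = (b - a) / n
    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

# Vertical cross-sections: integrate (top - bottom) over each x-interval
area = integrate(lambda x: 2 * x - x, 0, 2) + integrate(lambda x: (6 - x) - x, 2, 3)
print(round(area, 6))  # 3.0
```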

    Thinking Mode:

    Qwen 3 Thinking Mode

    Non-thinking Mode:

    Qwen 3 Non thinking

    Prompt: What is larger: 134.59 or 134.6?

    Although both modes returned the correct answer that 134.6 is larger, the thinking mode took about 12 times as long as the non-thinking mode.
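    The comparison itself takes one line of code to verify (using Decimal to sidestep any binary floating-point quibbles):

```python
from decimal import Decimal

a, b = Decimal("134.59"), Decimal("134.6")
print(b > a)  # True: 134.6 = 134.60 > 134.59
```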

    Thinking Mode:

    Qwen 3 Thinking Performance

    Non-thinking Mode:

    Qwen 3 Performance Non-thinking

    Our thoughts on Qwen 3

    Speed

    We tried a few coding and math prompts on Qwen 3 with Ollama’s verbose mode. In terms of speed, we saw strong performance of 23-25 tokens per second on our 4x NVIDIA H100 SXM setup.
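    At that rate, generation time scales roughly linearly with response length; a rough estimate (the ~24 tok/s midpoint and the 1,000-token response length are assumptions for illustration):

```python
TOKENS_PER_SECOND = 24   # midpoint of the 23-25 tok/s we observed
response_tokens = 1_000  # hypothetical response length

seconds = response_tokens / TOKENS_PER_SECOND
print(round(seconds, 1))  # 41.7
```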

    Accuracy

    Qwen 3 got most of our prompts right, such as generating Python code for Snake and Tetris games.

    However, it did struggle with the prompt below.

    Prompt: "Write a Python program that shows a ball bouncing inside a spinning hexagon. The ball should be affected by gravity and friction, and it must bounce off the rotating walls realistically."

    The generated Python code rendered the ball bouncing outside the hexagon.

    Qwen 3 - bouncing ball in a spinning hexagon

    Reasoning

    Qwen 3’s hybrid operation (thinking and non-thinking modes) lets users turn on thinking mode only for very hard problems. However, Qwen 3 is prone to “overthinking”: it tends to reason for too long even on fairly straightforward prompts. For example, on the math problem below, Qwen 3 reasoned for several minutes longer than DeepSeek R1 70B Distill.

    Qwen Math Reasoning
    Qwen 3 Math Good
    Qwen 3 Problem Solving
    Qwen 3 math solving

    Qwen 3 is an impressive step forward for open-source AI. It’s fast, flexible, and capable of handling everything from simple queries to complex reasoning, thanks to its hybrid architecture. Running the 235B model on Ori’s H100 GPU instances with Ollama was smooth and efficient, even with its hefty requirements. The ability to toggle between "thinking" and "non-thinking" modes gives users control over speed and depth, though it’s clear the model can sometimes overthink when it doesn’t need to. For teams looking to experiment, build, or deploy powerful AI models on secure infrastructure, Qwen 3 on Ori is a solid combination.

    Chart your own AI reality with Ori

    Ori Global Cloud provides flexible infrastructure for any team, model, and scale. Backed by top-tier GPUs, performant storage, and AI-ready networking, Ori enables growing AI businesses and enterprises to deploy their AI models and applications in a variety of ways:
