How to run NVIDIA Nemotron 3 Nano on a cloud GPU with vLLM

NVIDIA has consistently expanded the Nemotron family over the past year, releasing models that target specific bottlenecks in the AI development lifecycle. We saw the Nemotron-4 340B address synthetic data generation, followed closely by the Llama-3.1-Nemotron reward models, which optimized alignment and RLHF workflows. The latest addition, the Nemotron 3 family, specifically Nemotron 3 Nano, represents a shift in focus toward architectural efficiency and inference throughput.
Rather than simply scaling up parameter counts, NVIDIA is utilizing a hybrid Mamba-Transformer architecture paired with a Mixture-of-Experts (MoE) design. This approach attempts to balance the long-context capabilities of State Space Models (SSMs) with the established reasoning performance of Transformers. The practical goal is to provide strong multi-step reasoning capabilities in a model that is computationally lighter to run, making it more accessible for deployment and optimizing inference efficiency.
Here is a summary of the Nemotron 3 Nano specifications:
| Attribute | Nemotron 3 Nano |
|---|---|
| Architecture | Hybrid Mamba-Transformer Mixture-of-Experts (MoE) |
| Parameters | 30B Total (approx. 3.2B Active) |
| Context Window | 1M tokens |
| License | NVIDIA Open Model License (commercial and non-commercial use) |
According to NVIDIA's benchmarks, Nemotron 3 Nano performs competitively against models such as Qwen 3 and gpt-oss-20b on reasoning tasks. Thanks to its sparse MoE design, the model activates only about 3.2B of its 30B parameters per token, which cuts the compute needed for each forward pass during inference (the full set of weights still has to fit in GPU memory).
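As a rough back-of-the-envelope check (our own estimate, not an NVIDIA figure), the FP8 checkpoint stores roughly one byte per parameter, so the weights alone land around 30 GB; the remainder of the VRAM we observe later in this guide goes to the KV cache and runtime overhead:
```python
# Rough FP8 memory estimate (our own approximation, not an official figure)
total_params = 30e9          # 30B total parameters
bytes_per_param_fp8 = 1      # FP8 stores roughly 1 byte per weight

weight_gb = total_params * bytes_per_param_fp8 / 1e9
print(f"Approximate weight footprint: {weight_gb:.0f} GB")  # ~30 GB; KV cache and activations add more
```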

How to run Nemotron 3 on an H100 GPU
Prerequisites
To get started, create a GPU virtual machine (VM) on Ori Global Cloud.
We have selected the NVIDIA H100 for this tutorial because it offers a strong balance of availability and cost-efficiency. Upgrading to H200 or B200 hardware would unlock higher performance and larger context windows, but the steps outlined in this guide remain the same across these architectures.
NVIDIA has released the post-trained and pre-trained BF16 variants as well as the quantized FP8 version. We’ll be running the leaner FP8 model on an H100 GPU for this tutorial.
Use the initialization script during VM creation to pre-install the NVIDIA CUDA drivers and PyTorch.
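Once the VM is up, a quick sanity check (assuming the initialization script installed PyTorch) confirms the GPU and driver are visible:
```python
import torch

# Verify that the CUDA driver and GPU are visible from the pre-installed PyTorch
print(torch.cuda.is_available())       # expect True
print(torch.cuda.get_device_name(0))   # expect an NVIDIA H100
print(torch.version.cuda)              # CUDA version PyTorch was built against
```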
Step 1: SSH into your VM and set up the environment
```bash
apt install python3.12-venv
python3.12 -m venv nemo-env
source nemo-env/bin/activate
```
Step 2: Install the latest vLLM
```bash
pip install -U "vllm>=0.12.0"
```
Step 3: Download the Nemotron 3 Parser
```bash
wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8/resolve/main/nano_v3_reasoning_parser.py
```
Step 4: Run the vLLM server
We will serve the nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 model.
```bash
VLLM_USE_FLASHINFER_MOE_FP8=1 vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
  --max-num-seqs 8 \
  --tensor-parallel-size 1 \
  --max-model-len 262144 \
  --port 8000 \
  --trust-remote-code \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser-plugin nano_v3_reasoning_parser.py \
  --reasoning-parser nano_v3 \
  --kv-cache-dtype fp8
```
Here's a snapshot of the GPU instance showing memory usage of about 73 GB of VRAM.
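Before sending requests, you can confirm the server is ready by querying the OpenAI-compatible model list endpoint (a minimal sketch, assuming the server is running locally on port 8000 as configured above):
```python
import requests

# Query vLLM's OpenAI-compatible /v1/models endpoint to confirm the server is up
resp = requests.get("http://localhost:8000/v1/models", timeout=10)
resp.raise_for_status()

for model in resp.json()["data"]:
    print(model["id"])  # should list nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8
```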

Step 5: Test the model with cURL
You can interact with the model using standard tools like curl:
```bash
curl http://VM-IP:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    "messages": [{"role": "user", "content": "Explain the Jevons Paradox in a single sentence"}],
    "chat_template_kwargs": {"enable_thinking": false}
  }' | jq -r '.choices[0].message.content'
```
Step 6: Install Jupyter Notebook for ease of interaction and run the OpenAI Python SDK
```bash
pip install notebook
jupyter notebook --allow-root --no-browser --ip=0.0.0.0
```
```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "5.9 - 5.11"}
]

response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    messages=messages,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}}
)

print(response.choices[0].message.content)
```
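To see the model's intermediate reasoning as well, you can re-run the same request with thinking enabled and read the reasoning_content field that the nano_v3 reasoning parser separates out (a minimal sketch, reusing the client and messages defined above):
```python
# Re-run the same request with thinking enabled; the reasoning parser splits the
# model's thinking from the final answer (reuses `client` and `messages` from above)
response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    messages=messages,
    extra_body={"chat_template_kwargs": {"enable_thinking": True}}
)

print("Reasoning:", response.choices[0].message.reasoning_content)
print("Answer:", response.choices[0].message.content)
```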
Step 7: Tool calling with Nemotron 3 Nano
Define a tip-calculator tool and let the model decide when to call it. You can follow the same pattern with other tools to build agentic workflows with Nemotron 3 Nano.
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY"
)

# Describe the tip calculator so the model knows when and how to call it
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "calculate_tip",
            "parameters": {
                "type": "object",
                "properties": {
                    "bill_total": {
                        "type": "integer",
                        "description": "The total amount of the bill"
                    },
                    "tip_percentage": {
                        "type": "integer",
                        "description": "The percentage of tip to be applied"
                    }
                },
                "required": ["bill_total", "tip_percentage"]
            }
        }
    }
]

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "My bill is $50. What will be the amount for 15% tip?"}
]

response = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    messages=messages,
    tools=TOOLS,
    temperature=0.6,
    top_p=0.95,
    max_tokens=512,
    stream=False
)

# print(response.choices[0].message.content)
print(response.choices[0].message.reasoning_content)
# print(response.choices[0].message.tool_calls)
```
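When the model opts to use the tool, the response carries a tool_calls entry rather than a final answer. Here is a minimal sketch of closing the loop; note that the calculate_tip implementation below is a hypothetical stand-in written for this example, not something shipped with the model:
```python
import json

def calculate_tip(bill_total: int, tip_percentage: int) -> dict:
    # Hypothetical local implementation of the tool described in TOOLS
    return {"tip_amount": bill_total * tip_percentage / 100}

# Execute the tool call the model requested (reuses `response`, `messages`, `client`, `TOOLS`)
tool_call = response.choices[0].message.tool_calls[0]
args = json.loads(tool_call.function.arguments)
result = calculate_tip(**args)

# Feed the assistant's tool call and the tool result back so the model can phrase the final answer
messages.append({"role": "assistant", "tool_calls": [tool_call.model_dump()]})
messages.append({"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(result)})

final = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    messages=messages,
    tools=TOOLS
)
print(final.choices[0].message.content)
```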
How fast is Nemotron 3 Nano?
We were very impressed with the generation speeds from Nemotron 3 Nano, which outpaced most other open-source models we have tested. On the server side, vLLM reported 223 tokens per second on an H100 GPU for a single request, excellent inference throughput compared to the 158 tokens/second we observed with gpt-oss-120b earlier this year.

We also visualized the vLLM metrics, following this guide to collect the data in Prometheus and chart it on a Grafana dashboard. The dashboard showed generation speeds of 185 tokens/second. Although that number varies somewhat from the one reported in the vLLM terminal logs, it reinforces the high throughput of Nemotron's architecture and its efficient continuous batching.
Similarly, time-to-first-token (TTFT) figures of less than 100 ms indicate better responsiveness than models such as Qwen 3 and gpt-oss.
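To reproduce a rough single-request number yourself, you can time a streamed completion from the client side (a minimal sketch using the OpenAI SDK against the server above; it counts streamed chunks as a proxy for tokens, so expect figures slightly below what vLLM logs server-side):
```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

start = time.perf_counter()
first_token_at = None
chunks = 0

# Stream a longer generation and count chunks as a rough proxy for output tokens
stream = client.chat.completions.create(
    model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
    messages=[{"role": "user", "content": "Write a short essay on GPU inference."}],
    max_tokens=512,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1

elapsed = time.perf_counter() - start
ttft = first_token_at - start
print(f"TTFT: {ttft * 1000:.0f} ms")
print(f"~{chunks / (elapsed - ttft):.0f} chunks/s after the first token")
```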


Initial Impressions
We ran a few standard tests to see how Nemotron 3 Nano handles common logic, math-reasoning, and coding prompts.
Tokenization & Logic:
Prompt: How many 'r's in “strawberry”?
Nemotron 3:

Prompt: How many 'l's in “strawberry”?
Nemotron 3: The model got this one wrong, stating that the word "strawberry" contains one letter 'l' (the correct answer is zero).

Mathematical Reasoning:
Prompt: Find all saddle points of the function $f(x, y) = x^3 + y^3 - 3x - 12y + 20$.
Nemotron 3: The model correctly applied the second derivative test and identified the saddle points without errors in the calculation steps.
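For reference, the computation the model needed to reproduce is short:
$$
f_x = 3x^2 - 3 = 0 \Rightarrow x = \pm 1, \qquad f_y = 3y^2 - 12 = 0 \Rightarrow y = \pm 2,
$$
$$
D = f_{xx} f_{yy} - f_{xy}^2 = (6x)(6y) - 0 = 36xy,
$$
so $D < 0$ (a saddle) exactly when $x$ and $y$ have opposite signs, giving saddle points at $(1, -2)$ and $(-1, 2)$.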

Prompt: Compute the area of the region enclosed by the graphs of the given equations "y=x, y=2x, and y=6-x". Use vertical cross-sections.
Nemotron 3: The model got this one wrong: it answered 11/4, but the correct answer is 3.
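For reference, with vertical cross-sections the region splits at $x = 2$, where $y = 2x$ meets $y = 6 - x$:
$$
A = \int_0^2 (2x - x)\,dx + \int_2^3 \big((6 - x) - x\big)\,dx = \left[\tfrac{x^2}{2}\right]_0^2 + \left[6x - x^2\right]_2^3 = 2 + (9 - 8) = 3.
$$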

Code Generation:
Prompt: We asked for Python code for the Snake game. Nemotron 3 Nano got it right on the first try with a fully working version of the game. Here's a snapshot of the game.
Prompt: Create an SVG of a smiling dog
Again, Nemotron 3 Nano was able to create the SVG correctly on the first pass.
Overall, Nemotron 3 Nano performs reliably for its size. It demonstrates strong logical consistency and multi-step reasoning capabilities, suggesting that the hybrid architecture is effective at maintaining coherence without the computational cost of a dense 70B model.
Scale your AI on Ori
Deploying hybrid models like Nemotron 3 benefits significantly from robust, high-performance infrastructure. Ori’s AI Cloud provides the flexibility and top-tier compute required to support dynamic workloads, helping you bridge the gap between initial prototyping and production deployment.
- GPU Instances: Gain instant access to top-tier GPUs required for efficient inference.
- Supercomputers: Instant, bare-metal GPU clusters connected by Infiniband networking.
- Inference Endpoints: Integrate the latest open-source models into your applications via scalable, low-latency APIs.
- GPU Clusters: Orchestrate high-performance clusters for fine-tuning or training foundation models at scale.
- Ori AI Fabric: License the same platform that powers Ori AI Cloud to build your own AI-centric, GPU compute cloud.


