
How to run NVIDIA Nemotron 3 Nano on a cloud GPU with vLLM

Learn how to deploy NVIDIA Nemotron 3 Nano on production-grade GPUs using vLLM and measure real-world throughput and latency.
Deepak Manoor
Posted: December 17, 2025
    NVIDIA Nemotron Nano

    NVIDIA has consistently expanded the Nemotron family over the past year, releasing models that target specific bottlenecks in the AI development lifecycle. We saw the Nemotron-4 340B address synthetic data generation, followed closely by the Llama-3.1-Nemotron reward models, which optimized alignment and RLHF workflows. The latest addition, the Nemotron 3 family, specifically Nemotron 3 Nano, represents a shift in focus toward architectural efficiency and inference throughput.

    Rather than simply scaling up parameter counts, NVIDIA is using a hybrid Mamba-Transformer architecture paired with a Mixture-of-Experts (MoE) design. This approach attempts to balance the long-context capabilities of State Space Models (SSMs) with the established reasoning performance of Transformers. The practical goal is to deliver strong multi-step reasoning in a model that is computationally lighter to run, making it cheaper to deploy and more efficient at inference.

    Here is a summary of the Nemotron 3 Nano specifications:

    Attribute         Nemotron 3 Nano
    Architecture      Hybrid Mamba-Transformer Mixture-of-Experts (MoE)
    Parameters        30B total (approx. 3.2B active)
    Context Window    1M tokens
    License           NVIDIA Open Model License (commercial and non-commercial)

    According to NVIDIA's benchmarks, Nemotron 3 Nano performs competitively against models such as Qwen 3 and gpt-oss-20b on reasoning tasks. Thanks to its sparse MoE design, it activates only about 3.2B of its 30B parameters per token, which reduces the compute and memory needed per token during inference.

    Nemotron Nano vs Qwen 3 vs gpt-oss

    How to run Nemotron 3 Nano on an H100 GPU

    Prerequisites

    To get started, create a GPU virtual machine (VM) on Ori Global Cloud.

    We selected the NVIDIA H100 for this tutorial because it offers a strong combination of availability and cost-efficiency. Upgrading to H200 or B200 hardware would unlock higher performance and larger context windows, but the steps in this guide remain the same across these architectures.

    NVIDIA has released the post-trained and pre-trained BF16 variants as well as the quantized FP8 version. We’ll be running the leaner FP8 model on an H100 GPU for this tutorial.

    Quick Tip

    Use the initialization script during VM creation to pre-install the NVIDIA CUDA drivers and PyTorch.

    Step 1: SSH into your VM and set up the environment

    Bash/Shell
    apt install python3.12-venv
    python3.12 -m venv nemo-env
    source nemo-env/bin/activate

    Step 2: Install the latest vLLM

    Bash/Shell
    pip install -U "vllm>=0.12.0"
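
    Optionally, confirm the installed version before moving on. A quick check from the same virtual environment:

    Python
    # Print the vLLM version to confirm the install picked up 0.12.0 or newer
    import vllm

    print(vllm.__version__)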

    Step 3: Download the Nemotron 3 Nano reasoning parser

    Bash/Shell
    wget https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8/resolve/main/nano_v3_reasoning_parser.py

    Step 4: Run the vLLM server

    We will serve the nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 model. The command below caps the context length at 262,144 tokens and stores the KV cache in FP8, which keeps the deployment within the H100's 80 GB of VRAM.
    Bash/Shell
    VLLM_USE_FLASHINFER_MOE_FP8=1 vllm serve nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 \
      --max-num-seqs 8 \
      --tensor-parallel-size 1 \
      --max-model-len 262144 \
      --port 8000 \
      --trust-remote-code \
      --enable-auto-tool-choice \
      --tool-call-parser qwen3_coder \
      --reasoning-parser-plugin nano_v3_reasoning_parser.py \
      --reasoning-parser nano_v3 \
      --kv-cache-dtype fp8

    Here’s a snapshot of the GPU instance showing memory usage of about 73 GB of VRAM.

    How much VRAM memory for Nemotron 3 Nano
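
    Once the weights finish loading, a quick way to confirm the endpoint is live is to query the /v1/models route exposed by vLLM's OpenAI-compatible server. Here is a minimal sketch using only the Python standard library, assuming you run it on the VM itself with the server on the default port 8000:

    Python
    import json
    import urllib.request

    # Ask the vLLM OpenAI-compatible server which models it is serving
    with urllib.request.urlopen("http://localhost:8000/v1/models") as resp:
        data = json.load(resp)

    # Expect: nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8
    for model in data["data"]:
        print(model["id"])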

    Step 5: Test the model with cURL

    You can interact with the model using standard tools like curl.

    Bash/Shell
    curl http://VM-IP:8000/v1/chat/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
            "messages": [{"role": "user", "content": "Explain Jevon'\''s Paradox in a single sentence"}],
            "chat_template_kwargs": {"enable_thinking": false}
        }' | jq -r '.choices[0].message.content'

    Step 6: Install Jupyter Notebook for easier interaction and use the OpenAI Python SDK

    Bash/Shell
    pip install notebook openai
    jupyter notebook --allow-root --no-browser --ip=0.0.0.0
    Python
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",
        api_key="EMPTY"
    )

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "5.9 - 5.11"}
    ]

    response = client.chat.completions.create(
        model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
        messages=messages,
        extra_body={"chat_template_kwargs": {"enable_thinking": False}}
    )

    print(response.choices[0].message.content)
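
    The same endpoint also supports streaming, which is handy when you want to watch tokens arrive in real time rather than wait for the full completion. A minimal sketch, reusing the client from the cell above:

    Python
    # Stream the response chunk-by-chunk instead of waiting for the full answer
    stream = client.chat.completions.create(
        model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
        messages=[{"role": "user", "content": "Explain continuous batching in two sentences."}],
        extra_body={"chat_template_kwargs": {"enable_thinking": False}},
        stream=True,
    )

    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
    print()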

    Step 7: Tool calling with Nemotron 3 Nano

    Define a simple tip-calculator tool and let the model decide when to call it. You can follow similar recipes to build agentic workflows with Nemotron 3 Nano.

    Python
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8000/v1",
        api_key="EMPTY"
    )

    # A simple tip-calculator tool the model can choose to call
    TOOLS = [
        {
            "type": "function",
            "function": {
                "name": "calculate_tip",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "bill_total": {
                            "type": "integer",
                            "description": "The total amount of the bill"
                        },
                        "tip_percentage": {
                            "type": "integer",
                            "description": "The percentage of tip to be applied"
                        }
                    },
                    "required": ["bill_total", "tip_percentage"]
                }
            }
        }
    ]

    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "My bill is $50. What will be the amount for 15% tip?"}
    ]

    response = client.chat.completions.create(
        model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
        messages=messages,
        tools=TOOLS,
        temperature=0.6,
        top_p=0.95,
        max_tokens=512,
        stream=False
    )

    # The model's reasoning trace, extracted by the nano_v3 reasoning parser
    print(response.choices[0].message.reasoning_content)
    # The structured tool call the model decided to make
    print(response.choices[0].message.tool_calls)
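
    The response above only contains the model's decision to call calculate_tip. To complete the loop, execute the function yourself and send the result back as a tool message so the model can phrase the final answer. The sketch below continues from the cell above; calculate_tip here is our own illustrative implementation, and it assumes the server populates tool_calls in the standard OpenAI-compatible schema:

    Python
    import json

    def calculate_tip(bill_total: int, tip_percentage: int) -> dict:
        # Our own implementation of the tool the model requested
        tip = bill_total * tip_percentage / 100
        return {"tip": tip, "total_with_tip": bill_total + tip}

    tool_call = response.choices[0].message.tool_calls[0]
    args = json.loads(tool_call.function.arguments)
    result = calculate_tip(**args)

    # Record the assistant's tool call and our result, then ask for the final answer
    messages.append({"role": "assistant", "tool_calls": [tool_call.model_dump()]})
    messages.append({
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": json.dumps(result),
    })

    final = client.chat.completions.create(
        model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
        messages=messages,
        tools=TOOLS,
    )
    print(final.choices[0].message.content)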



    How fast is Nemotron 3 Nano?

    We were impressed with the generation speeds from Nemotron 3 Nano, which outpaced most other open-source models we have tested. On the server side, vLLM reported 223 tokens per second on an H100 GPU for a single request, excellent inference throughput compared with the 158 tokens/second we observed with gpt-oss-120b earlier this year.

    Nemotron Nano 3 Tokens per second

    We also visualized the vLLM metrics by collecting them in Prometheus and displaying them on a Grafana dashboard, following this guide. The dashboard showed generation speeds of 185 tokens/second. Although there is some variance against the numbers reported in the vLLM terminal, it reinforces the high throughput of Nemotron's architecture and the efficiency of vLLM's continuous batching.

    Similarly, time-to-first-token (TTFT) figures under 100 ms indicate noticeably better responsiveness than models such as Qwen 3 and gpt-oss.

    Nemotron 3 Nano tokens per second
    Nemotron 3 Nano time-to-first-token (ttft)
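
    If you want to reproduce a rough client-side version of these numbers, you can time a streamed request yourself. The sketch below measures TTFT and decode throughput for a single request; treat it as an approximation, since network overhead and chunking mean it will not exactly match the server-side vLLM logs:

    Python
    import time
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    start = time.perf_counter()
    first_token_at = None
    chunks = 0

    stream = client.chat.completions.create(
        model="nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8",
        messages=[{"role": "user", "content": "Write a 300-word summary of the history of GPUs."}],
        extra_body={"chat_template_kwargs": {"enable_thinking": False}},
        max_tokens=512,
        stream=True,
    )

    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.perf_counter()
            chunks += 1

    end = time.perf_counter()
    print(f"TTFT: {(first_token_at - start) * 1000:.0f} ms")
    # Each content chunk is roughly one token, so this approximates decode speed
    print(f"~{chunks / (end - first_token_at):.0f} tokens/s after the first token")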

    Initial Impressions

    We ran a few standard tests to see how Nemotron 3 Nano handles common logic and coding prompts, including math reasoning tasks.

    Tokenization & Logic:

    Prompt: How many 'r's in “strawberry”?

    Nemotron 3:

    Nemotron Strawberry

    Prompt: How many 'l's in “strawberry”?

    Nemotron 3: The model got this one wrong, stating that the word “strawberry” contains one letter ‘l’ (it contains none).

    Nemotron NVIDIA

    Mathematical Reasoning:

    Prompt: Find all saddle points of the function $f(x, y) = x^3 + y^3 - 3x - 12y + 20$.

    Nemotron 3: The model correctly applied the second derivative test and identified both saddle points without errors in its calculation steps.

    Nemotron Math Performance
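
    For reference, the expected working: the critical points come from $\nabla f = (3x^2 - 3,\ 3y^2 - 12) = (0, 0)$, giving $x = \pm 1$ and $y = \pm 2$; the discriminant is $D = f_{xx} f_{yy} - f_{xy}^2 = (6x)(6y) - 0 = 36xy$, which is negative exactly at $(1, -2)$ and $(-1, 2)$, so those two points are the saddle points.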

    Prompt: Compute the area of the region enclosed by the graphs of the given equations “y=x, y=2x, and y=6-x”. Use vertical cross-sections.

    Nemotron 3: The model got this one wrong: the correct answer is 3, not the 11/4 it produced.

    Nemotron 3 Nano Math
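
    For reference, with vertical cross-sections the region splits at $x = 2$, where $y = 2x$ meets $y = 6 - x$: the area is $\int_0^2 (2x - x)\,dx + \int_2^3 \big((6 - x) - x\big)\,dx = 2 + 1 = 3$.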

    Code Generation:

    Prompt: We asked for Python code for the Snake game. Nemotron 3 Nano got it right on the first try, producing a fully working simulation of the game. Here’s a snapshot of the game.

    Prompt: Create an SVG of a smiling dog

    Again, Nemotron 3 Nano was able to create the SVG correctly on the first pass.

    Overall, Nemotron 3 Nano performs reliably for its size. It demonstrates strong logical consistency and multi-step reasoning capabilities, suggesting that the hybrid architecture is effective at maintaining coherence without the computational cost of a dense 70B model.

    Scale your AI on Ori

    Deploying hybrid models like Nemotron 3 benefits significantly from robust, high-performance infrastructure. Ori’s AI Cloud provides the flexibility and top-tier compute required to support dynamic workloads, helping you bridge the gap between initial prototyping and production deployment.

    • GPU Instances: Gain instant access to top-tier GPUs required for efficient inference.
    • Supercomputers: Instant, bare-metal GPU clusters connected by InfiniBand networking.
    • Inference Endpoints: Integrate the latest open-source models into your applications via scalable, low-latency APIs.
    • GPU Clusters: Orchestrate high-performance clusters for fine-tuning or training foundation models at scale.
    • Ori AI Fabric: License the same platform that powers Ori AI Cloud to build your own AI-centric GPU compute cloud.

