
Meta’s Llama foundation models have accelerated AI innovation, empowering countless developers and startups with unprecedented access. With over a billion downloads, the Llama series has emerged as the most widely adopted open-source AI model ecosystem. In this article, we’ll demonstrate how to deploy Meta’s new generation of foundation models, Llama 4, on the Ori AI cloud, and provide a comparison with its predecessor, the multimodal Llama 3.2. Here’s a brief overview of Llama 4’s key specifications:
| Specification | Llama 4 Scout and Maverick |
|---|---|
| Architecture | Mixture-of-Experts (MoE); natively multimodal (early fusion integrates text and visual tokens during pre-training) |
| Model variants | Scout: 16 experts, with instruction-tuned and base versions; Maverick: 128 experts, with instruction-tuned and base versions |
| Parameters | Scout: 17B active parameters out of ~109B total; Maverick: 17B active parameters out of ~400B total |
| Capabilities | Instruction-tuned: optimized for code generation, visual reasoning, document summarization, and multimodal assistant tasks |
| Sequence length | Scout: 10 million (16E Instruct), 256K (16E); Maverick: 1 million (128E Instruct), 256K (128E) |
| Licensing | Llama 4 Community License: commercial and research use |
Meta's Llama 4 models demonstrate notable performance across various AI benchmarks compared to leading models like OpenAI's GPT-4o and Google's Gemini 2.0 Flash. Llama 4 utilizes a Mixture-of-Experts (MoE) architecture to enhance efficiency and scalability. Unlike dense models like Llama 3, where all parameters are activated for each token, Llama 4 activates only a subset of specialized "experts" per token. For instance, Llama 4 Scout comprises 16 experts with a total of 109 billion parameters but activates only 17 billion parameters per token during inference, reducing computational costs significantly while maintaining high performance. This design allows Llama 4 to achieve comparable or superior results to larger dense models, such as OpenAI's GPT-4o, but with lower inference costs and improved scalability.
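To make the routing idea concrete, here is a minimal, illustrative sketch of a top-k MoE layer in PyTorch. This is not Meta's implementation (Llama 4 also uses a shared expert and other refinements); the layer sizes, 16 experts, and k=1 routing are assumptions chosen only to keep the example small.

```python
# Minimal, illustrative sketch of token-level MoE routing (not Meta's implementation).
# A router scores each token against every expert, but only the top-k experts
# (k=1 here) actually run for that token, so most parameters stay idle per token.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    def __init__(self, d_model=64, d_ff=256, num_experts=16, top_k=1):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                          # x: (num_tokens, d_model)
        scores = self.router(x)                    # (num_tokens, num_experts)
        weights, indices = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)       # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e       # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

tokens = torch.randn(8, 64)                        # 8 tokens with d_model=64
print(TinyMoELayer()(tokens).shape)                # torch.Size([8, 64])
```

The key point is that every token passes through the router, but only the selected expert's feed-forward weights are exercised for it, which is why the active parameters per token are far fewer than the total parameter count.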
Llama 4 Scout features a 10-million-token context window, significantly surpassing GPT-4o's 128K tokens and Gemini 2.0 Flash's 1 million tokens. This extensive context capability enables Scout to handle long-context tasks effectively.
Llama 4 Maverick, with its 17 billion active parameters, excels in multimodal reasoning and coding tasks. These results highlight Llama 4's advancements in handling complex tasks and extended contexts, positioning it as a strong contender in the AI model landscape.

Source: Meta
How to run Llama 4 Scout with Hugging Face Transformers on an Ori virtual machine
Prerequisites
Create a GPU virtual machine (VM) on Ori Global Cloud. We chose a VM with 4x NVIDIA H100 SXM GPUs (80 GB VRAM each) and 90 GiB of system memory for this demo; however, 8x H100s are recommended to achieve the full context window. We chose Ubuntu 22.04 as our OS, although Debian is also an option. For this demo we'll be loading the Llama 4 Scout Instruct model.
Use the init script when creating the VM so that NVIDIA CUDA drivers and frameworks such as PyTorch or TensorFlow are preinstalled for you.
Step 1: SSH into your VM, install Python and create a virtual environment
apt install python3.10-venv
python3.10 -m venv llama-env
source llama-env/bin/activate
Step 2: Install PyTorch if you didn't use the corresponding init script
pip3 install torch torchvision torchaudio
Step 3: Install the Hugging Face CLI and log in
pip install -U "huggingface_hub[cli]"
huggingface-cli login
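Note that the meta-llama repositories on Hugging Face are gated, so your account needs to have been granted access to Llama 4 before the weights will download. If you prefer to authenticate from Python rather than the CLI, here is a small sketch that assumes your access token is exported in an HF_TOKEN environment variable:

```python
# Alternative to `huggingface-cli login`: authenticate programmatically.
# Assumes your Hugging Face access token is in the HF_TOKEN environment variable.
import os
from huggingface_hub import login

login(token=os.environ["HF_TOKEN"])
```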
Step 4: Install Transformers and other dependencies
pip install transformers==4.51.0
pip install accelerate
pip install hf_xet
pip install auto-gptq bitsandbytes
We ran into issues with version 4.51.1 of Transformers that we did not observe in 4.51.0.
Step 5: Spin up a Jupyter server and open it in your browser using your VM's IP address
pip install notebook
jupyter notebook --allow-root --no-browser --ip=0.0.0.0
Step 6: Create a notebook and load the model with this script
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

processor = AutoProcessor.from_pretrained(model_id)

model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="sdpa",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
Note: We ran into errors using Flex Attention; as an alternative we chose SDPA, which ran smoothly. Loading will take a while the first time, since the weights are being downloaded and loaded into memory.
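If you want to check GPU memory from inside the notebook rather than watching nvidia-smi in a terminal, here is a small sketch using PyTorch's CUDA utilities (the output format is our own):

```python
# Report per-GPU memory usage (used vs. total) from within the notebook.
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)
    print(f"GPU {i}: {(total - free) / 1024**3:.1f} GiB used of {total / 1024**3:.1f} GiB")
```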

Here’s a snapshot of our memory usage

Let's analyze two images, one of a beach and the other of a mountain, and ask the model to describe their similarities and differences. This code snippet is based on the example provided by Hugging Face here.
url1 = "https://cdn.pixabay.com/photo/2019/03/02/18/43/beach-4030372_1280.jpg"
url2 = "https://cdn.pixabay.com/photo/2021/06/28/04/46/mountain-6370590_1280.jpg"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": url1},
            {"type": "image", "url": url2},
            {"type": "text", "text": "Can you describe how these two images are similar, and how they differ?"},
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
)

response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0]
print(response)
Here’s the response from Llama 4 Scout:

This example can be scaled up to compare multiple images in vision-intensive workloads, as sketched below.
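The following sketch reuses the processor and model loaded above and passes several images in a single user turn; the URLs are placeholders and the prompt is our own rather than one of Meta's or Hugging Face's examples.

```python
# Sketch: compare several images at once by adding more image entries to one user turn.
# The URLs below are placeholders; substitute your own images.
urls = [
    "https://example.com/image1.jpg",
    "https://example.com/image2.jpg",
    "https://example.com/image3.jpg",
]
messages = [
    {
        "role": "user",
        "content": [{"type": "image", "url": u} for u in urls]
        + [{"type": "text", "text": "Group these images by scene type and explain your grouping."}],
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0])
```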
Trying out Llama 4’s Industry-leading Context Length with vLLM
One of Llama 4's standout features is its very long context window (10 million tokens for Scout). Since we used only 4 GPUs, we couldn't try out the full context length and managed only about 200,000 tokens, which is still higher than most other models.
Step 1: Install vLLM
pip install -U vllm
pip install flashinfer
Step 2: Run this script (based on a Llama cookbook) in a notebook or from the terminal
import os
from vllm import LLM, SamplingParams

# Read in our example file
def read_file_to_string(file_path):
    try:
        with open(file_path, "r") as file:
            content = file.read()
        return content
    except FileNotFoundError:
        print(f"File {file_path} not found.")
        return "File_Path_Error"

# Please remember to set `attn_temperature_tuning` to `True` for best long context performance
def load_llm():
    llm = LLM(
        model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
        enforce_eager=False,
        tensor_parallel_size=4,
        max_model_len=200000,
        override_generation_config={
            "attn_temperature_tuning": True,
        },
    )
    return llm

llm = load_llm()
file_content = read_file_to_string("/root/book.txt")
PROMPT = f"""Write a couple of paragraphs about Anne's house and her environs\n\n\n{file_content} """
print("Showing long content")
if len(file_content) > 100:
    print(file_content[:100])
else:
    print(file_content)

conversations = [
    [
        {
            "role": "user",
            "content": PROMPT
        }
    ],
]

# Create a sampling params object.
sampling_params = SamplingParams(temperature=1, top_p=0.95, max_tokens=4000)

# Remember to use the `chat` function and not `generate` :)
outputs = llm.chat(conversations, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Generated text: {generated_text}")
We tested a needle-in-the-haystack approach with a classic novel, Anne of Green Gables (word count of roughly 114,000).
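As a rough sanity check before running the prompt, you can count how many tokens the novel produces with the model's tokenizer and confirm it fits within the 200,000-token max_model_len configured above. This snippet is our own addition and assumes the book is still at /root/book.txt.

```python
# Count the tokens in the book to confirm it fits within max_model_len (200,000).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-4-Scout-17B-16E-Instruct")
with open("/root/book.txt", "r") as f:
    book = f.read()
print(f"Book length: {len(tokenizer.encode(book)):,} tokens")
```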
Prompt: Write a couple of paragraphs about Anne's house and her environs
Llama 4 Response:
How good is Llama 4 Scout, and how does it compare with Llama 3.2 11B?
We ran a series of vision-language (VL) prompts to put Llama 4 through its paces.

Prompt: How many types of berries are there in this image?
Llama 4 Response: Llama 4 incorrectly responded that there are 4 types of berries in the image.

Llama 3.2 Response: There are 3 types of berries in this image: raspberries, blueberries, and red currants. In this test, Llama 3.2 11B was more accurate than Llama 4.

Prompt: What is the mountain in this image?
Llama 4 Response: The model broadly identified the geography without naming the exact mountain peak.

Llama 3.2 Response: The mountain in this image is Kudremukh, a mountain range located in the Western Ghats of Karnataka, India. We see better accuracy from Llama 3.2, since this image depicts the Kudremukh peak.

Prompt: What is the maximum speed at this intersection?
Llama 4 Response: The speed limit at this intersection is 20. This information is clearly indicated on the sign to the right of the railroad crossing sign, which states "SPEED LIMIT 20".<|eot|>
Llama 3.2 Response: The maximum speed at this intersection is 20 miles per hour. Although no units are specified in the image, Llama 3.2 has the more accurate response because the image is from Alaska, USA where mph is used.

Prompt: What is the maximum speed on this road?
Llama 3.2 Response: The maximum speed on this road is 100 kilometers per hour.
Llama 4 Response: The maximum speed limit on this road is 100 kilometers per hour. This information is clearly indicated by the speed limit sign on the right side of the road, which features a red circle with a white interior and the number "100" in black text.

Prompt: Suggest a way to reduce the total cost of the project by 20%
Llama 4 Response:

**Step 1: Calculate the target cost and the current total cost**

The current total cost of the project is $72.31. To reduce the total cost by 20%, we need to calculate the target cost:

Target cost = Current total cost - (20% of current total cost) = $72.31 - (0.20 * $72.31) = $72.31 - $14.462 = $57.85

**Step 2: Analyze the cost of each item**

| Description | Cost (Auto calculated) |
| :---------- | :--------------------- |
| 3/4 in. x 2 ft. x 4 ft. Pine Plywood (2-Pack) (Base of the brdige) | $23.76 |
| 3 ft. x 1/4 in. Basswood Square Dowel (for sides and other small uses) | $5.52 |
| 3/8 in. x 3/8 in. Poplar Square Dowel (for sides and other small uses) | $5.88 |
| Syringe100ml and 120cm handy plastic tubing (hydraulic system) | $17.70 |
| Syringe 100CC/ML (hydraulic system) | $15.98 |
| 8 oz. Carpenter's Wood Glue | $3.47 |

**Step 3: Identify the most cost-effective way to reduce the total cost**

To reduce the total cost by 20%, we can consider the following options:

* Negotiate a discount on the most expensive items: The most expensive item is the 3/4 in. x 2 ft. x 4 ft. Pine Plywood (2-Pack) with a cost of $23.76. Negotiating a 20% discount on this item would save $4.75.
* Find cheaper alternatives: We can look for cheaper alternatives for the items. For example, we can consider using a different type of wood for the dowels or find a cheaper supplier for the syringes and plastic tubing.
* Reduce the quantity of items: We can review the quantity of each item and see if we can reduce it without affecting the project. For example, we can consider using one syringe and plastic tubing set instead of two.

**Step 4: Suggest a way to reduce the total cost by 20%**

One possible way to reduce the total cost by 20% is to find cheaper alternatives for the syringes and plastic tubing. We can consider using a different supplier or a similar product with a lower cost. For example, if we can reduce the cost of the Syringe100ml and 120cm handy plastic tubing (hydraulic system) from $17.70 to $10.00 and the cost of the Syringe 100CC/ML (hydraulic system) from $15.98 to $8.00, the total cost would be:

$23.76 + $5.52 + $5.88 + $10.00 + $8.00 + $3.47 = $56.63

This represents a reduction of $15.68 (from $72.31 to $56.63), which is approximately 21.7% of the original cost. Therefore, finding cheaper alternatives for the syringes and plastic tubing could be a way to reduce the total cost of the project by 20%.<|eot|>
Llama 3.2 Response:

To reduce the total cost of the project by 20%, we need to calculate the total cost and then find ways to lower it. The total cost is $72.31.

**Step 1: Calculate the 20% reduction amount**

$72.31 * 0.20 = $14.46

**Step 2: Find ways to reduce the cost**

The cost of the 8 oz. Carpenter's Wood Glue is $3.47. Reducing this item to $2.78 would save $0.69, which is more than the reduction amount needed.

**Step 3: Calculate the new total cost**

$72.31 - $0.69 = $71.62

By reducing the cost of the 8 oz. Carpenter's Wood Glue to $2.78, the total cost of the project can be reduced by 20%.

Llama 3.2 was incorrect in its calculations because it reduced the price of a single component instead of taking the whole project into account.
In this scenario, Llama 4 was better at combining math and visual analysis, showing an area of strength compared to Llama 3.2.
Prompt: Write a Python program that shows a ball bouncing inside a spinning hexagon. The ball should be affected by gravity and friction, but it must bounce off the rotating walls realistically.
The code generated by Llama 4 produced a zero-division error.


Prompt: How many people in this image?
Llama 4 Response: There are three people in this image. The first person is a woman with long brown hair, wearing a black blazer and maroon pants, sitting on a couch and working on a laptop. The second person is another woman with long dark hair, wearing a pink shirt and black pants, also sitting on a couch and working on a laptop. The third person is a man with short brown hair, wearing a blue suit jacket, standing by a counter and drinking from a cup.<|eot|>
Llama 3.2 Response: There are three people in this image. One person is standing and two are sitting.
In this scenario, we see that Llama 4 is unnecessarily verbose.

Prompt: Translate to English
Both Llama 4 and Llama 3.2 translated the German text correctly as “Dear Guest, Welcome!”
Overall, Llama 4 Scout delivered mixed performance in our tests: it does well at math but performs worse than Llama 3.2 in certain image analysis scenarios. However, Llama 4's massive context window opens up new avenues for applying multimodal and vision models.
We're looking forward to trying out Llama 4 Behemoth, a 288-billion-active-parameter model with 16 experts that is currently being trained.
Let your AI world span with Ori
Ori Global Cloud provides flexible infrastructure for any team, model, and scale. Backed by top-tier GPUs, performant storage, and AI-ready networking, Ori enables growing AI businesses and enterprises to deploy their AI models and applications in a variety of ways:
- Leverage GPU Instances as on-demand virtual machines.
- Deploy Private Clouds for flexible and secure enterprise AI.
- Operate Inference Endpoints effortlessly at any scale.
- Scale GPU Clusters for training and inference.
- Manage AI workloads on Serverless Kubernetes without infrastructure overhead.