
How to run Pixtral 12B on a cloud GPU with vLLM

Deepak Manoor
Posted: September 19, 2024

    In our previous tutorial, we explained how to run an AI image generator on an Ori cloud GPU to create images from a text prompt. This tutorial demonstrates another paradigm involving images and text: how to use text prompts to understand and interpret images. Over the past couple of years, Mistral AI has emerged as one of the key providers of open source large language models (LLMs), with their Instruct v0.2 model topping more than a million downloads. Mistral recently announced their first multimodal model, Pixtral 12B, which is trained to understand both images and text. Here’s a quick rundown of Pixtral 12B specifications:

    Attribute          Pixtral 12B
    Architecture       12B parameter Multimodal Decoder + 400M parameter Vision Encoder
    Image attributes   Variable size and aspect ratios
    Sequence length    128k tokens
    Licensing          Apache 2.0

    One of Pixtral’s key strengths is that it doesn’t compromise on key text capabilities such as instruction following, coding, and math, while excelling at image analysis at the same time. The native multimodal capability stems from how Pixtral was trained, with interleaved image and text data. Performance benchmarks shared by Mistral portray strong multimodal performance when compared to several other leading LLMs. 

    Pixtral Benchmark Performance

    Source: Mistral Blog

    Pixtral enables generative AI to be applied to a wide range of use cases, including the following and many more:

    • Optical character recognition (OCR), which is used in validating financial and legal documents, CAPTCHA security systems, etc.
    • Image captioning, which is often used in e-commerce, social networking, and many other internet applications.
    • Information extraction from images, such as identifying number plate data or linking product images to SKUs.
    • Assistive technology for blind and visually impaired users.
    • Analyzing and mapping complex technical images.


    How to run Pixtral on an Ori virtual machine

    Prerequisites

    We’ll use vLLM to serve Pixtral for this demo. Create a GPU virtual machine (VM) on Ori Global Cloud. We chose the NVIDIA H100 PCIe with 80 GB VRAM and 380 GiB of system memory because vLLM needs about 24 GB of VRAM to load the model, plus some more memory for the graph. We recommend using an NVIDIA A100 or H100 GPU to run this model. For the operating system, we chose Ubuntu 22.04, although Debian is also an option.
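
The 24 GB figure is consistent with a simple back-of-the-envelope calculation. The sketch below assumes the weights are held in 16-bit precision (2 bytes per parameter), which is our assumption rather than something stated above, and it ignores the KV cache and activation memory entirely.

```python
# Rough estimate of the VRAM needed just to hold the model weights.
# Assumption: 16-bit weights (2 bytes per parameter); KV cache,
# activations and graph memory are NOT included.
DECODER_PARAMS = 12e9    # 12B parameter multimodal decoder
VISION_PARAMS = 400e6    # 400M parameter vision encoder
BYTES_PER_PARAM = 2      # assumed bf16/fp16 storage


def weight_memory_gb(num_params: float, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Return the weight footprint in gigabytes (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


decoder_gb = weight_memory_gb(DECODER_PARAMS)
total_gb = weight_memory_gb(DECODER_PARAMS + VISION_PARAMS)
print(f"Decoder weights:           {decoder_gb:.1f} GB")
print(f"Decoder + vision encoder:  {total_gb:.1f} GB")
```

Anything the GPU has beyond this is what remains for the KV cache and runtime overhead, which is why an 80 GB card leaves comfortable headroom.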

    Quick tip

    Do not use the Init script when creating the VM if you plan to use vLLM, because the libraries in vLLM are compiled with CUDA 12.1, whereas the Init script installs the newer version, CUDA 12.6.

    Step 1: Once you SSH into your VM, install CUDA 12.1 and the NVIDIA drivers

    CUDA 12.1:

    Bash/Shell
    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
    sudo dpkg -i cuda-keyring_1.1-1_all.deb
    sudo apt-get update
    sudo apt-get -y install cuda-toolkit-12-1

    NVIDIA drivers:

    Bash/Shell
    sudo add-apt-repository ppa:graphics-drivers/ppa --yes && sudo apt update && sudo apt install -y nvidia-driver-555
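
Since the CUDA version matters for vLLM here, it can be worth confirming which toolkit release is actually installed. The following is a minimal sketch that parses the output of `nvcc --version`. The helper names are our own, and it assumes `nvcc` is on your `PATH`, which may not be the case straight after installation.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_cuda_release(nvcc_output: str) -> Optional[str]:
    """Extract the 'release X.Y' version string from `nvcc --version` output."""
    match = re.search(r"release (\d+\.\d+)", nvcc_output)
    return match.group(1) if match else None


def installed_cuda_release() -> Optional[str]:
    """Return the release reported by nvcc, or None if nvcc is not on PATH."""
    nvcc = shutil.which("nvcc")
    if nvcc is None:
        return None
    result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
    return parse_cuda_release(result.stdout)


if __name__ == "__main__":
    release = installed_cuda_release()
    if release is None:
        print("nvcc not found on PATH; check where the toolkit was installed")
    elif release != "12.1":
        print(f"Found CUDA {release}, but this tutorial expects 12.1")
    else:
        print("CUDA 12.1 toolkit detected")
```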

    You do not need to install PyTorch separately, as it is bundled with vLLM.

    Step 2: If you haven’t done it already, install Python and create a virtual environment

    Bash/Shell
    sudo apt install python3.10-venv
    python3.10 -m venv pixtral-env
    source pixtral-env/bin/activate

    Step 3: Install the Hugging Face CLI if you haven’t before, and log in

    Bash/Shell
    pip install -U "huggingface_hub[cli]"
    huggingface-cli login

    Step 4: Install vLLM and upgrade mistral_common to the latest version

    Bash/Shell
    pip install vllm
    pip install --upgrade mistral_common
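
Before moving on, a quick sanity check of the environment can save debugging time later. The sketch below uses only the standard library to report whether a virtual environment is active and which versions of the two packages are installed. The helper names are ours.

```python
import sys
from importlib import metadata
from typing import Optional


def in_virtualenv() -> bool:
    """True when the interpreter is running inside a venv."""
    return sys.prefix != sys.base_prefix


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("virtual environment active:", in_virtualenv())
for package in ("vllm", "mistral_common"):
    print(package, "->", installed_version(package) or "not installed")
```

If either package shows as "not installed", the most likely cause is that `pip` ran outside the activated `pixtral-env` environment.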

    Step 5: Create a Python (.py) file with the following code to run vLLM from the terminal

    Python
    from vllm import LLM
    from vllm.sampling_params import SamplingParams

    model_name = "mistralai/Pixtral-12B-2409"

    # Cap the length of the generated response
    sampling_params = SamplingParams(max_tokens=8192)

    # Load Pixtral using the Mistral tokenizer
    llm = LLM(model=model_name, tokenizer_mode="mistral")

    prompt = "Describe this image in detail."
    image_url = "https://cdn.pixabay.com/photo/2017/03/07/22/17/cabin-2125387_1280.jpg"

    # A single user turn that carries both the text prompt and the image
    messages = [
        {
            "role": "user",
            "content": [{"type": "text", "text": prompt}, {"type": "image_url", "image_url": {"url": image_url}}]
        },
    ]

    outputs = llm.chat(messages, sampling_params=sampling_params)

    # Print the text of the first completion for the first request
    print(outputs[0].outputs[0].text)
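
The script above points at a public image URL. To query an image stored on the VM instead, one option is to embed it as a base64 data URL in the same message structure. The sketch below only builds the message and does not call vLLM. The helper names are hypothetical, and whether `llm.chat` accepts data URLs depends on your vLLM version, so treat that part as an assumption to verify.

```python
import base64


def image_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a data URL."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_messages(prompt: str, image_url: str) -> list:
    """Build a single user turn with a text part and an image part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]


# Stand-in bytes so the example runs anywhere; in practice you would read
# them from a file, e.g. open("photo.jpg", "rb").read() (hypothetical path).
fake_image = b"\xff\xd8\xff"
messages = build_messages("Describe this image in detail.", image_data_url(fake_image))
print(messages[0]["content"][1]["image_url"]["url"])
```

The resulting `messages` list has the same shape as the one passed to `llm.chat` in Step 5.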

    Alternative ways to run Pixtral on the cloud

    Mistral-inference

    Use the mistral-inference library to run Pixtral on a cloud GPU or on a local machine. Mistral has shared a code example here.

    Jupyter notebook

    Install Jupyter Notebook and spin up a notebook on a port of your choice; we specified 8889 here.

    Bash/Shell
    pip3 install notebook
    jupyter notebook --port 8889 --allow-root --no-browser --ip=0.0.0.0

    The command prints a URL that points to localhost. Replace the localhost string with your VM’s IP address to access the notebook from your browser. Create a new notebook and run the Python code in it.
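
The localhost-to-IP substitution can be done by hand, but here is a small sketch of it in code. The function name, the token, and the IP address (taken from a range reserved for documentation) are all placeholders.

```python
from urllib.parse import urlsplit, urlunsplit


def with_vm_host(jupyter_url: str, vm_ip: str) -> str:
    """Swap the host in a Jupyter URL for the VM's IP, keeping port, path and token."""
    parts = urlsplit(jupyter_url)
    port = f":{parts.port}" if parts.port else ""
    return urlunsplit(parts._replace(netloc=f"{vm_ip}{port}"))


# Placeholder URL in the shape Jupyter prints; the token is made up.
local_url = "http://localhost:8889/tree?token=abc123"
print(with_vm_host(local_url, "203.0.113.10"))
```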

    How good is Pixtral?

    We threw a variety of image questions at Pixtral, ranging from descriptions and dominant colors to the identification of people, birds, and animals. Pixtral aced these prompts apart from the odd occasion where it got a few details wrong, and in only one instance did it fail to summarize the result in the number of sentences specified in the prompt.

    Here are some results of queries about open source images:

    See Pixtral 12B in action

    Prompt: Describe this image in detail

    Pixtral Multimodal Example

    Image Source: Pixabay

    Result: The image captures a serene winter scene at a lake. The lake, a deep shade of blue, is the central focus of the image. It's surrounded by a forest of evergreen trees, their branches heavy with snow, creating a beautiful contrast against the clear sky. On the shore of the lake, there's a small wooden cabin. Its brown roof and white walls stand out against the snowy landscape. The cabin is situated on the right side of the image, with a red boat docked next to it, adding a pop of color to the otherwise monochrome scene. The sky above is a light blue, with a few clouds scattered across it. The entire scene is a harmonious blend of natural elements, each contributing to the peaceful ambiance of the winter landscape.

    Prompt: Identify the bird in the image.

    Pixtral AI Image

    Image Source: NPS

    Result: The bird in the image appears to be a ptarmigan. This species is known for its plumage which changes with the seasons. In the winter, ptarmigan birds have white feathers to blend in with the snowy environment, while in the summer, their feathers change to a more brown or gray color to blend in with the rocky and grassy terrain. The bird shown in the image has white feathers, suggesting it is in a winter camouflage mode. Ptarmigans are often found in arctic and alpine regions and are well-adapted to cold environments.

    Prompt: Identify the animal in the image.

    Pixtral AI Image

    Image Source: NPS

    Result: The animal in the image is a mountain goat. 

    Prompt: What is the dominant color in this image?

    Pixtral Color Classification AI

    Image Source: NPS

    Result: The dominant color in this image is white, which is mostly due to the extensive snow coverage.

    Prompt: How many people in this image? 

    Pixtral Multimodal Performance

    Image Source: Pixabay

    Result: There are three people in the image.

    Prompt: What colored shirt is the person in this image wearing? 

    Mistral Pixtral

    Image Source: Pixabay

    Result: The person in the image is wearing a black shirt.

    Prompt: Translate to English

    Pixtral Translation

    Image Source: Pixabay

    Result: Sure, here is the translation of the text from German to English:

    "The text reads:
    'Dear Guest...
    Youare welcome!'"

    So, the whole translation is:
    "Dear Guest...
    You are welcome!"

    Although the model excelled with the majority of the queries, it fell short on the following prompt, where it took two sentences to describe an image instead of the one sentence we asked for.

    Prompt: Describe this image in one sentence

    Pixtral 12B AI

    Image Source: NPS

    Result: The image depicts a rugged and rocky landscape dominated by a towering mountain range, with a mix of barren, steeper slopes and greener, gentler foothills bathed in sunlight and shadow. The valleys below appear muddy and wet, suggesting a riverbed or recent water activity, under a partly cloudy sky with patches of blue.
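
A length constraint like this one is easy to check automatically when testing many prompts. The sketch below uses a deliberately naive sentence splitter (our own helper, not part of vLLM or any Mistral library) and applies it to the result above. It would miscount text containing abbreviations such as "e.g." or decimal numbers.

```python
import re


def count_sentences(text: str) -> int:
    """Naively count sentences by splitting after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return len([part for part in parts if part])


result = (
    "The image depicts a rugged and rocky landscape dominated by a towering "
    "mountain range, with a mix of barren, steeper slopes and greener, gentler "
    "foothills bathed in sunlight and shadow. The valleys below appear muddy "
    "and wet, suggesting a riverbed or recent water activity, under a partly "
    "cloudy sky with patches of blue."
)

requested = 1
actual = count_sentences(result)
print(f"requested {requested} sentence(s), got {actual}")
```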
