Tutorials

How to run Mistral Small 3 on a cloud GPU with vLLM

Deepak Manoor
Posted: February 5, 2025
    Mistral Small 3

    Mistral AI has established itself as a key player in the generative AI space by developing open, portable, and highly customizable models. A defining strength of their model portfolio is the ability to deliver exceptional performance with relatively compact models, such as the Pixtral 12B and the Ministral series. With the launch of Mistral’s Small 3 model, they have taken it a step further by delivering a compact yet performant model for a variety of language, coding and math tasks.

    Mistral Small 3 supports multilingual inputs and output in 11 languages: English, French, German, Spanish, Italian, Chinese, Japanese, Korean, Portuguese, Dutch, and Polish. Mistral is looking to target a range of business use cases with this model: fraud detection, virtual customer service, sentiment analysis, feedback systems, customer triaging, and more.

    Here’s a quick rundown of Mistral Small 3 24B specifications:

    Mistral Small 3 24B
    Architecture: designed for low latency; Tekken tokenizer with a 131k vocabulary size
    Variants: pretrained base model and instruction-tuned
    Parameters: 24 billion
    Context window: 32k tokens
    Licensing: Apache 2.0 (commercial and non-commercial)

    Benchmark results shared by Mistral indicate that Small 3 outperforms GPT-4o mini and Gemma 2 27B, while coming close to Llama 3.3 70B and Qwen 2.5 32B.

    Mistral Small 3 Performance


    How to use Mistral Small 3 with vLLM on an Ori virtual machine

    Prerequisites

    Create a GPU virtual machine (VM) on Ori Global Cloud. We chose the NVIDIA H100 SXM GPU because the vLLM server for Mistral Small 3 Instruct in BF16 format needs about 71 GB of memory, covering model weights, PyTorch activations, and the KV cache. If you’re using smaller, quantized versions of the model, we recommend the NVIDIA L40S GPU for its strong inference performance.
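
    As a back-of-envelope check (our own arithmetic, not an official sizing guide), the BF16 weights alone account for roughly 48 GB of that 71 GB, with the remainder going to activations and KV cache:

    ```python
    # Rough GPU memory estimate for serving a 24B-parameter model in BF16.
    # This is a sketch; actual usage depends on vLLM's gpu_memory_utilization
    # setting, batch size, and context length.
    params = 24e9            # 24 billion parameters
    bytes_per_param = 2      # BF16 uses 2 bytes per parameter

    weights_gb = params * bytes_per_param / 1e9
    print(f"Model weights: ~{weights_gb:.0f} GB")   # ~48 GB

    # The ~71 GB observed in practice leaves the rest for activations and KV cache.
    overhead_gb = 71 - weights_gb
    print(f"Activations + KV cache headroom: ~{overhead_gb:.0f} GB")
    ```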

    Quick tip

    Use the init script when creating the VM so that NVIDIA CUDA drivers, frameworks such as PyTorch and TensorFlow, and Jupyter notebooks are preinstalled for you.

    Step 1: SSH into your VM, install Python and create a virtual environment

    ```bash
    apt install python3.11-venv
    python3.11 -m venv mistral-env
    ```

    Activate the virtual environment

    ```bash
    source mistral-env/bin/activate
    ```

    Step 2: Install the Hugging Face CLI if you haven’t already, and log in. Make sure to have your Hugging Face token ready and to have accepted the model’s terms and conditions.

    ```bash
    pip install -U "huggingface_hub[cli]"
    huggingface-cli login
    ```

    Step 3: Install vLLM and update the tokenizer library

    ```bash
    pip install vllm
    pip install --upgrade mistral_common
    ```

    Step 4: Load the model and run the vLLM server. We chose the Mistral Small 24B Instruct variant for its better performance. 

    ```bash
    vllm serve "mistralai/Mistral-Small-24B-Instruct-2501"
    ```

    The default port for vLLM is 8000; you can use the --port option to change it.

    Note: it’ll take a while to load all the model shards the first time.

    Step 5: Query the model in another terminal window and use the JSON utility jq to format model responses

    ```bash
    sudo apt install jq
    curl -s "http://localhost:8000/v1/chat/completions" \
         -H "Content-Type: application/json" \
         -d '{
            "model": "mistralai/Mistral-Small-24B-Instruct-2501",
            "temperature": 0.15,
            "messages": [
               {"role": "user", "content": "How many times does the letter r appear in the word strawberry?"}
            ]
        }' | jq -r '.choices[0].message.content'
    ```
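
    If you’d rather query the server from Python, the same request can be sent with the standard library (the URL assumes the default port from Step 4, and the payload mirrors the curl call above):

    ```python
    import json
    from urllib.request import Request, urlopen

    # vLLM exposes an OpenAI-compatible chat completions endpoint.
    URL = "http://localhost:8000/v1/chat/completions"

    payload = {
        "model": "mistralai/Mistral-Small-24B-Instruct-2501",
        "temperature": 0.15,
        "messages": [
            {"role": "user",
             "content": "How many times does the letter r appear in the word strawberry?"}
        ],
    }

    def query(url: str = URL) -> str:
        """POST the chat payload and return the assistant's reply text."""
        req = Request(url, data=json.dumps(payload).encode(),
                      headers={"Content-Type": "application/json"})
        with urlopen(req, timeout=120) as resp:
            return json.load(resp)["choices"][0]["message"]["content"]

    # Requires the vLLM server from Step 4 to be running:
    # print(query())
    ```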

    How good is Mistral Small 3?

    We threw a few verbal, math, and coding reasoning challenges at Mistral Small 3 24B. The model did quite well on verbal intelligence, but at times it struggled with our coding and math questions.

    Prompt: What is the best French cheese?

    Mistral Small 3 handled this subjective prompt quite well: it laid out some of the options and their strengths, and let the person asking decide based on those attributes.

    Mistral Small Verbal

    Prompt: Translate this sentence to French, German and Italian: “Visit our website to learn more about end-to-end AI infrastructure. Go to www.ori.co.”

    Mistral language

    The model understood the question correctly and translated the statement with reasonable accuracy.

    Prompt: How many ‘r’s in “strawberry”?

    The model got this question wrong: the word strawberry has 3 ‘r’s.
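
    For reference, a one-line check confirms the count the model missed:

    ```python
    # Count occurrences of the letter 'r' in "strawberry".
    word = "strawberry"
    r_count = word.count("r")
    print(r_count)  # 3
    ```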

    Mistral Strawberry

    Prompt: How many ‘l’s in “strawberry”?

    Mistral Small Strawberry

    Mistral Small 3 answered this prompt incorrectly too.

    Prompt: Write the game Snake in Python

    Creating a simple version of the classic Snake game in Python can be a fun and educational project. We'll use the `tkinter` library for the graphical interface and `random` for generating food positions.

    Below is a basic implementation of Snake using `tkinter`

    ```python
    import tkinter as tk
    import random

    class SnakeGame:
        def __init__(self, master):
            self.master = master
            self.master.title("Snake Game")
            self.master.resizable(False, False)
            self.canvas = tk.Canvas(self.master, width=400, height=400, bg="black")
            self.canvas.pack()
            self.snake = [(100, 50), (90, 50), (80, 50)]
            self.food = self.spawn_food()
            self.direction = "Right"
            self.canvas.bind_all("<KeyPress>", self.change_direction)
            self.game_loop()

        def change_direction(self, event):
            if event.keysym == "Up" and self.direction != "Down":
                self.direction = "Up"
            elif event.keysym == "Down" and self.direction != "Up":
                self.direction = "Down"
            elif event.keysym == "Left" and self.direction != "Right":
                self.direction = "Left"
            elif event.keysym == "Right" and self.direction != "Left":
                self.direction = "Right"

        def game_loop(self):
            head = self.snake[0]
            if self.direction == "Up":
                new_head = (head[0], head[1] - 10)
            elif self.direction == "Down":
                new_head = (head[0], head[1] + 10)
            elif self.direction == "Left":
                new_head = (head[0] - 10, head[1])
            elif self.direction == "Right":
                new_head = (head[0] + 10, head[1])

            self.snake.insert(0, new_head)

            # Check for collisions
            if (
                new_head in self.snake[1:]  # Self collision
                or new_head[0] < 0
                or new_head[0] >= 400
                or new_head[1] < 0
                or new_head[1] >= 400
            ):
                self.game_over()
                return

            if new_head == self.food:
                self.food = self.spawn_food()
            else:
                self.snake.pop()

            self.canvas.delete("all")
            self.draw_snake()
            self.canvas.create_oval(self.food[0], self.food[1], self.food[0] + 10, self.food[1] + 10, fill="red")

            self.master.after(100, self.game_loop)

        def draw_snake(self):
            for segment in self.snake:
                self.canvas.create_rectangle(
                    segment[0], segment[1], segment[0] + 10, segment[1] + 10, fill="white"
                )

        def spawn_food(self):
            while True:
                food = (random.randint(0, 39) * 10, random.randint(0, 39) * 10)
                if food not in self.snake:
                    return food

        def game_over(self):
            self.canvas.create_text(200, 200, text="Game Over", font=("Helvetica", 24), fill="white")

    if __name__ == "__main__":
        root = tk.Tk()
        game = SnakeGame(root)
        root.mainloop()
    ```

    Prompt: What is Agentic AI? Explain like I’m five.

    Mistral Agentic AI

    Prompt: Write the Tetris game in Python.

    The code generated by Mistral failed to run on the first pass.

    Prompt: Give me a concise summary of the Collatz conjecture.

    Mistral Collatz Conjecture
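
    For context on what the prompt asks about: the conjecture says that repeatedly halving even numbers and mapping an odd n to 3n + 1 always reaches 1 eventually. A small sketch (our own illustration, not the model’s output):

    ```python
    def collatz_steps(n: int) -> int:
        """Number of steps for n to reach 1 under the Collatz map."""
        steps = 0
        while n != 1:
            n = n // 2 if n % 2 == 0 else 3 * n + 1
            steps += 1
        return steps

    print(collatz_steps(6))   # 8 steps: 6 -> 3 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1
    print(collatz_steps(27))  # a famously long trajectory: 111 steps
    ```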

    Prompt: Compute the area of the region enclosed by the graphs of the given equations “y=x, y=2x, and y=6-x”. Use vertical cross-sections.

    Problem source: UC Davis

    Mistral Small Math1

    Mistral Small Math2

    The answer provided by Mistral Small 3 is incorrect. The correct answer is 3 (or 3 square units).
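
    The correct value is easy to verify with vertical cross-sections: on [0, 2] the region lies between y = x and y = 2x, and on [2, 3] between y = x and y = 6 − x. A quick numeric check (our own verification, not the model’s work):

    ```python
    def riemann(f, a, b, n=10_000):
        """Midpoint-rule approximation of the integral of f over [a, b]."""
        h = (b - a) / n
        return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

    # On [0, 2] the top curve is y = 2x and the bottom is y = x.
    area1 = riemann(lambda x: 2 * x - x, 0, 2)      # exact value: 2
    # On [2, 3] the top curve is y = 6 - x and the bottom is y = x.
    area2 = riemann(lambda x: (6 - x) - x, 2, 3)    # exact value: 1
    print(round(area1 + area2, 6))  # 3.0
    ```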

    Prompt: Quelles sont les sept merveilles du monde? (“What are the seven wonders of the world?”)

    Mistral Translation

    Overall, our perception of Mistral Small 3 was that it excelled at verbal analysis and translation, but delivered mixed results for coding and math prompts.

    Chart your own AI reality with Ori

    Ori Global Cloud provides flexible infrastructure for any team, model, and scale. Backed by top-tier GPUs, performant storage, and AI-ready networking, Ori enables growing AI businesses and enterprises to deploy their AI models and applications in a variety of ways:

    • GPU instances, on-demand virtual machines backed by top-tier GPUs to run AI workloads.
    • Inference Endpoints to run and scale your favorite open source models with just one click.
    • GPU Clusters to train and serve your most ambitious AI models.
    • Serverless Kubernetes to run inference at scale without having to manage infrastructure.
    • Private Cloud provides a single-tenant environment, top-notch performance and secure infrastructure for enterprise AI builds.
