Tutorials

Accelerate Llama 3.1 8B Instruct Inference with TensorRT LLM

Posted: January 3, 2025
    Ori Tutorials

    There’s so much hype around the inference speeds achieved by TensorRT LLM, but it’s tough to know where to get started when optimising your own LLM deployment. Here we provide a complete guide to building a TensorRT LLM engine and deploying an API to batch requests on Ori’s Virtual Machines.

    An Introduction to TensorRT and the LLM API

    TensorRT is an SDK developed by NVIDIA for high-performance deep learning inference. It provides optimisations, a runtime, and deployment tools that significantly accelerate AI applications, particularly when running on NVIDIA GPUs. NVIDIA reports that "TensorRT-based applications perform up to 36X faster than their CPU-only platform during inference." Third-party benchmarks have also verified that TensorRT outperforms other inference engines, such as the last engine we benchmarked with BeFOri: vLLM.

    Key Features of TensorRT:

    • Compiled Inferencing Engine: Developers are able to compile a model into an optimised C++ TensorRT engine through a simple Python library (without interacting with C++) that runs much faster than the raw weights.
    • Optimised Inferencing Performance: Developers are able to apply optimisation techniques such as quantization, layer and tensor fusion, and kernel tuning through parameters when building the TensorRT engine.
    • Automated Dynamic Batching: Developers can rely on the TensorRT engine to efficiently manage memory and handle varying input sizes and batch dimensions, enabling autoscaling.

    Together, these features deliver high-throughput, low-latency inference, giving you both very fast results and potentially substantial cost savings.
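    As an illustration of how these build-time options are exposed, the sketch below requests weight-only quantisation when compiling the engine through the LLM API. The QuantConfig and QuantAlgo names and their import path are assumptions based on the LLM API quantisation example in the TensorRT-LLM repo and may differ between tensorrt_llm releases, so treat this as a sketch and check that example for calibration details.

    from tensorrt_llm import LLM
    from tensorrt_llm.llmapi import QuantAlgo, QuantConfig  # assumed import path; check your release

    # Assumption: 4-bit AWQ weight-only quantisation is available for this model
    quant_config = QuantConfig(quant_algo=QuantAlgo.W4A16_AWQ)

    # The engine is compiled when the LLM object is constructed;
    # the quantisation choice is simply a build parameter.
    llm = LLM(
        model="meta-llama/Meta-Llama-3.1-8B-Instruct",
        quant_config=quant_config,
    )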

    The TensorRT LLM API

    The TensorRT SDK covers text, audio, and video modalities, but here we will focus on optimising text-to-text generation using the LLM class in the tensorrt_llm library. The API is currently under development and the documentation is sparse, but we were able to modify the Generate Text in Streaming example provided in the TensorRT-LLM git repo to successfully deploy an API wrapping the TensorRT engine for Llama 3.1 8B Instruct.
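    Before wrapping the engine in an API, it helps to see the LLM class on its own. The minimal sketch below compiles the model and generates a small batch of prompts offline; the SamplingParams fields shown are assumptions drawn from the LLM API examples and may vary by release.

    from tensorrt_llm import LLM, SamplingParams

    # Constructing the LLM object downloads the weights (if needed) and
    # compiles them into a TensorRT engine for the local GPU.
    llm = LLM(model="meta-llama/Meta-Llama-3.1-8B-Instruct")

    prompts = [
        "Explain what a TensorRT engine is in one sentence.",
        "List three benefits of batched inference.",
    ]

    # Assumed SamplingParams fields: max_tokens and temperature
    sampling_params = SamplingParams(max_tokens=128, temperature=0.7)

    # Passing a list of prompts generates them as a single batch
    for output in llm.generate(prompts, sampling_params=sampling_params):
        print(output.outputs[0].text)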

    Limitations of TensorRT

    As mentioned, the TensorRT LLM API is currently under development, and there may be breaking changes in the future. There is no doubt NVIDIA will roll out some exciting enhancements in 2025, but for now, these are the challenges we ran into.

    Limited Model Support

    At the time of writing the supported models include:

    • Llama (including variants Mistral, Mixtral, InternLM)
    • GPT (including variants Starcoder-1/2, Santacoder)
    • Gemma-1/2
    • Phi-1/2/3
    • ChatGLM (including variants glm-10b, chatglm, chatglm2, chatglm3, glm4)
    • QWen-1/1.5/2
    • Falcon
    • Baichuan-1/2
    • GPT-J
    • Mamba-1/2

    TensorRT requires significant effort to optimise and validate models for performance and precision across its supported quantisations. Each new model must be tailored to ensure compatibility with TensorRT's kernel operations and its inference engine, which requires substantial time and resources.

    Lack of Documentation and Tutorials for TensorRT-LLM

    The TensorRT-LLM git repo contains a multitude of examples, primarily organised by model; however, the lack of documentation, the dozens of command line arguments, and the thousands of lines of messy code make it tough to decipher.

    It’s best to start with the TensorRT-LLM/examples/llm-api directory if your model is supported by the LLM API; otherwise you’ll need to navigate to the model’s directory under TensorRT-LLM/examples/ and work through the steps in the Quick Start Guide to:

    1. Compile the Model into a TensorRT Engine
      1. Convert the checkpoint
      2. Build the engine
    2. Run the model
    3. Deploy the model

    However, if your model is supported by the LLM API, you're in luck: read on and follow the tutorial provided below.

    Streaming Batch Responses

    It appears the streaming and batching functionalities are not compatible at this time. While the Generate Text in Streaming example provided in the TensorRT-LLM git repo claims the results will print out one token at a time, this was not the behaviour we observed when running it ourselves. You can pass the parameter streaming=True to the TensorRT runner.generate() function and successfully generate a response, but there does not appear to be built-in functionality to consume those tokens as they are streamed back.

    The TensorRT engine does not accept a streamer parameter, such as the TextIteratorStreamer from the transformers library that is commonly used to consume streaming responses from Hugging Face models, as you might expect. This makes it challenging to consume streaming responses, especially when a batch of requests is generated concurrently.
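    For contrast, this is the kind of consumption pattern we were hoping for, shown here with a plain transformers model rather than a TensorRT engine; nothing equivalent appears to be exposed by the engine at the time of writing. The model ID and generation settings are purely illustrative.

    from threading import Thread

    from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

    model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    inputs = tokenizer("It's finally working! Now ", return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)

    # generate() runs in a background thread while the main thread
    # consumes tokens from the streamer as they arrive
    thread = Thread(
        target=model.generate,
        kwargs={**inputs, "streamer": streamer, "max_new_tokens": 64},
    )
    thread.start()
    for token_text in streamer:
        print(token_text, end="", flush=True)
    thread.join()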

    There are currently 16 open issues related to streaming across the repo, so this functionality is clearly still under development. For the time being, we moved forward with a batch inference tutorial instead.

    TensorRT LLM Tutorial:

    You will need to sign up to Ori's Public Cloud, and request access to the Meta Llama3.1 models on Hugging Face before completing these steps.

    1. Create a VM on Ori's Public Cloud

    Log into the Ori Console, navigate to the Virtual Machines page and create a new instance. When you reach the option to add an init script, copy and paste the appropriate script from below:

    Init Script for H100 SXM VM:


    #!/bin/bash

    sudo apt update && \
    sudo apt upgrade -y && \
    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb && \
    sudo dpkg -i cuda-keyring_1.1-1_all.deb && \
    sudo apt-get update && sudo apt-get -y install cuda-toolkit-12-1 && \
    sudo apt-get install -y nvidia-driver-555-open && \
    sudo apt-get install -y cuda-drivers-555 && \
    echo "blacklist nvidia_uvm" | sudo tee /etc/modprobe.d/nvlink-denylist.conf && \
    echo "options nvidia NVreg_NvLinkDisable=1" | sudo tee /etc/modprobe.d/disable-nvlink.conf && \
    sudo apt install -y nvidia-cuda-toolkit && \
    sudo update-initramfs -u && \
    sudo apt upgrade -y && \
    sudo reboot

    Init Script for A100 VM:

    #!/bin/bash

    sudo apt update && \
    sudo apt upgrade -y && \
    wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb && \
    sudo dpkg -i cuda-keyring_1.1-1_all.deb && \
    sudo apt-get update && sudo apt-get -y install cuda-toolkit-12-1 && \
    sudo apt-get install -y nvidia-driver-555-open && \
    sudo apt-get install -y cuda-drivers-555 && \
    sudo apt install -y nvidia-cuda-toolkit && \
    sudo update-initramfs -u && \
    sudo apt upgrade -y && \
    sudo reboot

    It will take up to 10 minutes for your machine to be provisioned and become available.

    2. Install Dependencies

    You can copy the ssh command directly from the Ori Console to connect to your machine, and then run the following commands:

    # Verify init script installation
    nvidia-smi
    nvcc --version
    cat /proc/driver/nvidia/version
    nvidia-smi -q | grep -A5 Fabric
    # Expect NAs in response

    # Setup venv and activate
    sudo apt install python3.10-venv && python3 -m venv tensorrt
    source tensorrt/bin/activate

    # Python Package Index Installation
    python3 -m pip install --upgrade pip
    python3 -m pip install wheel && python3 -m pip install --upgrade tensorrt

    # Install TensorRT
    wget https://developer.nvidia.com/downloads/compute/machine-learning/tensorrt/10.5.0/local_repo/nv-tensorrt-local-repo-ubuntu2204-10.5.0-cuda-11.8_1.0-1_amd64.deb && \
    sudo dpkg -i nv-tensorrt-local-repo-ubuntu2204-10.5.0-cuda-11.8_1.0-1_amd64.deb && \
    sudo cp /var/nv-tensorrt-local-repo-ubuntu2204-10.5.0-cuda-11.8/*-keyring.gpg /usr/share/keyrings/ && \
    sudo apt-get update && \
    sudo apt-get install tensorrt

    sudo apt install libmpich-dev && sudo apt-get install libopenblas-base libopenmpi-dev libomp-dev

    sudo apt install python3-dev

    # Enter "Y" when prompted for permission

    To install the required python libraries, create a requirements.txt file containing the following:

    fastapi==0.115.4
    huggingface-hub
    ray==2.11.0
    ray[serve]==2.11.0
    tensorrt_llm==0.15.0.dev2024111200
    torch==2.5.1
    transformers==4.43.4
    wheel==0.43.0

    Then install the libraries and log into the Hugging Face CLI using the access token associated with the account you used to request access to the Llama 3.1 models:

    pip install -r requirements.txt

    huggingface-cli login --token "<your-access-token>"

    3. Create a FastAPI App to Wrap the TensorRT LLM Engine

    Create a Python file called deploy_tensorrt_engine.py that contains:

    from tensorrt_llm import LLM, SamplingParams
    import logging
    from fastapi import FastAPI, HTTPException
    from ray import serve
    from itertools import islice
    from typing import Dict
    import uuid
    import time
    import threading

    logger = logging.getLogger("ray.serve")

    fastapi_app = FastAPI()


    @serve.deployment(ray_actor_options={"num_gpus": 1})
    @serve.ingress(fastapi_app)
    class DeployTRTEngine:

        def __init__(self, model_id: str, ccr: int, batch_time=1.0):
            self.model = LLM(model=model_id)

            self.queue = {}
            self.statuses = {}
            self.outputs = {}

            self.ccr = ccr
            self.batch_time = batch_time
            self.timer = 0

        @fastapi_app.post("/")
        def handle_request(self, prompt: str):

            # If the queue is empty then (re)start the timer
            queue_len = len(self.queue)
            if queue_len == 0:
                self.timer = time.time()

            # Generate a unique ID for the request and set up tracking
            task_id = str(uuid.uuid4())

            self.queue[task_id] = prompt
            queue_len += 1
            self.statuses[task_id] = "in queue"

            # If we have the desired number of concurrent requests
            # or the batch time has passed then start generating
            if queue_len >= self.ccr or time.time() - self.timer > self.batch_time:
                # make a dictionary of prompts that contains the desired number of
                # concurrent requests or fewer
                prompt_len = min(self.ccr, queue_len)
                prompts_dict = dict(islice(self.queue.items(), prompt_len))

                # remove them from the queue
                self.queue = dict(islice(self.queue.items(), prompt_len, None))

                # update statuses
                self.statuses = {
                    key: ("in progress" if key in prompts_dict else value)
                    for key, value in self.statuses.items()
                }

                # Start a background thread to process the batch
                threading.Thread(
                    target=self.generate_text, kwargs={"prompts": prompts_dict}
                ).start()

            return {"task_id": task_id}

        def generate_text(self, prompts: Dict[str, str]):
            prompt_list = list(prompts.values())

            # Generate outputs
            raw_outputs = self.model.generate(prompt_list)

            # Process outputs
            for _output in raw_outputs:
                _task_id, input_prompt = next(iter(prompts.items()))
                prompts.pop(_task_id)
                self.outputs[_task_id] = {
                    "prompt": input_prompt,
                    "text": _output.outputs[0].text,
                    "token_len": len(_output.outputs[0].token_ids),
                }

                self.statuses[_task_id] = "complete"

        @fastapi_app.get("/response/{task_id}")
        def get_response(self, task_id: str):
            # Get the status of the task; if the task id is not found raise an error
            try:
                status = self.statuses[task_id]
            except KeyError:
                raise HTTPException(status_code=404, detail="Task ID not found")

            # Return 202 if it's not done generating yet
            if status in ["in queue", "in progress"]:
                raise HTTPException(status_code=202, detail=f"Task is {status}.")

            ret = self.outputs.pop(task_id)
            self.statuses.pop(task_id)
            return ret


    app = DeployTRTEngine.bind("meta-llama/Meta-Llama-3.1-8B-Instruct", 2, 2.0)

    You may need to adjust line 16, `@serve.deployment(ray_actor_options={"num_gpus": 1})`, to ensure the correct number of GPUs is made available to the app.

    You can also modify the last line to adjust the parameters (see the example after this list):

    • model_id can be changed from "meta-llama/Meta-Llama-3.1-8B-Instruct" to the Hugging Face model ID of any supported model
    • ccr can be changed from 2 to the desired number of concurrent requests; it acts as a maximum limit on the batch size that will be sent to the engine for concurrent generation
    • batch_time can be changed from 2.0 to the number of seconds to wait for additional requests before sending a (possibly partial) batch of prompts to the engine for concurrent generation
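    For example, to serve a different supported model with larger batches and a shorter batching window, the last line could look like the snippet below (the model ID is purely illustrative):

    app = DeployTRTEngine.bind(
        "mistralai/Mistral-7B-Instruct-v0.3",  # any supported Hugging Face model ID
        8,    # ccr: batch up to 8 concurrent requests
        0.5,  # batch_time: wait at most 0.5 seconds before dispatching a partial batch
    )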

    Deploy the app by running:

    serve run deploy_tensorrt_engine:app

    4. Call the TensorRT Engine API

    Open a new terminal window and start a second SSH connection to the VM by running the same ssh command from the Ori console that you used at the beginning.

    Below is a simple Python script that sends a single prompt to the API, polls every 0.1 seconds until it receives a response, and then prints the response:

    import requests
    import time

    if __name__ == "__main__":
        url = "http://localhost:8000/"
        prompt = "It's finally working! Now "

        post_response = requests.post(url, params={"prompt": prompt})
        post_response.raise_for_status()

        if post_response.status_code == 200:
            data = post_response.json()
            task_id = data["task_id"]
        else:
            raise RuntimeError(f"Error: {post_response.text}")

        get_url = f"{url}response/{task_id}"
        while True:
            try:
                get_response = requests.get(get_url)
                if get_response.status_code == 200:
                    print("Task completed!")
                    print("Response:", get_response.json())
                    break  # Exit the loop

                elif get_response.status_code == 202:
                    print(f"{get_response.json()['detail']}")

                    # Wait for 0.1 seconds before retrying
                    time.sleep(0.1)

                else:
                    get_response.raise_for_status()

            except requests.exceptions.RequestException as e:
                print(f"An error occurred: {e}")
                break  # Exit the loop on request errors

    To send multiple prompts to be batched together, simply loop the call requests.post(url, params={"prompt": prompt}) over your prompts, collect the task_ids it returns in a list, and then retrieve the responses with a for loop over requests.get(f"{url}response/{task_id}").
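    Here is a minimal sketch of that pattern, reusing the same endpoints as the script above:

    import requests
    import time

    url = "http://localhost:8000/"
    prompts = ["First prompt", "Second prompt", "Third prompt"]

    # Submit every prompt first so the server can batch them together
    task_ids = []
    for prompt in prompts:
        response = requests.post(url, params={"prompt": prompt})
        response.raise_for_status()
        task_ids.append(response.json()["task_id"])

    # Then poll each task until its response is ready
    for task_id in task_ids:
        while True:
            result = requests.get(f"{url}response/{task_id}")
            if result.status_code == 200:
                print(result.json()["text"])
                break
            elif result.status_code != 202:
                result.raise_for_status()
            time.sleep(0.1)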

    You can also call the API from outside the VM by updating the url to replace localhost with the VM's public IP address and handling any required authentication.

    Conclusion

    TensorRT is a powerful tool for accelerating large language model inference, and its deployment on Ori's Virtual Machines provides an efficient and cost-effective solution for high-performance AI applications. While the TensorRT LLM API is still evolving, it already offers impressive features for optimising and managing LLM inference. This tutorial demonstrates how to get started with TensorRT, showcasing its ability to handle batch inference and its potential for real-world applications.

    For ML engineers looking to maximise GPU utilisation and minimise latency, Ori's platform combined with TensorRT offers an ideal setup to experiment, deploy, and scale AI models. As NVIDIA continues to refine TensorRT and its LLM capabilities, its adoption will undoubtedly grow, making it a cornerstone for AI inference in production environments.
