
How to deploy an interactive chatbot with Ori Inference Endpoints and Gradio

Adrian Matei
Posted: February 10, 2025

    Today, there are plenty of options for running inference on popular generative AI models. However, many of these services are either too inflexible for business use cases or too complex and expensive. That's where Ori Inference Endpoints comes in: an effortless, scalable way to deploy state-of-the-art machine learning models on dedicated GPUs. In this tutorial, we'll learn how to create an interactive chatbot application powered by Ori Inference Endpoints.

    Effortless, Secure Inference at Any Scale

    Deploy the model of your choice: Whether it's DeepSeek R1, Llama 3, Qwen or Mistral, deploying a multi-billion-parameter model is just a click away.

    Select a GPU and region: Serve your models on NVIDIA H100 SXM, H100 PCIe, L40S or L4 GPUs (with H200 GPUs coming soon), and deploy in the region that minimizes latency for your users.

    Autoscale without limits: Ori Inference Endpoints automatically scales up or down based on demand. You can also scale all the way down to zero, helping you reduce GPU costs when your endpoints are idle.

    Optimized for quick starts: Models are designed to load quickly at launch, making scaling fast even when starting from zero.

    HTTPS-secured API endpoints: Every endpoint is served over HTTPS and protected by token authentication, keeping it safe from unauthorized use.

    Pay for what you use, by the minute: Starting at $0.021/min, our per-minute pricing helps you keep your AI infrastructure affordable and costs predictable. No long-term commitments, just transparent, usage-based billing.
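
    To put that in perspective: at the entry price, an endpoint running around the clock costs roughly $0.021 × 60 × 24 ≈ $30 per day, and with scale-to-zero you stop paying the moment your endpoint goes idle.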

    Deploy your Chatbot with Ori Inference Endpoints

    Prerequisites

    Before you begin, make sure you:

    • Spin up an Inference Endpoint on Ori. Note the endpoint's API URL and API Access Token.
    • Have Python 3.10 or higher, and preferably a virtual environment.
    • Install the Gradio, Requests and OpenAI packages:
    Bash/Shell
    pip install --upgrade gradio openai requests

    Note: Gradio 5 or higher must be installed.
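
    Optionally, you can smoke-test the endpoint with the Requests package before building the UI. The sketch below assumes the endpoint exposes the OpenAI-compatible chat completions route that the Step 1 script also targets (hence the /openai/v1/chat/completions path and the placeholder model name "model"), and that the ENDPOINT_URL and ENDPOINT_TOKEN environment variables from Step 2 are already set:

    Python
    import os
    import requests

    # Reuses the environment variables configured in Step 2.
    url = f"{os.environ['ENDPOINT_URL']}/openai/v1/chat/completions"
    headers = {"Authorization": f"Bearer {os.environ['ENDPOINT_TOKEN']}"}
    payload = {
        "model": "model",  # placeholder model name, matching the Step 1 script
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    }

    resp = requests.post(url, headers=headers, json=payload, timeout=60)
    resp.raise_for_status()  # raise an error on 4xx/5xx responses
    print(resp.json()["choices"][0]["message"]["content"])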

    Step 1: Build your Gradio bot

    Here's the Python script that brings your chatbot to life. It connects to your Ori inference endpoint to process user messages. In our case, we deployed the Qwen2.5-1.5B-Instruct model to the Ori endpoint. Gradio formats the model's responses in a readable, user-friendly way, so you don't need to restructure the generated text yourself.

    Save the Python code below as chatbot.py

    Python
    import os
    from collections.abc import Generator

    from gradio.chat_interface import ChatInterface

    # API configuration: both values are read from the environment (see Step 2).
    ENDPOINT_URL = os.getenv("ENDPOINT_URL")
    ENDPOINT_TOKEN = os.getenv("ENDPOINT_TOKEN")

    if not ENDPOINT_URL:
        raise ValueError("ENDPOINT_URL environment variable is not set. Please set it before running the script.")

    if not ENDPOINT_TOKEN:
        raise ValueError("ENDPOINT_TOKEN environment variable is not set. Please set it before running the script.")

    try:
        from openai import OpenAI
    except ImportError as e:
        raise ImportError(
            "To use the OpenAI API client, you must install the `openai` package. You can install it with `pip install openai`."
        ) from e


    # Optional system prompt; leave as None to use the model's default behavior.
    system_message = None
    model = "model"  # model identifier sent with each request
    # The endpoint is OpenAI-compatible, so the standard client works once
    # pointed at the endpoint's /openai/v1/ route.
    client = OpenAI(api_key=ENDPOINT_TOKEN, base_url=f"{ENDPOINT_URL}/openai/v1/")
    start_message = (
        [{"role": "system", "content": system_message}] if system_message else []
    )
    streaming = True


    def open_api(message: str, history: list | None) -> str | None:
        """Send the history plus the new message and return the complete reply."""
        history = history or start_message
        if len(history) > 0 and isinstance(history[0], (list, tuple)):
            # Convert Gradio's legacy tuple format to OpenAI-style message dicts.
            history = ChatInterface._tuples_to_messages(history)
        return (
            client.chat.completions.create(
                model=model,
                messages=history + [{"role": "user", "content": message}],
            )
            .choices[0]
            .message.content
        )


    def open_api_stream(
        message: str, history: list | None
    ) -> Generator[str, None, None]:
        """Like open_api, but yields the reply incrementally as tokens stream in."""
        history = history or start_message
        if len(history) > 0 and isinstance(history[0], (list, tuple)):
            history = ChatInterface._tuples_to_messages(history)
        stream = client.chat.completions.create(
            model=model,
            messages=history + [{"role": "user", "content": message}],
            stream=True,
        )
        response = ""
        for chunk in stream:
            if chunk.choices[0].delta.content is not None:
                response += chunk.choices[0].delta.content
                yield response


    ChatInterface(
        open_api_stream if streaming else open_api,
        type="messages",
    ).launch(share=True)
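
    A note on the final line: launch(share=True) serves the app locally and also generates a temporary public gradio.live link you can share with others. If you would rather keep the chatbot local-only, a minimal variation looks like this (the title parameter is optional, and the label shown here is our own choice):

    Python
    # Local-only launch: by default Gradio serves the app at http://127.0.0.1:7860.
    ChatInterface(
        open_api_stream if streaming else open_api,
        type="messages",
        title="Ori Inference Endpoints Chatbot",  # optional page title (our choice)
    ).launch()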

    Step 2: Set the ENDPOINT_TOKEN and ENDPOINT_URL environment variables:

    Bash/Shell
    export ENDPOINT_TOKEN="your_api_token"
    export ENDPOINT_URL="your_url"

    Step 3: Run the script

    Bash/Shell
    python chatbot.py

    Step 4: Open the Gradio link provided in your browser

    See your chatbot in action

    Once your chatbot is live, it will look something like this:

    [Image: chatbot in action]

    The input box lets you type messages, and the conversation is displayed above it.

    Run limitless AI Inference on Ori

    Serve state-of-the-art AI models to your users in minutes, without breaking your infrastructure budget.

