How to deploy an interactive chatbot with Ori Inference Endpoints and Gradio

Today, there are plenty of options for running inference on popular generative AI models. However, many of these inference services are either too inflexible for business use cases or too complex and expensive. That's where Ori Inference Endpoints comes in: an effortless, scalable way to deploy state-of-the-art machine learning models on dedicated GPUs. In this tutorial, we'll learn how to create an interactive chatbot application powered by Ori Inference Endpoints.
Effortless, Secure Inference at any Scale
Deploy the model of your choice: Whether it's DeepSeek R1, Llama 3, Qwen or Mistral, deploying a multi-billion parameter model is just a click away.
Select a GPU and region, unlock seamless inference: Serve your models on NVIDIA H100 SXM, H100 PCIe, L40S or L4 GPUs, with H200 GPUs coming soon, and deploy in a region that minimizes latency for your users.
Autoscale without limits: Ori Inference Endpoints automatically scales up or down based on demand. You can also scale all the way down to zero, helping you reduce GPU costs when your endpoints are idle.
Optimized for quick starts: Model loading is tuned for fast startup, so scaling is quick even when starting from zero.
HTTPS secured API endpoints: Experience peace of mind with HTTPS endpoints and authentication to keep them safe from unauthorized use.
Pay for what you use, by the minute: Starting at $0.021/min, our per-minute pricing helps you keep your AI infrastructure affordable and costs predictable. No long-term commitments, just transparent, usage-based billing.
Deploy your Chatbot with Ori Inference Endpoints
Prerequisites
Before you begin, make sure you:
- Spin up an Inference Endpoint on Ori. Note the endpoint's API URL and API Access Token.
- Have Python 3.10 or higher, and preferably a virtual environment.
- Install the Gradio, Requests, and OpenAI packages:
pip install --upgrade gradio requests openai
Note: Gradio 5 or higher must be installed.
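Optional: before wiring up the UI, you can sanity-check that your endpoint is reachable and your token works. The sketch below assumes the endpoint implements the standard OpenAI-compatible model-listing route under /openai/v1/ (the same base path the chatbot script below uses); if it doesn't, a small chat completion request is the definitive test. Replace the placeholder URL and token with your own values.

import requests

# Placeholders: substitute the API URL and API Access Token from your Ori endpoint
ENDPOINT_URL = "https://your-endpoint.example.com"
ENDPOINT_TOKEN = "your_api_token"

# Ori endpoints serve an OpenAI-compatible API under /openai/v1/
resp = requests.get(
    f"{ENDPOINT_URL}/openai/v1/models",
    headers={"Authorization": f"Bearer {ENDPOINT_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()  # raises if the URL or token is wrong
print(resp.json())       # should list the model your endpoint serves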
Step 1: Build Your Gradio bot
Here's the Python script that brings your chatbot to life. It connects to your Ori inference endpoint to process user messages; in our case, we deployed the Qwen 2.5 1.5B-instruct model to the endpoint. Gradio handles formatting the model's response in a user-friendly, readable way, so you don't need to worry about restructuring the generated text.
Save the Python code below as chatbot.py
import os
from collections.abc import Generator

from gradio.chat_interface import ChatInterface

# API configuration: both values come from the environment (see Step 2)
ENDPOINT_URL = os.getenv("ENDPOINT_URL")
ENDPOINT_TOKEN = os.getenv("ENDPOINT_TOKEN")

if not ENDPOINT_URL:
    raise ValueError("ENDPOINT_URL environment variable is not set. Please set it before running the script.")

if not ENDPOINT_TOKEN:
    raise ValueError("ENDPOINT_TOKEN environment variable is not set. Please set it before running the script.")

try:
    from openai import OpenAI
except ImportError as e:
    raise ImportError(
        "To use the OpenAI API client, you must install the `openai` package. You can install it with `pip install openai`."
    ) from e


system_message = None  # optional: set to a string to prepend a system prompt
model = "model"  # model name passed to the endpoint's OpenAI-compatible API
client = OpenAI(api_key=ENDPOINT_TOKEN, base_url=f"{ENDPOINT_URL}/openai/v1/")
start_message = (
    [{"role": "system", "content": system_message}] if system_message else []
)
streaming = True  # set to False to return each reply in a single piece


def open_api(message: str, history: list | None) -> str | None:
    """Send the conversation to the endpoint and return the full reply."""
    history = history or start_message
    if len(history) > 0 and isinstance(history[0], (list, tuple)):
        # Convert legacy (user, bot) tuple history to OpenAI-style message dicts
        history = ChatInterface._tuples_to_messages(history)
    return (
        client.chat.completions.create(
            model=model,
            messages=history + [{"role": "user", "content": message}],
        )
        .choices[0]
        .message.content
    )


def open_api_stream(
    message: str, history: list | None
) -> Generator[str, None, None]:
    """Stream the reply, yielding the text accumulated so far after each chunk."""
    history = history or start_message
    if len(history) > 0 and isinstance(history[0], (list, tuple)):
        history = ChatInterface._tuples_to_messages(history)
    stream = client.chat.completions.create(
        model=model,
        messages=history + [{"role": "user", "content": message}],
        stream=True,
    )
    response = ""
    for chunk in stream:
        if chunk.choices[0].delta.content is not None:
            response += chunk.choices[0].delta.content
            yield response


ChatInterface(
    open_api_stream if streaming else open_api,
    type="messages",
).launch(share=True)
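The script uses the endpoint's default generation settings. The OpenAI client also accepts standard sampling parameters on chat.completions.create; whether each one is honored depends on the model server behind your endpoint, so treat the values below as illustrative. For example, the streaming call could become:

stream = client.chat.completions.create(
    model=model,
    messages=history + [{"role": "user", "content": message}],
    stream=True,
    temperature=0.7,  # illustrative: lower values make output more deterministic
    max_tokens=512,   # illustrative: cap on the number of generated tokens
    top_p=0.9,        # illustrative: nucleus-sampling threshold
)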
Step 2: Set the ENDPOINT_TOKEN and ENDPOINT_URL environment variables:
export ENDPOINT_TOKEN="your_api_token"
export ENDPOINT_URL="your_url"
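Note: the export syntax above is for macOS and Linux shells. On Windows PowerShell, the equivalent is $env:ENDPOINT_TOKEN = "your_api_token" (and likewise for ENDPOINT_URL).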
Step 3: Run the script
python chatbot.py
Step 4: Open the Gradio link provided in your browser
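Note that launch(share=True) also creates a temporary public *.gradio.live link, which is handy for demos but exposes the app beyond your machine. To keep it local, drop the share flag; a minimal variation (the port shown is simply Gradio's default, made explicit):

ChatInterface(
    open_api_stream if streaming else open_api,
    type="messages",
).launch(server_name="127.0.0.1", server_port=7860)  # local-only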
See your chatbot in action
Once your chatbot is live, it will look something like this:

The input box allows you to type messages, and the area below it displays the conversation.
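If you want to polish the interface further, ChatInterface accepts optional cosmetic parameters such as title, description and examples; the values below are purely illustrative:

ChatInterface(
    open_api_stream if streaming else open_api,
    type="messages",
    title="Qwen 2.5 Chat",  # illustrative title shown above the chat area
    description="A chatbot served by an Ori Inference Endpoint.",
    examples=["Tell me a fun fact about GPUs."],  # starter prompts shown in the UI
).launch(share=True)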
Run limitless AI Inference on Ori
Serve state-of-the-art AI models to your users in minutes, without breaking your infrastructure budget.
- Deploy your favorite AI model in a single click. Ori Inference Endpoints makes inference effortless.
- Automatically scale inference up or down based on demand, from thousands of GPUs all the way down to zero.
- Per-minute pricing helps you keep your inference infrastructure affordable and costs predictable.