How to deploy an interactive chatbot with Ori Inference Endpoints and Gradio

Today, there are plenty of options for running inference on popular generative AI models. However, many of these inference services are either too inflexible for business use cases or too complex and expensive. That's where Ori Inference Endpoints comes in: an effortless, scalable way to deploy state-of-the-art machine learning models on dedicated GPUs. In this tutorial, we'll learn how to create an interactive chatbot application powered by Ori Inference Endpoints.
Effortless, Secure Inference at any Scale
Deploy the model of your choice: Whether it's DeepSeek R1, Llama 3, Qwen or Mistral, deploying a multi-billion parameter model is just a click away.
Select a GPU and region, unlock seamless inference: Serve your models on NVIDIA H100 SXM, H100 PCIe, L40S or L4 GPUs, with H200 GPUs coming soon, and deploy in a region that minimizes latency for your users.
Autoscale without limits: Ori Inference Endpoints automatically scales up or down based on demand. You can also scale all the way down to zero, helping you reduce GPU costs when your endpoints are idle.
Optimized for quick starts: Model loading is tuned for fast startup, so scaling is quick even when starting from zero.
HTTPS secured API endpoints: Experience peace of mind with HTTPS endpoints and authentication to keep them safe from unauthorized use.
Pay for what you use, by the minute: Starting at $0.021/min, our per-minute pricing helps you keep your AI infrastructure affordable and costs predictable. No long-term commitments, just transparent, usage-based billing.
Deploy your Chatbot with Ori Inference Endpoints
Prerequisites
Before you begin, make sure you:
- Spin up an Inference Endpoint on Ori. Note the endpoint's API URL and API Access Token.
- Have Python 3.10 or higher, and preferably a virtual environment.
- Install the Gradio, Requests, and OpenAI packages:
pip install --upgrade gradio requests openai
Note: Gradio 5 or higher must be installed.
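Optional: before wiring up the UI, you can sanity-check that your endpoint is reachable and your token works. The sketch below assumes the endpoint implements the standard OpenAI-compatible model-listing route under /openai/v1/ (the same base path the chatbot script below uses); if it doesn't, a small chat completion request is the definitive test. Replace the placeholder URL and token with your own values.

import requests

# Placeholders: substitute the API URL and API Access Token from your Ori endpoint
ENDPOINT_URL = "https://your-endpoint.example.com"
ENDPOINT_TOKEN = "your_api_token"

# Ori endpoints serve an OpenAI-compatible API under /openai/v1/
resp = requests.get(
    f"{ENDPOINT_URL}/openai/v1/models",
    headers={"Authorization": f"Bearer {ENDPOINT_TOKEN}"},
    timeout=30,
)
resp.raise_for_status()  # raises if the URL or token is wrong
print(resp.json())       # should list the model your endpoint serves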
Step 1: Build Your Gradio bot
Here's the Python script that brings your chatbot to life. It connects to your Ori inference endpoint to process user messages; in our case, we deployed the Qwen 2.5 1.5B-instruct model to the endpoint. Gradio handles formatting the model's response in a user-friendly, readable way, so you don't need to worry about restructuring the generated text.
Save the Python code below as chatbot.py
import os
from collections.abc import Generator

from gradio.chat_interface import ChatInterface

# API configuration: both values come from the environment (see Step 2)
ENDPOINT_URL = os.getenv("ENDPOINT_URL")
ENDPOINT_TOKEN = os.getenv("ENDPOINT_TOKEN")

if not ENDPOINT_URL:
    raise ValueError("ENDPOINT_URL environment variable is not set. Please set it before running the script.")

if not ENDPOINT_TOKEN:
    raise ValueError("ENDPOINT_TOKEN environment variable is not set. Please set it before running the script.")

try:
    from openai import OpenAI
except ImportError as e:
    raise ImportError(
        "To use the OpenAI API client, you must install the `openai` package. You can install it with `pip install openai`."
    ) from e


system_message = None  # optional: set to a string to prepend a system prompt
model = "model"  # model name passed to the endpoint's OpenAI-compatible API
client = OpenAI(api_key=ENDPOINT_TOKEN, base_url=f"{ENDPOINT_URL}/openai/v1/")
start_message = (
    [{"role": "system", "content": system_message}] if system_message else []
)
streaming = True  # set to False to return each reply in a single piece


def open_api(message: str, history: list | None) -> str | None:
    """Send the conversation to the endpoint and return the full reply."""
    history = history or start_message
    if len(history) > 0 and isinstance(history[0], (list, tuple)):
        # Convert legacy (user, bot) tuple history to OpenAI-style message dicts
        history = ChatInterface._tuples_to_messages(history)
    return (
        client.chat.completions.create(
            model=model,
            messages=history + [{"role": "user", "content": message}],
        )
        .choices[0]
        .message.content
    )


def open_api_stream(
    message: str, history: list | None
) -> Generator[str, None, None]:
    """Stream the reply, yielding the text accumulated so far after each chunk."""
    history = history or start_message
    if len(history) > 0 and isinstance(history[0], (list, tuple)):
        history = ChatInterface._tuples_to_messages(history)
    stream = client.chat.completions.create(
        model=model,
        messages=history + [{"role": "user", "content": message}],
        stream=True,
    )
    response = ""
    for chunk in stream:
        if chunk.choices[0].delta.content is not None:
            response += chunk.choices[0].delta.content
            yield response


ChatInterface(
    open_api_stream if streaming else open_api,
    type="messages",
).launch(share=True)
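The script uses the endpoint's default generation settings. The OpenAI client also accepts standard sampling parameters on chat.completions.create; whether each one is honored depends on the model server behind your endpoint, so treat the values below as illustrative. For example, the streaming call could become:

stream = client.chat.completions.create(
    model=model,
    messages=history + [{"role": "user", "content": message}],
    stream=True,
    temperature=0.7,  # illustrative: lower values make output more deterministic
    max_tokens=512,   # illustrative: cap on the number of generated tokens
    top_p=0.9,        # illustrative: nucleus-sampling threshold
)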
Step 2: Set the ENDPOINT_TOKEN and ENDPOINT_URL environment variables:
export ENDPOINT_TOKEN="your_api_token"
export ENDPOINT_URL="your_url"
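Note: the export syntax above is for macOS and Linux shells. On Windows PowerShell, the equivalent is $env:ENDPOINT_TOKEN = "your_api_token" (and likewise for ENDPOINT_URL).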
Step 3: Run the script
python chatbot.py
Step 4: Open the Gradio link provided in your browser
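Note that launch(share=True) also creates a temporary public *.gradio.live link, which is handy for demos but exposes the app beyond your machine. To keep it local, drop the share flag; a minimal variation (the port shown is simply Gradio's default, made explicit):

ChatInterface(
    open_api_stream if streaming else open_api,
    type="messages",
).launch(server_name="127.0.0.1", server_port=7860)  # local-only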
See your chatbot in action
Once your chatbot is live, it will look something like this:

The input box allows you to type messages, and the area below it displays the conversation.
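If you want to polish the interface further, ChatInterface accepts optional cosmetic parameters such as title, description and examples; the values below are purely illustrative:

ChatInterface(
    open_api_stream if streaming else open_api,
    type="messages",
    title="Qwen 2.5 Chat",  # illustrative title shown above the chat area
    description="A chatbot served by an Ori Inference Endpoint.",
    examples=["Tell me a fun fact about GPUs."],  # starter prompts shown in the UI
).launch(share=True)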
Run limitless AI Inference on Ori
Serve state-of-the-art AI models to your users in minutes, without breaking your infrastructure budget.
- Deploy your favorite AI model in a single click. Ori Inference Endpoints makes inference effortless.
- Automatically scale inference up or down based on demand, from thousands of GPUs all the way down to zero.
- Per-minute pricing helps you keep your inference infrastructure affordable and costs predictable.