TL;DR
ONNX Runtime is your cross-platform, hardware-accelerated inference engine to deploy fast and private AI anywhere.
What even is ONNX Runtime?
ONNX Runtime (pronounced like “Onix”, the rock-snake Pokémon) is a performance-focused scoring and inference engine for models in the Open Neural Network Exchange (ONNX) format, an open standard for representing machine learning models. It is designed to handle heavy workloads and deliver fast, reliable predictions in production. Developed by Microsoft, it provides APIs in Python, C++, C#, Java, and JavaScript, and runs on Windows, Linux, and macOS. Its core is implemented in C++ for speed, and it works with ONNX models exported from PyTorch, TensorFlow, Keras, scikit-learn, LightGBM, XGBoost, and more.
Why even use it?
Imagine you’ve built an application that classifies your pet’s breed in real time — no cloud round-trip needed. Video in, and “Bichon Frisé!” out, in 50 ms. That’s ONNX Runtime shaving off latency so your users never notice the AI under the hood. (Don’t ask me how I know what a Bichon Frisé is)
ONNX Runtime delivers significant benefits by decoupling model training from deployment: you train in your preferred framework and deploy with a single, optimized runtime. It applies graph-level optimizations such as operator fusion, constant folding, and dead-code elimination offline, reducing model size and startup latency without runtime overhead. It can also hand execution off to hardware accelerators through execution providers (NVIDIA CUDA and TensorRT, Intel oneDNN and OpenVINO, the Arm Compute Library, and others) to further boost inference throughput and lower costs.
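To make that concrete, here's a minimal sketch (assuming you already have a model.onnx on disk and a GPU-enabled onnxruntime build) of cranking graph optimizations up to the max and asking for the CUDA execution provider, with a CPU fallback:

import onnxruntime as ort

# enable the full set of graph optimizations (operator fusion,
# constant folding, etc.) and cache the optimized graph to disk
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.optimized_model_filepath = "model_optimized.onnx"

# ask for CUDA first; ONNX Runtime falls back to CPU if it's not available
session = ort.InferenceSession(
    "model.onnx",
    sess_options,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # the providers actually in use

The provider list is ordered by preference, so a box without a GPU will just quietly run everything on the CPUExecutionProvider.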
ONNX Runtime for your GPU-poor infra
Deploying models locally on devices (laptops, smartphones, embedded systems) with ONNX Runtime enables ultra-low-latency predictions and ensures data never leaves the device, preserving user privacy and reducing dependency on network connectivity. Edge and mobile deployments benefit from the same optimization pipeline, as ONNX Runtime binaries can be tailored with only necessary operators and optimizations enabled, minimizing binary size and memory footprint. Furthermore, experimental on-device training extends the capability to fine-tune models locally, offering personalized experiences without compromising data security.
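As a rough sketch of what a lean on-device setup can look like (model.onnx standing in for whatever you've exported), you can pin the session to the CPU provider and cap its thread count so it plays nicely with the rest of your app:

import onnxruntime as ort

# see which execution providers this particular build was compiled with
print(ort.get_available_providers())

# a small-footprint, CPU-only session for constrained devices
sess_options = ort.SessionOptions()
sess_options.intra_op_num_threads = 1   # threads used within an operator
sess_options.inter_op_num_threads = 1   # threads used across operators
session = ort.InferenceSession(
    "model.onnx", sess_options, providers=["CPUExecutionProvider"]
)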
Converting a model to ONNX
Models trained in PyTorch, TensorFlow, Keras, scikit-learn, and more can be exported to the ONNX format (via tools like torch.onnx and tf2onnx), after which ONNX Runtime gives you consistent inference no matter where the model came from.
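(Quick aside: from PyTorch the export is essentially one call to torch.onnx.export. Here's a minimal sketch with a throwaway linear model, just to show the shape of the call.)

import torch
import torch.nn as nn

# a throwaway model standing in for whatever you actually trained
model = nn.Linear(4, 3)
model.eval()

# trace the model with an example input and write out an ONNX graph
# with a dynamic batch dimension
torch.onnx.export(
    model,
    torch.randn(1, 4),
    "torch_model.onnx",
    input_names=["float_input"],
    output_names=["output"],
    dynamic_axes={"float_input": {0: "batch"}},
)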
For simplicity, let’s just look at a quick scikit-learn conversion example.
Let’s go ahead and install the necessary dependencies first:
pip install numpy scikit-learn onnxruntime skl2onnx
We’ll now build a simple LogisticRegression model and use the most original and niche dataset for this: the Iris dataset. (heavy stuff 🙂)
# import numpy, scikit-learn, and the ONNX tooling
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import onnxruntime as ort
# load the iris dataset
iris = load_iris()
X, y = iris.data.astype(np.float32), iris.target
# let's stick to logistic regression for this, you can use your dense neural nets later
model = LogisticRegression(max_iter=1000)
model.fit(X, y)
# define the input type - onnx conversion needs this
initial_type = [("float_input", FloatTensorType([None, X.shape[1]]))]
# convert to ONNX
onx_model = convert_sklearn(model, initial_types=initial_type)
# save your onnx model
with open("model.onnx", "wb") as f:
    f.write(onx_model.SerializeToString())
# test inference to check if this actually works
session = ort.InferenceSession("model.onnx")
input_name = session.get_inputs()[0].name
output_name = session.get_outputs()[0].name
predictions = session.run([output_name], {input_name: X[:5]})
# hope it predicts the iris type right so that it drives your company's shareholder value
print(predictions)
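One more detail: skl2onnx classifiers actually expose two outputs, the predicted label and the per-class probabilities (a list of dicts by default). Passing None to session.run fetches both, continuing from the session above:

# fetch every declared output: predicted labels plus class probabilities
labels, probabilities = session.run(None, {input_name: X[:5]})
print(labels)
print(probabilities[0])  # class-probability mapping for the first sample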
To sum stuff up
ONNX Runtime empowers developers to deploy fast, resource-efficient, and privacy-preserving inference pipelines on any device, from cloud servers to edge hardware. Its seamless interoperability with major ML frameworks and on-device optimization capabilities make it the go-to choice for production scenarios.
ONNX Runtime also offers a high-performance training engine for models defined using the PyTorch frontend. I might make a post on this if it gathers any interest, but for now, enjoy the docs.
I also plan to put out an ONNX Runtime GenAI article for running ONNX generative AI models, because who doesn’t want fast on-device inference for your “How many R’s are in strawberry” prompts? :)