Nvidia’s Triton Inference Server is the right engine when your workload is broader than a single LLM: it serves TensorRT, ONNX, PyTorch, and TensorFlow models through one unified API on dedicated GPU hosting. It is overkill for a single chat model but ideal for vision, audio, and ensemble pipelines.
When Triton Fits
Triton excels when:
- You serve multiple model types (vision + audio + text) from one server
- You need ensemble pipelines (OCR -> LLM -> speech synthesis)
- You have TensorRT-optimised models
- You need cross-framework uniformity for client code
For pure LLM serving, vLLM or SGLang are easier and faster.
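That cross-framework uniformity comes from Triton exposing the same KServe v2 inference protocol (`POST /v2/models/<name>/infer`) for every backend. A minimal sketch of building such a request body with the standard library — the input names, datatypes, and shapes here are hypothetical, not tied to any real model:

```python
import json

def build_infer_request(input_name, datatype, shape, data):
    """Build a KServe v2 infer request body: the same JSON shape
    regardless of whether the target model is ONNX, TensorRT, or PyTorch."""
    return {
        "inputs": [
            {"name": input_name, "datatype": datatype, "shape": shape, "data": data}
        ]
    }

# Hypothetical audio and text requests -- only the tensor metadata differs:
audio_req = build_infer_request("audio", "FP32", [1, 16000], [0.0] * 16000)
text_req = build_infer_request("prompt", "BYTES", [1], ["describe this image"])
body = json.dumps(audio_req)
```

Client code only changes the tensor names and datatypes per model; the transport, endpoint shape, and response handling stay identical.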
Model Repository
```
models/
  whisper/
    1/
      model.onnx
    config.pbtxt
  yolov8/
    1/
      model.plan
    config.pbtxt
  llama3/
    1/
      model.py        # Python backend
    config.pbtxt
```
Each model has a versioned directory and a config.pbtxt that declares inputs, outputs, and instance count.
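Triton refuses to load models that break this layout, so it is worth checking before deployment. A small stdlib-only sketch (the `validate_repository` helper is illustrative, not part of Triton) that enforces the two rules above:

```python
from pathlib import Path

def validate_repository(root):
    """Check each model directory has a config.pbtxt and at least one
    numeric version subdirectory, as Triton's repository layout requires."""
    problems = []
    for model_dir in sorted(Path(root).iterdir()):
        if not model_dir.is_dir():
            continue
        if not (model_dir / "config.pbtxt").is_file():
            problems.append(f"{model_dir.name}: missing config.pbtxt")
        versions = [d for d in model_dir.iterdir() if d.is_dir() and d.name.isdigit()]
        if not versions:
            problems.append(f"{model_dir.name}: no numeric version directory")
    return problems
```

Run it against `models/` before pointing `--model-repository` at it; an empty list means the layout is sound.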
Config Knobs
```
name: "whisper"
backend: "onnxruntime"
max_batch_size: 16
input [
  { name: "audio", data_type: TYPE_FP32, dims: [-1] }
]
output [
  { name: "text", data_type: TYPE_STRING, dims: [1] }
]
instance_group [
  { count: 2, kind: KIND_GPU }
]
dynamic_batching { max_queue_delay_microseconds: 10000 }
```
instance_group sets how many execution instances of the model Triton runs on the GPU — here two, so two requests can execute concurrently at the cost of extra GPU memory. dynamic_batching lets Triton combine requests arriving within the queue-delay window (here up to 10 ms) into one batch.
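The batching behaviour is easier to see in miniature. This is an illustrative model of the policy, not Triton code: requests queue until either the batch is full or the delay window closes, then execute together.

```python
def dynamic_batch(arrivals, max_queue_delay_us, max_batch_size):
    """Group request arrival timestamps (microseconds) into batches.

    A batch closes when it reaches max_batch_size or when the next
    request arrives more than max_queue_delay_us after the batch opened.
    """
    batches = []
    current, opened_at = [], None
    for t in arrivals:
        if current and (len(current) >= max_batch_size
                        or t - opened_at > max_queue_delay_us):
            batches.append(current)
            current, opened_at = [], None
        if not current:
            opened_at = t
        current.append(t)
    if current:
        batches.append(current)
    return batches

# Requests at 0, 3 ms, 8 ms, and 25 ms with a 10 ms window:
print(dynamic_batch([0, 3000, 8000, 25000], 10000, 16))
# -> [[0, 3000, 8000], [25000]]
```

The first three requests land inside one 10 ms window and run as a single batch of three; the straggler at 25 ms runs alone. Raising `max_queue_delay_microseconds` trades latency for larger batches.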
Ensembles
An ensemble is a named pipeline across multiple models, chained via tensor outputs. For OCR-then-LLM, you declare:
```
platform: "ensemble"
ensemble_scheduling {
  step [
    { model_name: "paddleocr" ... },
    { model_name: "llama3" ... }
  ]
}
```
Clients hit the ensemble endpoint; Triton routes internally. For more complex logic use BLS (Business Logic Scripting) via a Python backend.
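Conceptually, each ensemble step uses an `input_map` and `output_map` to wire named ensemble tensors to a model's inputs and outputs. A pure-Python sketch of that routing — `run_model` stands in for Triton dispatching to a backend, and the model names and tensor keys are hypothetical:

```python
def run_model(name, tensors):
    """Stand-in for Triton dispatching to a backend (illustrative only)."""
    if name == "paddleocr":                       # image -> extracted text
        return {"ocr_text": f"text from {tensors['image']}"}
    if name == "llama3":                          # text -> transformed text
        return {"summary": tensors["prompt"].upper()}
    raise KeyError(name)

def run_ensemble(steps, tensors):
    """Chain models the way ensemble_scheduling does: each step maps
    ensemble tensors to model inputs, runs the model, and maps model
    outputs back to ensemble tensors for later steps."""
    for name, input_map, output_map in steps:
        inputs = {model_in: tensors[tensor] for model_in, tensor in input_map.items()}
        outputs = run_model(name, inputs)
        for model_out, tensor in output_map.items():
            tensors[tensor] = outputs[model_out]
    return tensors

steps = [
    ("paddleocr", {"image": "IMAGE"}, {"ocr_text": "OCR_TEXT"}),
    ("llama3", {"prompt": "OCR_TEXT"}, {"summary": "SUMMARY"}),
]
result = run_ensemble(steps, {"IMAGE": "invoice.png"})
```

The chaining is purely declarative: step two consumes `OCR_TEXT` because its `input_map` says so, which is why ensembles suit fixed pipelines while conditional or looping logic needs BLS instead.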
Triton Preconfigured for Your Models
Multi-framework GPU inference servers on our UK hosting with model repositories in place.
Browse GPU Servers

See the OCR-LLM pipeline and self-hosted OpenAI-compatible API guides for simpler alternatives when Triton’s breadth isn’t needed.