What Is TensorRT-LLM and Why Use It
TensorRT-LLM is NVIDIA’s high-performance inference library that compiles LLMs into optimised TensorRT engines. On a dedicated GPU server, it delivers the absolute fastest token generation by fusing operations, applying kernel-level optimisations, and using in-flight batching. The trade-off is a more complex setup compared to vLLM or Ollama, but the performance gains are substantial for latency-critical workloads.
TensorRT-LLM is the right choice when you need every last token per second, when you are serving thousands of concurrent users, or when your fintech or voice agent application has strict latency SLAs.
GPU Requirements and Recommendations
| GPU | VRAM | TRT-LLM Support | Best For |
|---|---|---|---|
| RTX 4060 | 8 GB | Yes | 7B INT4 engines |
| RTX 3090 | 24 GB | Yes | 7B-13B engines, 34B INT4 |
| RTX 5080 | 16 GB | Yes (Blackwell) | 7B FP16, FP4 engines |
| RTX 5090 | 32 GB | Yes (Blackwell) | 13B FP16, 34B+ INT4 |
TensorRT-LLM requires building a compiled engine specific to your GPU architecture (Ampere, Ada Lovelace, Blackwell). Engines are not portable between architectures.
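You can check which architecture your card belongs to via its CUDA compute capability (for example with `nvidia-smi --query-gpu=name,compute_cap --format=csv`). As a sketch, a small helper (hypothetical, not part of TensorRT-LLM) that maps a compute capability to the architecture family an engine is tied to:

```python
# Map CUDA compute capability to the architecture family a
# TensorRT-LLM engine is compiled for. Illustrative helper only.
ARCH_BY_MAJOR = {
    9: "Hopper",
    10: "Blackwell",     # datacentre Blackwell (B100/B200)
    12: "Blackwell",     # consumer Blackwell (RTX 50xx) reports 12.x
}

def engine_arch(compute_cap: str) -> str:
    """Return the architecture family for a compute capability like '8.6'."""
    major, minor = (int(x) for x in compute_cap.split("."))
    if major == 8:
        # 8.0/8.6 are Ampere (A100, RTX 30xx); 8.9 is Ada (RTX 40xx)
        return "Ada Lovelace" if minor == 9 else "Ampere"
    return ARCH_BY_MAJOR.get(major, f"unknown (sm_{major}{minor})")

def engines_compatible(cap_a: str, cap_b: str) -> bool:
    """An engine built on one architecture will not load on another."""
    return engine_arch(cap_a) == engine_arch(cap_b)
```

So an engine built on an RTX 3090 (compute capability 8.6, Ampere) cannot be reused on an RTX 4060 (8.9, Ada Lovelace), even though both are "RTX" cards.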
Installation and Engine Building
```bash
# Pull the official TensorRT-LLM Docker image
docker pull nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3

# Clone the TensorRT-LLM examples
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM/examples/llama

# Convert Llama 3 8B to a TRT-LLM checkpoint
python convert_checkpoint.py \
  --model_dir /models/Llama-3-8B-Instruct \
  --output_dir /engines/llama3-8b-ckpt \
  --dtype float16

# Build the TensorRT engine
trtllm-build \
  --checkpoint_dir /engines/llama3-8b-ckpt \
  --output_dir /engines/llama3-8b-engine \
  --max_batch_size 16 \
  --max_input_len 2048 \
  --max_seq_len 4096 \
  --gemm_plugin float16
```
Engine building takes 10-30 minutes depending on model size and GPU. The resulting engine is a serialized plan file optimised for your specific hardware and TensorRT-LLM version. For basic CUDA setup, see the CUDA installation guide.
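The `--max_batch_size` and `--max_seq_len` flags determine how much VRAM the engine can dedicate to the KV cache at runtime. A rough upper-bound estimate (a sketch; the real allocator is more nuanced, and the Llama 3 8B figures below, 32 layers, 8 KV heads with grouped-query attention, head dimension 128, are assumptions drawn from the published architecture):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, max_seq_len,
                   max_batch_size, bytes_per_elem=2):
    """Upper-bound KV-cache allocation implied by the engine build flags.

    The factor of 2 covers the separate K and V tensors;
    bytes_per_elem=2 assumes an FP16 cache.
    """
    return (2 * layers * kv_heads * head_dim
            * max_seq_len * max_batch_size * bytes_per_elem)

# Llama 3 8B: 32 layers, 8 KV heads (GQA), head_dim 128,
# using the build flags from the example above
gib = kv_cache_bytes(32, 8, 128, max_seq_len=4096, max_batch_size=16) / 2**30
print(f"{gib:.0f} GiB")  # 8 GiB at full batch and sequence length
```

That 8 GiB sits on top of the ~16 GB of FP16 weights, which is why the FP16 engine with batch 16 is a tight fit even on a 24 GB card, and why quantisation (next section) matters.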
INT4 and INT8 Quantisation
```bash
# Build an INT4 AWQ engine for larger models
python convert_checkpoint.py \
  --model_dir /models/Llama-3-13B-Instruct \
  --output_dir /engines/llama3-13b-int4-ckpt \
  --dtype float16 \
  --use_weight_only \
  --weight_only_precision int4_awq

trtllm-build \
  --checkpoint_dir /engines/llama3-13b-int4-ckpt \
  --output_dir /engines/llama3-13b-int4-engine \
  --max_batch_size 8 \
  --max_input_len 2048 \
  --max_seq_len 4096
```
TensorRT-LLM supports INT4 AWQ, INT4 GPTQ, INT8 SmoothQuant, and FP8 quantisation. AWQ generally provides the best quality-speed trade-off. For a detailed quantisation comparison, see the GPTQ vs AWQ vs GGUF guide.
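The VRAM saving from weight-only quantisation follows directly from the bit width. A back-of-the-envelope estimate (ignoring the small overhead AWQ adds for per-group scales):

```python
def weight_gib(params_billions: float, bits: float) -> float:
    """Approximate weight memory in GiB for a given precision."""
    return params_billions * 1e9 * bits / 8 / 2**30

# A 13B-class model, as in the INT4 example above
fp16 = weight_gib(13, 16)  # ~24 GiB: no room left for KV cache on a 24 GB card
int4 = weight_gib(13, 4)   # ~6 GiB: plenty of headroom for the KV cache
```

This is why a 13B model only becomes practical on an RTX 3090 as an INT4 engine, matching the GPU table at the top of the article.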
Throughput Comparison: TRT-LLM vs vLLM
| Model | GPU | TRT-LLM (t/s) | vLLM (t/s) | TRT-LLM Gain |
|---|---|---|---|---|
| Llama 3 8B FP16 | RTX 3090 | ~72 | ~55 | +31% |
| Llama 3 8B INT4 | RTX 3090 | ~108 | ~82 | +32% |
| Llama 3 8B FP16 | RTX 5090 | ~148 | ~115 | +29% |
| Llama 3 13B INT4 | RTX 3090 | ~50 | ~38 | +32% |
| Mistral 7B FP16 | RTX 5080 | ~118 | ~92 | +28% |
In these runs, TensorRT-LLM consistently delivers roughly 28-32% faster inference than vLLM for compiled engines, and the gap widens further for batched workloads. Compare costs across GPU tiers with the LLM cost calculator.
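The gain column is simply the ratio of the two throughput figures. As a quick check of the arithmetic behind the table rows above:

```python
def gain(trtllm_tps: float, vllm_tps: float) -> str:
    """Percentage speed-up of TRT-LLM over vLLM, rounded as in the table."""
    return f"+{round((trtllm_tps / vllm_tps - 1) * 100)}%"

# Token/s figures taken from the comparison table above
rows = [
    ("Llama 3 8B FP16 / RTX 3090", 72, 55),
    ("Llama 3 8B INT4 / RTX 3090", 108, 82),
    ("Llama 3 8B FP16 / RTX 5090", 148, 115),
]
for name, trtllm, vllm in rows:
    print(name, gain(trtllm, vllm))  # +31%, +32%, +29%
```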
Triton Inference Server for Production
For production deployment, pair TensorRT-LLM engines with NVIDIA Triton Inference Server:
```bash
# Create the Triton model repository
mkdir -p /triton/models/llama3/1
cp /engines/llama3-8b-engine/* /triton/models/llama3/1/
# Note: Triton also expects a config.pbtxt per model; see the
# tensorrtllm_backend repository for ready-made templates

# Start Triton
docker run --gpus all -p 8000:8000 -p 8001:8001 \
  -v /triton/models:/models \
  nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3 \
  tritonserver --model-repository=/models
```
Triton adds dynamic batching, model versioning, health monitoring, and gRPC/HTTP endpoints. For securing the API, follow the secure AI inference guide. For a simpler deployment path, vLLM production setup is easier to manage. Explore more deployment options in the tutorials section.
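Once Triton is up, you can call the model over HTTP. A minimal client sketch using only the standard library, assuming the model was registered as `llama3` as in the repository layout above and that the backend exposes Triton's generate extension (`text_input`/`max_tokens` request fields, `text_output` in the response):

```python
import json
from urllib import request

TRITON_URL = "http://localhost:8000"  # Triton's default HTTP port

def build_payload(prompt: str, max_tokens: int = 128) -> dict:
    """JSON body for Triton's generate extension."""
    return {"text_input": prompt, "max_tokens": max_tokens}

def generate(model: str, prompt: str, max_tokens: int = 128) -> str:
    """POST to /v2/models/<model>/generate and return the generated text."""
    body = json.dumps(build_payload(prompt, max_tokens)).encode()
    req = request.Request(
        f"{TRITON_URL}/v2/models/{model}/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.loads(resp.read())["text_output"]

# Example call (requires the server from the docker command above):
# print(generate("llama3", "Explain in-flight batching in one sentence."))
```

For higher throughput per connection, switch to the gRPC endpoint on port 8001 with Triton's `tritonclient` package instead.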
Dedicated GPU Servers for TensorRT-LLM
Maximum inference speed on dedicated hardware. Full root access, pre-installed CUDA, UK datacentre.
Browse GPU Servers