Tutorials

TensorRT-LLM on Dedicated GPU: Optimisation Guide

Deploy TensorRT-LLM on a dedicated GPU server for maximum inference speed. Covers engine building, INT4/INT8 quantisation, Triton Inference Server integration, and throughput comparison with vLLM.

What Is TensorRT-LLM and Why Use It

TensorRT-LLM is NVIDIA’s high-performance inference library that compiles LLMs into optimised TensorRT engines. On a dedicated GPU server it delivers some of the fastest token generation available by fusing operations, applying kernel-level optimisations, and using in-flight batching. The trade-off is a more complex setup than vLLM or Ollama, but the performance gains are substantial for latency-critical workloads.

TensorRT-LLM is the right choice when you need every last token per second, when you are serving thousands of concurrent users, or when your fintech or voice agent application has strict latency SLAs.

GPU Requirements and Recommendations

| GPU | VRAM | TRT-LLM Support | Best For |
|---|---|---|---|
| RTX 4060 | 8 GB | Yes | 7B INT4 engines |
| RTX 3090 | 24 GB | Yes | 7B-13B engines, 34B INT4 |
| RTX 5080 | 16 GB | Yes (Blackwell) | 7B FP16, FP4 engines |
| RTX 5090 | 32 GB | Yes (Blackwell) | 13B FP16, 34B+ INT4 |

TensorRT-LLM requires building a compiled engine specific to your GPU architecture (Ampere, Ada Lovelace, Blackwell). Engines are not portable between architectures.
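Whether an engine will load comes down to the GPU's CUDA compute capability. A minimal sketch of the mapping for the cards above (`engine_build_target` is a hypothetical helper; capabilities taken from NVIDIA's published tables, and you can query yours with `nvidia-smi --query-gpu=compute_cap --format=csv`):

```python
# Map CUDA compute capability -> architecture the engine is tied to.
ARCH_BY_COMPUTE_CAP = {
    (8, 6): "Ampere",        # RTX 3090
    (8, 9): "Ada Lovelace",  # RTX 4060
    (12, 0): "Blackwell",    # RTX 5080 / 5090
}

def engine_build_target(major: int, minor: int) -> str:
    """Return the architecture an engine built on this GPU is tied to."""
    try:
        return ARCH_BY_COMPUTE_CAP[(major, minor)]
    except KeyError:
        raise ValueError(f"unmapped compute capability {major}.{minor}")
```

An engine built on a (12, 0) Blackwell card will not load on an (8, 6) Ampere card, so plan to rebuild whenever you change hardware tiers.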

Installation and Engine Building

# Pull the official TensorRT-LLM Docker image
docker pull nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3

# Clone TensorRT-LLM examples
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM/examples/llama

# Convert Llama 3 8B to TRT-LLM checkpoint
python convert_checkpoint.py \
  --model_dir /models/Llama-3-8B-Instruct \
  --output_dir /engines/llama3-8b-ckpt \
  --dtype float16

# Build the TensorRT engine
trtllm-build \
  --checkpoint_dir /engines/llama3-8b-ckpt \
  --output_dir /engines/llama3-8b-engine \
  --max_batch_size 16 \
  --max_input_len 2048 \
  --max_seq_len 4096 \
  --gemm_plugin float16

Engine building takes 10-30 minutes depending on model size and GPU. The resulting engine is a static binary optimised for your specific hardware. For basic CUDA setup, see the CUDA installation guide.
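The `--max_batch_size` and `--max_seq_len` flags above bound the KV cache the engine reserves, so it pays to estimate that before building. A rough sizing sketch, assuming Llama 3 8B's published config (32 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16 cache):

```python
def kv_cache_bytes(batch: int, seq_len: int, layers: int = 32,
                   kv_heads: int = 8, head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:
    """Worst-case KV cache size in bytes for a dense-attention engine."""
    # Each token stores K and V per layer: kv_heads * head_dim values each.
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    return batch * seq_len * per_token

# The build above (batch 16, seq 4096) reserves about 8 GiB of KV cache,
# on top of ~16 GiB of FP16 weights for an 8B model.
print(kv_cache_bytes(16, 4096) / 2**30)
```

That total sits right at the edge of an RTX 3090's 24 GB, which is why larger batch sizes usually require quantised weights.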

INT4 and INT8 Quantisation

# Build an INT4 AWQ engine for larger models
python convert_checkpoint.py \
  --model_dir /models/Llama-3-13B-Instruct \
  --output_dir /engines/llama3-13b-int4-ckpt \
  --dtype float16 \
  --use_weight_only \
  --weight_only_precision int4_awq

trtllm-build \
  --checkpoint_dir /engines/llama3-13b-int4-ckpt \
  --output_dir /engines/llama3-13b-int4-engine \
  --max_batch_size 8 \
  --max_input_len 2048 \
  --max_seq_len 4096

TensorRT-LLM supports INT4 AWQ, INT4 GPTQ, INT8 SmoothQuant, and FP8 quantisation. AWQ generally provides the best quality-speed trade-off. For a detailed quantisation comparison, see the GPTQ vs AWQ vs GGUF guide.
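To see why INT4 matters on a 24 GB card, a back-of-the-envelope weight-footprint estimate (weights only; activations and KV cache come on top):

```python
def weight_footprint_gib(params_billions: float, bits: int) -> float:
    """Approximate on-GPU weight size in GiB for a given precision."""
    return params_billions * 1e9 * bits / 8 / 2**30

# 13B at FP16 is ~24 GiB (no headroom on an RTX 3090);
# the same model at INT4 is ~6 GiB, leaving room for KV cache.
print(weight_footprint_gib(13, 16), weight_footprint_gib(13, 4))
```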

Throughput Comparison: TRT-LLM vs vLLM

| Model | GPU | TRT-LLM (t/s) | vLLM (t/s) | TRT-LLM Gain |
|---|---|---|---|---|
| Llama 3 8B FP16 | RTX 3090 | ~72 | ~55 | +31% |
| Llama 3 8B INT4 | RTX 3090 | ~108 | ~82 | +32% |
| Llama 3 8B FP16 | RTX 5090 | ~148 | ~115 | +29% |
| Llama 3 13B INT4 | RTX 3090 | ~50 | ~38 | +32% |
| Mistral 7B FP16 | RTX 5080 | ~118 | ~92 | +28% |

In these tests, TensorRT-LLM consistently delivers roughly 28-32% faster inference than vLLM for compiled engines, and the gap widens further for batched workloads. Compare costs across GPU tiers with the LLM cost calculator.
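The gain column is just the relative throughput difference between the two engines; a quick sanity check of the table's figures:

```python
def gain_pct(trtllm_tps: float, vllm_tps: float) -> int:
    """Percentage throughput gain of TRT-LLM over vLLM, rounded."""
    return round((trtllm_tps - vllm_tps) / vllm_tps * 100)

# Llama 3 8B FP16 on RTX 3090: (72 - 55) / 55 -> +31%
print(gain_pct(72, 55))
```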

Triton Inference Server for Production

For production deployment, pair TensorRT-LLM engines with NVIDIA Triton Inference Server:

# Create Triton model repository
mkdir -p /triton/models/llama3/1
cp /engines/llama3-8b-engine/* /triton/models/llama3/1/
# Each model directory also needs a config.pbtxt; the tensorrtllm_backend
# repository provides templates and a fill_template.py script for this

# Start Triton
docker run --gpus all -p 8000:8000 -p 8001:8001 \
  -v /triton/models:/models \
  nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3 \
  tritonserver --model-repository=/models

Triton adds dynamic batching, model versioning, health monitoring, and gRPC/HTTP endpoints. For securing the API, follow the secure AI inference guide. If you prefer a simpler deployment path, a vLLM production setup is easier to manage. Explore more deployment options in the tutorials section.
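Once Triton is up, its generate extension exposes a plain HTTP endpoint on port 8000. A minimal client sketch, assuming the model is registered as `llama3` (the repository directory name above) and that the deployment wires up the conventional `text_input`/`max_tokens`/`text_output` tensors:

```python
import json
import urllib.request

def build_generate_request(base_url: str, model: str, prompt: str,
                           max_tokens: int = 128) -> tuple[str, bytes]:
    """Build (url, body) for Triton's HTTP generate extension."""
    url = f"{base_url}/v2/models/{model}/generate"
    body = json.dumps({"text_input": prompt,
                       "max_tokens": max_tokens}).encode()
    return url, body

# Usage against the server started above:
# url, body = build_generate_request("http://localhost:8000", "llama3", "Hello")
# req = urllib.request.Request(url, data=body,
#                              headers={"Content-Type": "application/json"})
# print(json.load(urllib.request.urlopen(req))["text_output"])
```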

Dedicated GPU Servers for TensorRT-LLM

Maximum inference speed on dedicated hardware. Full root access, pre-installed CUDA, UK datacentre.

Browse GPU Servers
