Tutorials

TensorRT-LLM on Dedicated GPU: Optimisation Guide

Deploy TensorRT-LLM on a dedicated GPU server for maximum inference speed. Covers engine building, INT4/INT8 quantisation, Triton Inference Server integration, and throughput comparison with vLLM.

What Is TensorRT-LLM and Why Use It

TensorRT-LLM is NVIDIA’s high-performance inference library that compiles LLMs into optimised TensorRT engines. On a dedicated GPU server it delivers some of the fastest token generation available by fusing operations, applying kernel-level optimisations, and using in-flight batching. The trade-off is a more complex setup than vLLM or Ollama, but the performance gains are substantial for latency-critical workloads.

TensorRT-LLM is the right choice when you need every last token per second, when you are serving thousands of concurrent users, or when your fintech or voice agent application has strict latency SLAs.

GPU Requirements and Recommendations

| GPU | VRAM | TRT-LLM Support | Best For |
|---|---|---|---|
| RTX 4060 | 8 GB | Yes | 7B INT4 engines |
| RTX 3090 | 24 GB | Yes | 7B-13B engines, 34B INT4 |
| RTX 5080 | 16 GB | Yes (Blackwell) | 7B FP16, FP4 engines |
| RTX 5090 | 32 GB | Yes (Blackwell) | 13B FP16, 34B+ INT4 |

TensorRT-LLM requires building a compiled engine specific to your GPU architecture (Ampere, Ada Lovelace, Blackwell). Engines are not portable between architectures.
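Whether an engine will load comes down to the GPU's CUDA compute capability. A minimal sketch of the mapping for the cards above (`engine_build_target` is a hypothetical helper; capabilities taken from NVIDIA's published tables, and you can query yours with `nvidia-smi --query-gpu=compute_cap --format=csv`):

```python
# Map CUDA compute capability -> architecture the engine is tied to.
ARCH_BY_COMPUTE_CAP = {
    (8, 6): "Ampere",        # RTX 3090
    (8, 9): "Ada Lovelace",  # RTX 4060
    (12, 0): "Blackwell",    # RTX 5080 / 5090
}

def engine_build_target(major: int, minor: int) -> str:
    """Return the architecture an engine built on this GPU is tied to."""
    try:
        return ARCH_BY_COMPUTE_CAP[(major, minor)]
    except KeyError:
        raise ValueError(f"unmapped compute capability {major}.{minor}")
```

An engine built on a (12, 0) Blackwell card will not load on an (8, 6) Ampere card, so plan to rebuild whenever you change hardware tiers.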

Installation and Engine Building

# Pull the official TensorRT-LLM Docker image
docker pull nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3

# Clone TensorRT-LLM examples
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM/examples/llama

# Convert Llama 3 8B to TRT-LLM checkpoint
python convert_checkpoint.py \
  --model_dir /models/Llama-3-8B-Instruct \
  --output_dir /engines/llama3-8b-ckpt \
  --dtype float16

# Build the TensorRT engine
trtllm-build \
  --checkpoint_dir /engines/llama3-8b-ckpt \
  --output_dir /engines/llama3-8b-engine \
  --max_batch_size 16 \
  --max_input_len 2048 \
  --max_seq_len 4096 \
  --gemm_plugin float16

Engine building takes 10-30 minutes depending on model size and GPU. The resulting engine is a static binary optimised for your specific hardware. For basic CUDA setup, see the CUDA installation guide.
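The `--max_batch_size` and `--max_seq_len` flags above bound the KV cache the engine reserves, so it pays to estimate that before building. A rough sizing sketch, assuming Llama 3 8B's published config (32 layers, 8 KV heads via grouped-query attention, head dimension 128, FP16 cache):

```python
def kv_cache_bytes(batch: int, seq_len: int, layers: int = 32,
                   kv_heads: int = 8, head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:
    """Worst-case KV cache size in bytes for a dense-attention engine."""
    # Each token stores K and V per layer: kv_heads * head_dim values each.
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
    return batch * seq_len * per_token

# The build above (batch 16, seq 4096) reserves about 8 GiB of KV cache,
# on top of ~16 GiB of FP16 weights for an 8B model.
print(kv_cache_bytes(16, 4096) / 2**30)
```

That total sits right at the edge of an RTX 3090's 24 GB, which is why larger batch sizes usually require quantised weights.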

INT4 and INT8 Quantisation

# Build an INT4 AWQ engine for larger models
python convert_checkpoint.py \
  --model_dir /models/Llama-3-13B-Instruct \
  --output_dir /engines/llama3-13b-int4-ckpt \
  --dtype float16 \
  --use_weight_only \
  --weight_only_precision int4_awq

trtllm-build \
  --checkpoint_dir /engines/llama3-13b-int4-ckpt \
  --output_dir /engines/llama3-13b-int4-engine \
  --max_batch_size 8 \
  --max_input_len 2048 \
  --max_seq_len 4096

TensorRT-LLM supports INT4 AWQ, INT4 GPTQ, INT8 SmoothQuant, and FP8 quantisation. AWQ generally provides the best quality-speed trade-off. For a detailed quantisation comparison, see the GPTQ vs AWQ vs GGUF guide.
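To see why INT4 matters on a 24 GB card, a back-of-the-envelope weight-footprint estimate (weights only; activations and KV cache come on top):

```python
def weight_footprint_gib(params_billions: float, bits: int) -> float:
    """Approximate on-GPU weight size in GiB for a given precision."""
    return params_billions * 1e9 * bits / 8 / 2**30

# 13B at FP16 is ~24 GiB (no headroom on an RTX 3090);
# the same model at INT4 is ~6 GiB, leaving room for KV cache.
print(weight_footprint_gib(13, 16), weight_footprint_gib(13, 4))
```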

Throughput Comparison: TRT-LLM vs vLLM

| Model | GPU | TRT-LLM (t/s) | vLLM (t/s) | TRT-LLM Gain |
|---|---|---|---|---|
| Llama 3 8B FP16 | RTX 3090 | ~72 | ~55 | +31% |
| Llama 3 8B INT4 | RTX 3090 | ~108 | ~82 | +32% |
| Llama 3 8B FP16 | RTX 5090 | ~148 | ~115 | +29% |
| Llama 3 13B INT4 | RTX 3090 | ~50 | ~38 | +32% |
| Mistral 7B FP16 | RTX 5080 | ~118 | ~92 | +28% |

In these tests, TensorRT-LLM consistently delivers roughly 28-32% faster inference than vLLM for compiled engines, and the gap widens further for batched workloads. Compare costs across GPU tiers with the LLM cost calculator.
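The gain column is just the relative throughput difference between the two engines; a quick sanity check of the table's figures:

```python
def gain_pct(trtllm_tps: float, vllm_tps: float) -> int:
    """Percentage throughput gain of TRT-LLM over vLLM, rounded."""
    return round((trtllm_tps - vllm_tps) / vllm_tps * 100)

# Llama 3 8B FP16 on RTX 3090: (72 - 55) / 55 -> +31%
print(gain_pct(72, 55))
```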

Triton Inference Server for Production

For production deployment, pair TensorRT-LLM engines with NVIDIA Triton Inference Server:

# Create Triton model repository
mkdir -p /triton/models/llama3/1
cp /engines/llama3-8b-engine/* /triton/models/llama3/1/
# Each model directory also needs a config.pbtxt; the tensorrtllm_backend
# repository provides templates and a fill_template.py script for this

# Start Triton
docker run --gpus all -p 8000:8000 -p 8001:8001 \
  -v /triton/models:/models \
  nvcr.io/nvidia/tritonserver:24.12-trtllm-python-py3 \
  tritonserver --model-repository=/models

Triton adds dynamic batching, model versioning, health monitoring, and gRPC/HTTP endpoints. For securing the API, follow the secure AI inference guide. If you prefer a simpler deployment path, a vLLM production setup is easier to manage. Explore more deployment options in the tutorials section.
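Once Triton is up, its generate extension exposes a plain HTTP endpoint on port 8000. A minimal client sketch, assuming the model is registered as `llama3` (the repository directory name above) and that the deployment wires up the conventional `text_input`/`max_tokens`/`text_output` tensors:

```python
import json
import urllib.request

def build_generate_request(base_url: str, model: str, prompt: str,
                           max_tokens: int = 128) -> tuple[str, bytes]:
    """Build (url, body) for Triton's HTTP generate extension."""
    url = f"{base_url}/v2/models/{model}/generate"
    body = json.dumps({"text_input": prompt,
                       "max_tokens": max_tokens}).encode()
    return url, body

# Usage against the server started above:
# url, body = build_generate_request("http://localhost:8000", "llama3", "Hello")
# req = urllib.request.Request(url, data=body,
#                              headers={"Content-Type": "application/json"})
# print(json.load(urllib.request.urlopen(req))["text_output"])
```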

Dedicated GPU Servers for TensorRT-LLM

Maximum inference speed on dedicated hardware. Full root access, pre-installed CUDA, UK datacentre.

Browse GPU Servers
