
RTX 5060 Ti 16GB TGI Setup

Hugging Face TGI on Blackwell 16GB - Docker run command, config for FP8 and quantised serving.

Text Generation Inference (TGI) from Hugging Face is a production-grade alternative to vLLM for serving LLMs. Here is the setup on the RTX 5060 Ti 16GB at our hosting:


Docker Run

# HF_TOKEN lets TGI pull gated models such as Llama 3.1
docker run --gpus all --shm-size 1g -p 8080:80 \
  -e HF_TOKEN=$HF_TOKEN \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:2.3 \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --dtype float16 \
  --max-input-tokens 30000 \
  --max-total-tokens 32768 \
  --cuda-memory-fraction 0.90

Replace --dtype float16 with --quantize fp8 for native FP8 on Blackwell.
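Once the container has finished loading weights, a quick smoke test (assuming the port mapping above, so localhost:8080 on the host) confirms the server is ready:

```shell
# Wait for the launcher to log that the model is connected, then:
curl -s http://localhost:8080/info     # model_id, dtype, max_*_tokens
curl -s http://localhost:8080/health   # HTTP 200 once the model is ready
```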

Endpoints

  • /generate – native TGI REST
  • /v1/chat/completions – OpenAI-compatible (TGI 2.0+)
  • /generate_stream – SSE streaming
  • /info – server state
  • /metrics – Prometheus metrics
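As a sketch, the OpenAI-compatible endpoint can be called with nothing but the Python standard library. The prompt is illustrative, and since TGI serves a single model per container, the model field is largely ignored:

```python
import json
import urllib.request


def build_chat_request(prompt, model="tgi", max_tokens=256, stream=False):
    """Build an OpenAI-style payload for TGI's /v1/chat/completions."""
    return {
        "model": model,  # name is mostly ignored; TGI serves one model
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "stream": stream,
    }


def chat(prompt, base_url="http://localhost:8080"):
    """POST a chat request to a running TGI container (port 8080 as mapped above)."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]


# Usage (with the container from above running):
#   print(chat("Name three uses of FP8 quantisation."))
```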

Config Knobs

Flag                          Effect
--quantize fp8                Native Blackwell FP8 (best)
--quantize awq                AWQ INT4 Marlin kernels
--quantize bitsandbytes-nf4   4-bit NF4 (slower)
--max-concurrent-requests     Queue depth – default 128
--max-batch-prefill-tokens    Like vLLM’s chunked prefill
--kv-cache-dtype fp8          FP8 KV cache – doubles effective context
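Combining the knobs above, an all-FP8 serving configuration might look like this. Flag values mirror the table; the exact accepted values for --kv-cache-dtype can vary between TGI releases, so treat this as a sketch rather than a drop-in command:

```shell
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:2.3 \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --quantize fp8 \
  --kv-cache-dtype fp8 \
  --max-concurrent-requests 128 \
  --max-batch-prefill-tokens 4096 \
  --max-total-tokens 32768
```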

vs vLLM

  • vLLM tends to have slightly higher throughput on the same hardware
  • TGI has a Rust-based router – lower overhead for many small requests
  • TGI’s metrics and observability are more polished out of the box
  • Default Docker image is a plus for ops teams already using containers
  • Both support OpenAI-compatible endpoints

For dedicated serving on this card, vLLM is usually the first choice. TGI is a strong pick when you want the HF ecosystem, containerised deployment, or the polished metrics story.


See also: vLLM setup, Ollama setup, Docker CUDA setup, FP8 Llama.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.

Browse GPU Servers
