Text Generation Inference (TGI) from Hugging Face is an alternative to vLLM. Below is a working setup for the RTX 5060 Ti 16GB on our hosting:
Docker Run
```bash
docker run --gpus all --shm-size 1g -p 8080:80 \
  -v $PWD/data:/data \
  ghcr.io/huggingface/text-generation-inference:2.3 \
  --model-id meta-llama/Llama-3.1-8B-Instruct \
  --dtype float16 \
  --max-input-tokens 30000 \
  --max-total-tokens 32768 \
  --cuda-memory-fraction 0.90
```
Replace `--dtype float16` with `--quantize fp8` for native FP8 on Blackwell.
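Back-of-envelope weight arithmetic shows why the FP8 swap matters on a 16 GiB card. A minimal sketch, assuming the usual ~8.03B parameter count for Llama-3.1-8B:

```python
def weight_gib(params_b: float, bytes_per_param: float) -> float:
    """Approximate model-weight footprint in GiB (params in billions)."""
    return params_b * 1e9 * bytes_per_param / 2**30

fp16_weights = weight_gib(8.03, 2)  # ~15.0 GiB: very tight on a 16 GiB card
fp8_weights = weight_gib(8.03, 1)   # ~7.5 GiB: frees roughly half the card for KV cache
```

At FP16 the weights alone nearly consume the `--cuda-memory-fraction 0.90` budget, which is why the `--quantize fp8` variant is the practical choice for long contexts here.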
Endpoints
- `/generate` – native TGI REST
- `/v1/chat/completions` – OpenAI-compatible (TGI 2.0+)
- `/generate_stream` – SSE streaming
- `/info` – server state
- `/metrics` – Prometheus metrics
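`/generate_stream` frames each token as a Server-Sent Events `data:` line carrying JSON. A minimal parser sketch; the `{"token": {"text": ..., "special": ...}}` shape reflects TGI's streaming response, but treat the exact field set as an assumption and check `/info` for your version:

```python
import json

def parse_sse_tokens(lines):
    """Yield token text from TGI /generate_stream SSE lines.

    Assumes TGI's streaming shape: data: {"token": {"text": ..., "special": ...}}.
    Skips blank keep-alive lines and special tokens (e.g. EOS).
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # blank separators or SSE comments
        payload = json.loads(line[len("data:"):].strip())
        tok = payload.get("token", {})
        if not tok.get("special", False):
            yield tok.get("text", "")

# Example frames as they arrive over the wire:
sample = [
    'data: {"token": {"text": "Hel", "special": false}}',
    "",
    'data: {"token": {"text": "lo", "special": false}}',
]
print("".join(parse_sse_tokens(sample)))  # -> Hello
```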
Config Knobs
| Flag | Effect |
|---|---|
| `--quantize fp8` | Native Blackwell FP8 (best) |
| `--quantize awq` | AWQ INT4 Marlin kernels |
| `--quantize bitsandbytes-nf4` | 4-bit NF4 (slower) |
| `--max-concurrent-requests` | Queue depth – default 128 |
| `--max-batch-prefill-tokens` | Like vLLM's chunked prefill |
| `--kv-cache-dtype fp8` | FP8 KV cache – doubles usable context |
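The "doubles context" claim follows directly from per-token KV-cache arithmetic. A sketch using Llama-3.1-8B's shape (32 layers, 8 KV heads, head dim 128):

```python
def kv_kib_per_token(layers=32, kv_heads=8, head_dim=128, bytes_per=2):
    """KV cache size per token in KiB: K and V vectors per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * bytes_per / 1024

fp16_kv = kv_kib_per_token(bytes_per=2)  # 128 KiB per token
fp8_kv = kv_kib_per_token(bytes_per=1)   # 64 KiB per token
# The same KV memory budget therefore holds twice as many tokens at fp8.
```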
vs vLLM
- vLLM tends to deliver slightly higher throughput on the same hardware
- TGI has a Rust router – lower per-request overhead, which helps with many small requests
- TGI's metrics and observability are more polished out of the box
- TGI's Docker-first distribution is a plus for ops teams already running containers
- Both expose OpenAI-compatible endpoints
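Because both servers expose `/v1/chat/completions`, one client-side payload works against either. A sketch of the request body; the model ID matches the docker command above, and the host/port are assumptions from its `-p 8080:80` mapping:

```python
import json

def chat_request(model, user_msg, stream=False):
    """Build an OpenAI-style chat payload accepted by both TGI and vLLM."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_msg}],
        "stream": stream,
    }

body = json.dumps(chat_request("meta-llama/Llama-3.1-8B-Instruct", "Hello"))
# POST this body to http://localhost:8080/v1/chat/completions
# with any HTTP client or the official openai SDK (base_url override).
```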
For dedicated serving on this card, vLLM is usually the first choice. TGI is a strong pick when you want the Hugging Face ecosystem, containerised deployment, or a more polished metrics story.
Order the RTX 5060 Ti 16GB. See also: vLLM setup, Ollama setup, Docker CUDA setup, FP8 Llama.