Computer vision workloads – object detection, segmentation, image embedding, multi-camera analytics – scale almost linearly with GPU throughput, and the RTX 5060 Ti 16GB hits a sweet spot for this class of workload. Blackwell’s 5th-gen tensor cores, 448 GB/s of memory bandwidth and native INT8/FP8 support give it roughly 1,450 FPS on YOLOv8 nano via TensorRT, with enough VRAM for 30+ concurrent HD camera streams. Here’s what that looks like in production on a Gigagpu UK dedicated node.
Contents
- YOLOv8 throughput
- Multi-stream capacity
- CLIP and image embeddings
- Segmentation and pose
- Deployment notes
- Choosing a stack
YOLOv8 throughput
YOLOv8 is still the default real-time detector for most production CV systems. Numbers below are for 640×640 input on a single GPU, with FP16 weights for the PyTorch and ONNX columns, batch=1 for the latency-oriented runs and batch=8 for throughput.
| Model | Params | PyTorch FPS | ONNX FPS | TensorRT FP16 FPS | TensorRT INT8 FPS |
|---|---|---|---|---|---|
| YOLOv8n | 3.2M | 720 | 940 | 1,450 | 1,820 |
| YOLOv8s | 11.2M | 510 | 680 | 1,050 | 1,340 |
| YOLOv8m | 25.9M | 310 | 410 | 620 | 790 |
| YOLOv8l | 43.7M | 210 | 280 | 425 | 540 |
| YOLOv8x | 68.2M | 135 | 180 | 275 | 350 |
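If you want to reproduce the methodology behind these numbers, it boils down to a warm-up pass followed by timed iterations at a fixed batch size. A minimal, framework-agnostic harness (the `fake_infer` stand-in is ours for illustration; swap in your real PyTorch, ONNX Runtime or TensorRT call):

```python
import time

def benchmark(run_model, batch_size, n_warmup=10, n_iters=100):
    """Measure mean per-batch latency (ms) and frames/s for a callable.

    run_model stands in for the real inference call; here it only
    receives the batch size.
    """
    for _ in range(n_warmup):       # warm-up: JIT, cuDNN autotune, caches
        run_model(batch_size)
    start = time.perf_counter()
    for _ in range(n_iters):
        run_model(batch_size)
    elapsed = time.perf_counter() - start
    latency_ms = elapsed / n_iters * 1000.0
    fps = n_iters * batch_size / elapsed
    return latency_ms, fps

# Dummy "model" so the harness is self-contained; replace with real inference.
def fake_infer(batch_size):
    time.sleep(0.001 * batch_size)  # pretend 1 ms per image

latency, fps = benchmark(fake_infer, batch_size=8, n_iters=20)
```

Run once with `batch_size=1` for the latency column and once with `batch_size=8` for throughput; for GPU work, remember to synchronise the device before reading the clock.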
Multi-stream capacity
In practice, a GPU serving CCTV/analytics traffic is bound by concurrent decoded frames rather than model FPS alone. NVDEC on Blackwell handles 8 simultaneous 1080p30 streams per engine, and with batched inference the card comfortably covers 30+ HD cameras for real-time detection.
| Model | Per-frame ms | 1080p30 streams | 720p25 streams | VRAM used |
|---|---|---|---|---|
| YOLOv8n TRT INT8 | 0.55 | 60+ | 100+ | 0.4 GB |
| YOLOv8s TRT INT8 | 0.75 | 44 | 75 | 0.6 GB |
| YOLOv8m TRT INT8 | 1.27 | 26 | 44 | 0.9 GB |
| YOLOv8l TRT INT8 | 1.85 | 18 | 30 | 1.3 GB |
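The 1080p30 column follows directly from per-frame latency. A small helper that reproduces it; note this is an upper bound from inference latency alone and ignores decode and pre/post-processing cost:

```python
import math

def max_streams(per_frame_ms, stream_fps):
    """Upper bound on concurrent streams from inference latency alone:
    total frames/s the GPU can process divided by frames/s per stream."""
    gpu_fps = 1000.0 / per_frame_ms
    return math.floor(gpu_fps / stream_fps)

# Per-frame latencies from the table above (TensorRT INT8, ms)
for name, ms in {"YOLOv8n": 0.55, "YOLOv8s": 0.75,
                 "YOLOv8m": 1.27, "YOLOv8l": 1.85}.items():
    print(name, max_streams(ms, 30))   # 1080p30 column: 60, 44, 26, 18
```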
A 30-camera 1080p25 deployment with YOLOv8s INT8 demands 750 frames/s, roughly half of the card’s ~1,340 FPS YOLOv8s INT8 throughput, which leaves headroom for downstream tasks: tracking (ByteTrack, BoT-SORT), ReID embeddings, license-plate OCR.
CLIP and image embeddings
For visual search, duplicate detection and content moderation, CLIP-based embeddings are the workhorse. On the 5060 Ti:
| Model | Precision | Images/s (BS=64) | Dimension |
|---|---|---|---|
| CLIP ViT-B/32 | FP16 | 4,200 | 512 |
| CLIP ViT-B/16 | FP16 | 2,100 | 512 |
| CLIP ViT-L/14 | FP16 | 780 | 768 |
| SigLIP-SO400M | FP16 | 650 | 1,152 |
| DINOv2 ViT-B/14 | FP16 | 1,900 | 768 |
4,200 images/second on ViT-B/32 translates to 15 million images/hour, which is enough to embed Unsplash-scale datasets in an afternoon on a single card.
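Sizing an embedding job from these figures is straightforward arithmetic; a minimal sketch (the 50M corpus size is an illustrative assumption, not a dataset from the tables above):

```python
def embed_eta_hours(num_images, images_per_sec):
    """Wall-clock hours to embed a corpus at a sustained throughput."""
    return num_images / images_per_sec / 3600.0

images_per_hour = 4200 * 3600              # 15,120,000: the ~15M/hour figure
eta = embed_eta_hours(50_000_000, 4200)    # hypothetical 50M-image corpus
```

Sustained throughput in practice also depends on storage and decode keeping the GPU fed, so treat the result as a lower bound on wall-clock time.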
Segmentation and pose
- YOLOv8n-seg – 540 FPS PyTorch, 980 FPS TensorRT FP16.
- YOLOv8m-seg – 230 FPS PyTorch, 420 FPS TensorRT.
- SAM 2 (Hiera-B+) – 42 FPS on 1024×1024 mask prediction.
- YOLOv8n-pose – 680 FPS PyTorch, 1,250 FPS TensorRT.
- RT-DETR-L – 195 FPS TensorRT at 640×640.
Deployment notes
```bash
# Export YOLOv8s to TensorRT INT8 (calibrates on the dataset given via data=)
yolo export model=yolov8s.pt format=engine device=0 \
    half=False int8=True data=coco.yaml workspace=4

# Triton with TensorRT backend, dynamic batching
tritonserver --model-repository=/models --strict-model-config=false \
    --log-verbose=1
```
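With `--strict-model-config=false`, Triton will auto-complete model configs, but an explicit `config.pbtxt` gives control over batching behaviour. A minimal sketch for the engine exported above, assuming the ultralytics default tensor names (`images` in, `output0` out) and a COCO detection head; verify both against your actual export:

```protobuf
name: "yolov8s_trt"
platform: "tensorrt_plan"
max_batch_size: 8
input [
  {
    name: "images"            # ultralytics default input name (assumption)
    data_type: TYPE_FP32
    dims: [ 3, 640, 640 ]
  }
]
output [
  {
    name: "output0"           # default detection output (assumption)
    data_type: TYPE_FP32
    dims: [ 84, 8400 ]        # 4 box + 80 COCO classes, 8400 anchors
  }
]
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 500
}
instance_group [ { kind: KIND_GPU, count: 1 } ]
```

A short `max_queue_delay_microseconds` keeps the batcher from adding visible latency while still coalescing frames from concurrent streams.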
Choosing a stack
- < 10 HD streams -> YOLOv8s/m PyTorch is fine, easiest to operate.
- 10-30 HD streams -> YOLOv8s TensorRT FP16/INT8 + Triton dynamic batching.
- 30+ streams or < 1 ms latency -> YOLOv8n INT8, batched.
- Visual search -> CLIP ViT-B/16 or SigLIP in FP16.
- Medical / industrial segmentation -> SAM2 + domain fine-tunes.
Power your CV pipeline on a single Blackwell GPU
1,400+ FPS, 30 HD cameras, 15M images/hour embedded. UK dedicated hosting.
Order the RTX 5060 Ti 16GB

See also: YOLOv8 benchmark, PaddleOCR benchmark, embedding throughput, Qwen VL benchmark, Llama 3.2 Vision benchmark.