The RTX 5060 Ti 16GB is a standout value card for real-time object detection. With Blackwell’s updated tensor cores and 16 GB of GDDR7, it can hold all YOLOv8 and YOLOv11 variants simultaneously and, with TensorRT FP16 or INT8, push 30+ concurrent HD camera streams on a single card. This guide quantifies native PyTorch FPS, TensorRT speedups, and multi-stream capacity on our UK dedicated GPU hosting.
Contents
- YOLO model family
- Native PyTorch FPS
- TensorRT FP16/INT8 gains
- Multi-stream capacity
- VRAM budget
- Deployment recipe
YOLO model family
YOLOv8 (Ultralytics, 2023) and YOLOv11 (Ultralytics, 2024) share a very similar compute profile per variant, with v11 delivering roughly 0.5-2 points higher mAP50-95 at comparable FLOPs. Benchmarks here cover both generations. Model sizes (640×640 input):
| Variant | Params | FLOPs (G) | YOLOv8 mAP50-95 | YOLOv11 mAP50-95 |
|---|---|---|---|---|
| nano (n) | 3.2M | 8.7 | 37.3 | 39.5 |
| small (s) | 11.2M | 28.6 | 44.9 | 47.0 |
| medium (m) | 25.9M | 78.9 | 50.2 | 51.5 |
| large (l) | 43.7M | 165.2 | 52.9 | 53.4 |
| extra (x) | 68.2M | 257.8 | 53.9 | 54.7 |
Native PyTorch FPS on the 5060 Ti
Measured with Ultralytics 8.3, PyTorch 2.4, CUDA 12.5, 640×640 input, batch=1, on the RTX 5060 Ti 16GB:
| Variant | PyTorch FP32 FPS | PyTorch FP16 FPS | Latency (FP16) |
|---|---|---|---|
| YOLOv8n / v11n | 510 | 720 | 1.4 ms |
| YOLOv8s / v11s | 380 | 520 | 1.9 ms |
| YOLOv8m / v11m | 230 | 320 | 3.1 ms |
| YOLOv8l / v11l | 150 | 210 | 4.8 ms |
| YOLOv8x / v11x | 100 | 145 | 6.9 ms |
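At batch=1, per-frame latency is simply the reciprocal of throughput, so the two FP16 columns above can be cross-checked directly. A minimal sketch, using the FPS figures from the table:

```python
def latency_ms(fps: float) -> float:
    """Convert batch=1 throughput (frames/s) to per-frame latency in ms."""
    return 1000.0 / fps

# FP16 throughput per variant, taken from the table above
fp16_fps = {"n": 720, "s": 520, "m": 320, "l": 210, "x": 145}

for variant, fps in fp16_fps.items():
    print(f"YOLOv8{variant}/v11{variant}: {latency_ms(fps):.1f} ms")
# e.g. 720 FPS -> 1.4 ms, 145 FPS -> 6.9 ms, matching the latency column
```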
TensorRT FP16 and INT8 gains
TensorRT 10 roughly doubles throughput versus native PyTorch FP16 through kernel fusion and optimal layer scheduling. INT8 quantisation (PTQ with 1,000 COCO images for calibration) adds another 55-70% on top with a mAP50-95 drop of <0.8 points:
| Variant | TRT FP16 FPS | TRT INT8 FPS | INT8 mAP drop |
|---|---|---|---|
| YOLOv8n / v11n | 1,400 | 2,200 | -0.3 |
| YOLOv8s / v11s | 1,000 | 1,650 | -0.4 |
| YOLOv8m / v11m | 620 | 1,050 | -0.6 |
| YOLOv8l / v11l | 410 | 680 | -0.7 |
| YOLOv8x / v11x | 280 | 460 | -0.8 |
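The INT8 uplift quoted above falls straight out of the two FPS columns; the ratios below are derived from the table, not measured separately:

```python
# TensorRT throughput per variant, from the table above
fp16 = {"n": 1400, "s": 1000, "m": 620, "l": 410, "x": 280}
int8 = {"n": 2200, "s": 1650, "m": 1050, "l": 680, "x": 460}

for v in fp16:
    gain = int8[v] / fp16[v] - 1.0  # fractional speedup from INT8 over FP16
    print(f"YOLOv8{v}: +{gain:.0%} from INT8")
# every variant lands in the 55-70% band
```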
For a deeper dive see our YOLOv8 benchmark post.
Multi-stream capacity
Assume a typical CCTV feed at 1080p, 25 FPS. Stream capacity = engine throughput ÷ per-stream frame rate. Using YOLOv8m TensorRT FP16 (620 FPS), that gives 24 streams at the full 25 FPS, or 31 streams if you frame-skip each feed down to 20 FPS. Switching to INT8 (1,050 FPS) lifts a single card to 42 streams.
| Variant + Precision | Streams @ 25 FPS | Streams @ 15 FPS | Good for |
|---|---|---|---|
| YOLOv8n TRT INT8 | 88 | 146 | Edge, retail analytics |
| YOLOv8s TRT FP16 | 40 | 66 | Small retailer, car park |
| YOLOv8m TRT FP16 | 24 | 41 | Typical CCTV aggregator |
| YOLOv8m TRT INT8 | 42 | 70 | CCTV VMS core |
| YOLOv8l TRT FP16 | 16 | 27 | Higher accuracy CCTV |
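Every stream count in the table above comes from one division, rounded down. A minimal sketch of that sizing calculation:

```python
import math

def stream_capacity(model_fps: float, stream_fps: float) -> int:
    """How many camera feeds one engine can serve at a given per-stream frame rate."""
    return math.floor(model_fps / stream_fps)

# YOLOv8m TensorRT figures from the benchmark table
print(stream_capacity(620, 25))   # FP16: 24 streams at full 25 FPS
print(stream_capacity(620, 15))   # FP16: 41 streams at 15 FPS
print(stream_capacity(1050, 25))  # INT8: 42 streams at 25 FPS
```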
VRAM budget
All YOLO variants are tiny relative to a 16 GB pool. A YOLOv8x TensorRT engine uses about 600 MB including workspace. You can keep all five variants resident at once and route frames between them, and still have roughly 13 GB free.
| Variant | TRT FP16 engine size | Runtime VRAM (bs=1) |
|---|---|---|
| YOLOv8n | 6 MB | ~160 MB |
| YOLOv8s | 22 MB | ~220 MB |
| YOLOv8m | 52 MB | ~320 MB |
| YOLOv8l | 87 MB | ~450 MB |
| YOLOv8x | 136 MB | ~600 MB |
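Summing the runtime column shows how little of the 16 GB pool the whole family consumes. In the sketch below, the ~1 GB allowance for the CUDA context and TensorRT libraries is an assumption, not a measured figure:

```python
# Runtime VRAM per resident engine (MB), from the table above
runtime_mb = {"n": 160, "s": 220, "m": 320, "l": 450, "x": 600}
cuda_context_mb = 1024  # assumed overhead: CUDA context + cuDNN/TensorRT libs

used_mb = sum(runtime_mb.values()) + cuda_context_mb
free_gb = (16 * 1024 - used_mb) / 1024
print(f"All five engines resident: {used_mb} MB used, ~{free_gb:.1f} GB free")
```

The engines alone total 1,750 MB, which is why routing between all five variants on one card is comfortable.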
Deployment recipe
Export with `yolo export model=yolov8m.pt format=engine half=True`, then wrap the engine with DeepStream 7 or a Triton ensemble. For broader computer-vision hosting context see the computer vision guide.
30+ HD CCTV streams on a single card
YOLOv8m TensorRT FP16, 16 GB GDDR7, 180 W. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: YOLOv8 benchmark, computer vision hosting, max model size, upgrade to RTX 5090.