VRAM Check: YOLOv8 on 8 GB
YOLOv8 is Ultralytics' real-time object detection model family. The RTX 4060 with 8 GB of VRAM is an excellent budget choice for running it on a dedicated GPU server, and every YOLOv8 model size fits comfortably:
| Model | Parameters | VRAM (FP16, 640×640) | VRAM (FP16, 1280×1280) | Fits RTX 4060? |
|---|---|---|---|---|
| YOLOv8n (nano) | 3.2M | ~0.5 GB | ~1.2 GB | Yes |
| YOLOv8s (small) | 11.2M | ~0.8 GB | ~1.8 GB | Yes |
| YOLOv8m (medium) | 25.9M | ~1.2 GB | ~2.8 GB | Yes |
| YOLOv8l (large) | 43.7M | ~1.8 GB | ~4.2 GB | Yes |
| YOLOv8x (extra-large) | 68.2M | ~2.5 GB | ~5.8 GB | Yes |
Even YOLOv8x at 1280×1280 resolution uses under 6 GB, leaving room to co-host additional models like an LLM or PaddleOCR on the same GPU. For broader GPU sizing, see the best GPU for inference guide.
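A quick sanity check on the table: FP16 stores two bytes per parameter, so the raw weights are a small fraction of each footprint; the rest is activation memory and CUDA workspace, which grow with input resolution. A rough calculation using the parameter counts from the table:

```python
# Back-of-envelope FP16 weight memory: 2 bytes per parameter
params_m = {
    "YOLOv8n": 3.2, "YOLOv8s": 11.2, "YOLOv8m": 25.9,
    "YOLOv8l": 43.7, "YOLOv8x": 68.2,
}
for name, millions in params_m.items():
    weight_gb = millions * 1e6 * 2 / 1024**3
    print(f"{name}: ~{weight_gb:.2f} GB of FP16 weights")
# Even YOLOv8x's weights come to only ~0.13 GB; the larger totals in the
# table are dominated by activations, which scale with input size.
```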
Setup with Ultralytics
Install Ultralytics:

```shell
pip install ultralytics
```

Run object detection on an image:

```python
from ultralytics import YOLO

model = YOLO("yolov8x.pt")  # auto-downloads weights on first run
results = model("input.jpg", conf=0.25, device="cuda")

# Save an annotated copy with bounding boxes drawn
results[0].save("output.jpg")
print(results[0].boxes.data)  # per row: x1, y1, x2, y2, confidence, class
```
YOLOv8 supports detection, segmentation, pose estimation, and classification from a single model family. The Ultralytics API auto-downloads weights on first run.
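The `boxes.data` tensor printed above holds one row per detection: four corner coordinates, a confidence score, and a class id. As an illustration with made-up values (the rows and the class-map subset here are hypothetical; in real code `boxes.data` is a tensor, so you would call `.tolist()` on it first), decoding it into plain dicts looks like:

```python
# Hypothetical boxes.data rows: x1, y1, x2, y2, confidence, class id
rows = [
    [ 52.0, 101.0, 310.0, 420.0, 0.91,  0.0],
    [400.0, 180.0, 612.0, 388.0, 0.47, 16.0],
]
names = {0: "person", 16: "dog"}  # small subset of the COCO class map

detections = [
    {
        "class": names[int(row[5])],
        "confidence": row[4],
        "bbox": row[:4],
    }
    for row in rows
]
print(detections)
```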
Building a Detection API
Install the API dependencies:

```shell
pip install fastapi uvicorn python-multipart
```

```python
# api.py
from fastapi import FastAPI, UploadFile
from ultralytics import YOLO
import tempfile, os

app = FastAPI()
model = YOLO("yolov8x.pt")

@app.post("/detect")
async def detect(file: UploadFile):
    # Persist the upload to disk so Ultralytics can read it by path
    with tempfile.NamedTemporaryFile(delete=False, suffix=".jpg") as tmp:
        tmp.write(await file.read())
        tmp_path = tmp.name
    results = model(tmp_path, conf=0.25, device="cuda")
    detections = []
    for box in results[0].boxes:
        detections.append({
            "class": results[0].names[int(box.cls)],
            "confidence": float(box.conf),
            "bbox": box.xyxy[0].tolist(),
        })
    os.unlink(tmp_path)
    return {"detections": detections}
```

Run the server with:

```shell
uvicorn api:app --host 0.0.0.0 --port 8000
```
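On the client side you would typically POST an image and then filter the returned detections by confidence before acting on them. A minimal sketch, assuming the JSON shape returned by the endpoint above (the example payload values are made up):

```python
def filter_detections(detections, min_conf=0.5):
    """Keep only detections at or above the confidence threshold."""
    return [d for d in detections if d["confidence"] >= min_conf]

# Hypothetical response body in the shape returned by /detect
payload = {"detections": [
    {"class": "person", "confidence": 0.91, "bbox": [52, 101, 310, 420]},
    {"class": "dog", "confidence": 0.47, "bbox": [400, 180, 612, 388]},
]}

print(filter_detections(payload["detections"]))  # only the person survives
```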
Read our self-host guide for server setup fundamentals and production deployment tips.
RTX 4060 Inference Benchmarks
Tested with a 1920×1080 image resized to model input size. See the benchmark tool for more data.
| Model | Input Size | Inference Time | FPS | VRAM Usage |
|---|---|---|---|---|
| YOLOv8n | 640×640 | 2.1 ms | ~476 | 0.5 GB |
| YOLOv8s | 640×640 | 3.4 ms | ~294 | 0.8 GB |
| YOLOv8m | 640×640 | 5.8 ms | ~172 | 1.2 GB |
| YOLOv8l | 640×640 | 8.7 ms | ~115 | 1.8 GB |
| YOLOv8x | 640×640 | 13.2 ms | ~76 | 2.5 GB |
YOLOv8n on the RTX 4060 achieves over 470 FPS, making it suitable for multi-stream video processing. Even YOLOv8x at 76 FPS is well above real-time for single-camera applications.
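The FPS column is simply the reciprocal of per-frame latency, which makes the table easy to extend to your own timings:

```python
# FPS = 1000 / per-frame inference time in milliseconds
times_ms = {
    "YOLOv8n": 2.1, "YOLOv8s": 3.4, "YOLOv8m": 5.8,
    "YOLOv8l": 8.7, "YOLOv8x": 13.2,
}
for name, ms in times_ms.items():
    print(f"{name}: ~{1000 / ms:.0f} FPS")
# Matches the table: 476, 294, 172, 115, 76
```

These are single-stream inference numbers; end-to-end pipelines also pay for video decode, preprocessing, and post-processing.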
Optimisation Tips
- Export to TensorRT with `model.export(format="engine")` for a 2-3x speedup on Ada Lovelace GPUs.
- Use FP16 inference (default on GPU) for the best speed-to-accuracy balance.
- Batch multiple frames when processing video to maximise GPU utilisation.
- Choose the right model size: YOLOv8n for speed-critical applications, YOLOv8x for maximum accuracy.
- Use INT8 quantisation via TensorRT for an additional 30-50% speedup with minimal accuracy loss.
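The batching tip can be sketched as a simple chunking loop; `model(batch)` is the same Ultralytics call used earlier, and the frame paths here are hypothetical:

```python
def batched(frames, batch_size=8):
    """Yield fixed-size chunks of frames for batched inference."""
    for i in range(0, len(frames), batch_size):
        yield frames[i:i + batch_size]

frames = [f"frame_{i:04d}.jpg" for i in range(20)]  # hypothetical paths
for batch in batched(frames):
    # results = model(batch, conf=0.25, device="cuda")  # one forward pass per batch
    print(len(batch))  # 8, 8, 4
```

Larger batches improve GPU utilisation at the cost of per-frame latency, so tune `batch_size` to your VRAM headroom and latency budget.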
For OCR workloads alongside detection, see our PaddleOCR hosting page. Compare GPU options with the GPU comparisons tool.
Next Steps
The RTX 4060 is ideal for YOLOv8 inference at any model size. For higher resolution processing or multi-stream detection, the RTX 4060 Ti offers more headroom. Pair YOLO with an LLM for visual question answering pipelines. Browse all deployment guides in the model guides section, or check OCR speed benchmarks for document processing alternatives.
Deploy YOLOv8 Now
Run real-time object detection on a dedicated RTX 4060 server. No per-inference API charges and full root access.
Browse GPU Servers