The RTX 4090 24GB carries two 8th-generation NVENC encoders and a 5th-generation NVDEC decoder, all usable in parallel with tensor-core inference. On UK dedicated GPU hosting, this combination makes the 4090 the most economical single-card option for video-AI pipelines in 2026: AV1 hardware encode for bandwidth savings, NVDEC for ingest, and 660 dense FP8 TFLOPS for the VLM, detection, recognition or transcription workload that runs against the decoded frames. This piece walks through the encoder and decoder blocks, the codec matrix, measured throughput, the AI pipeline pattern that keeps frames in VRAM, and the FFmpeg and DeepStream commands that hold up under load.
Contents
- NVENC and NVDEC blocks on Ada
- Codec support matrix
- Throughput numbers per codec and resolution
- AI video pipeline pattern (zero-copy)
- FFmpeg + CUDA + DeepStream recipes
- Real workload mixes
- Production gotchas
- Verdict and when to pick the 4090
NVENC and NVDEC blocks on Ada
Ada introduced AV1 hardware encode for the first time on consumer NVIDIA. The 4090 ships with two NVENC blocks (the doubled count is an upper-tier Ada feature; lower-tier cards such as the 4070 have one) and a single NVDEC. Both sit on independent clock domains from the SMs, so encoding does not steal tensor compute. The third-generation Optical Flow Accelerator (OFA) sits alongside, used for frame interpolation and for the hardware-accelerated optical flow primitives in DLSS 3 and DeepStream's tracker plugins.
| Block | Generation | Count on 4090 | Clock domain | Notes |
|---|---|---|---|---|
| NVENC | 8th gen (NVENC AV1) | 2 | Independent | AV1, HEVC, H.264, all 10-bit |
| NVDEC | 5th gen | 1 | Independent | AV1, HEVC, VP9, H.264, MPEG-2, JPEG (NVJPEG) |
| OFA | 3rd gen | 1 | Independent | Optical flow for DeepStream tracker |
| SMs (compute) | 4th-gen tensor cores | 128 | Main GPU clock | 660 FP8 TFLOPS dense |
The independent clock domains matter. A YOLOv8 detection workload running on the SMs at 90 percent utilisation does not throttle the NVENC blocks; they run from a separate PLL and have their own power budget within the card's overall envelope. The card's 450 W TDP is shared across all blocks, but at a typical 30 fps 1080p encode the NVENC pair draws roughly 25 W each, leaving plenty of headroom for SM work. See power draw efficiency for the per-block breakdown.
Codec support matrix
| Codec | Encode | Decode | 10-bit | HDR | Notes |
|---|---|---|---|---|---|
| H.264 (AVC) | Yes | Yes | No (encode) | No | Up to 4096×4096; legacy compatibility |
| HEVC (H.265) | Yes | Yes | Yes | HDR10 | Up to 8192×8192 |
| AV1 | Yes | Yes | Yes | HDR10+ | New on Ada; ~30% bitrate saving vs HEVC |
| VP9 | No | Yes | Yes | HDR10 | Decode only; for YouTube/web ingest |
| MPEG-2 | No | Yes | No | No | Legacy broadcast |
| JPEG | No | Yes (NVJPEG) | n/a | n/a | For data-loading pipelines (PyTorch) |
| MJPEG | No | Yes | No | No | Legacy IP cameras |
AV1 is the headline. A 1080p stream at quality-equivalent bitrate runs at roughly 5 Mbps on AV1 vs 7.5 Mbps on HEVC vs 12 Mbps on H.264. For a 200-MAU SaaS video platform serving 50,000 hours per month, the bandwidth saving over HEVC is significant. The 4090 is also the cheapest card with NVENC AV1; if your pipeline needs AV1 encode at scale, the 4090 is the floor and the L40S is the next step up at roughly 4x the price.
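The saving is easy to quantify. A minimal sketch, assuming the quality-equivalent bitrates quoted above and a constant-bitrate stream over the 50,000 viewing hours per month:

```python
# Monthly egress for 50,000 viewing hours at quality-equivalent bitrates.
# Bitrates (Mbps) are the rough figures quoted above for 1080p.
HOURS_PER_MONTH = 50_000

def egress_tb(bitrate_mbps: float, hours: float = HOURS_PER_MONTH) -> float:
    """Terabytes transferred per month at a constant bitrate."""
    bits = bitrate_mbps * 1e6 * hours * 3600
    return bits / 8 / 1e12  # bits -> bytes -> TB

h264 = egress_tb(12.0)  # 270.0 TB
hevc = egress_tb(7.5)   # 168.75 TB
av1  = egress_tb(5.0)   # 112.5 TB

print(f"AV1 saves {hevc - av1:.1f} TB/month vs HEVC")   # 56.2 TB
print(f"AV1 saves {h264 - av1:.1f} TB/month vs H.264")  # 157.5 TB
```

At typical egress pricing, tens of terabytes per month saved is what pays for the AV1-capable card.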
Throughput numbers per codec and resolution
The driver balances sessions across both NVENC blocks automatically, whether you invoke them through FFmpeg's CUDA hwaccel path (-init_hw_device cuda) or DeepStream's encoder plugin. Below are stream counts at typical bitrates, measured with nvenc-perf on a 4090 running CUDA 12.4 with no other workload.
| Codec | Resolution | FPS per stream | Concurrent streams | Engine utilisation |
|---|---|---|---|---|
| H.264 encode | 1080p | 30 | ~80 | ~95% NVENC pair busy |
| H.265 encode | 1080p | 30 | ~62 | ~92% busy |
| AV1 encode | 1080p | 30 | ~36 | ~88% busy |
| H.264 encode | 4K | 30 | ~14 | ~90% busy |
| H.265 encode | 4K | 60 | ~6 | ~94% busy |
| AV1 encode | 4K | 60 | ~5 | ~91% busy |
| H.264 decode | 1080p | 30 | ~250 | NVDEC saturated |
| H.265 decode | 4K | 60 | ~22 | NVDEC saturated |
| AV1 decode | 4K | 60 | ~18 | NVDEC saturated |
| JPEG decode (NVJPEG) | 2048×2048 | n/a | ~2400 images/s | For PyTorch DataLoader |
The asymmetry between encode and decode counts (250 vs 80 at 1080p H.264) reflects that decode is computationally cheaper per frame. NVDEC is the bottleneck for ingest-heavy pipelines (think 100+ camera streams); NVENC is the bottleneck for transcoding fan-out. For mixed pipelines (decode + AI + encode at the same fps) plan around the encoder count: 36 concurrent 1080p AV1 encodes is the practical ceiling.
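The planning rule above can be folded into a quick capacity check. A hypothetical helper, using the measured 1080p30 figures from the table (the constants are those numbers, nothing more):

```python
# Rough single-4090 capacity check for a decode -> AI -> encode pipeline.
# Caps are the measured 1080p30 stream counts from the table above.
ENCODE_CAP_1080P30 = {"h264": 80, "hevc": 62, "av1": 36}
H264_DECODE_CAP_1080P30 = 250

def max_streams(encode_codec: str, needs_encode: bool = True) -> int:
    """Practical per-card ceiling at 1080p30.

    Mixed pipelines are bound by the encoder; ingest-only pipelines
    are bound by the single NVDEC engine.
    """
    if needs_encode:
        return min(H264_DECODE_CAP_1080P30, ENCODE_CAP_1080P30[encode_codec])
    return H264_DECODE_CAP_1080P30

print(max_streams("av1"))                       # 36  -> encoder-bound
print(max_streams("h264", needs_encode=False))  # 250 -> NVDEC-bound
```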
AI video pipeline pattern (zero-copy)
The canonical Ada video-AI pipeline keeps frames in VRAM end-to-end: ingest RTSP or RTMP -> NVDEC -> CUDA frame buffer -> tensor core inference -> OSD overlay -> NVENC -> egress. The NVDEC frame stays in VRAM the whole time; you never round-trip to host memory. With NVIDIA DeepStream 7.0 or PyAV + CuPy you can sustain the workloads below.
| Workload | Streams | Precision | Tensor util | End-to-end latency |
|---|---|---|---|---|
| Object detection (YOLOv8m) | 32x 1080p30 | FP16 | 62% | ~110 ms |
| Face recognition (ArcFace + RetinaFace) | 48x 720p30 | FP16 | 40% | ~140 ms |
| ANPR (LPRNet + YOLO) | 64x 1080p30 | FP16 | 35% | ~95 ms |
| Whisper Turbo audio extract | 20x realtime | FP16 | 50% | ~280 ms (chunk) |
| VLM frame caption (LLaVA 7B FP8) | 4x 1fps | FP8 | 72% | ~750 ms |
| Semantic segmentation (SegFormer-b2) | 16x 1080p30 | FP16 | 55% | ~85 ms |
The mixing rule of thumb: budget 60 percent of tensor utilisation as the safe ceiling when sharing a card with NVENC/NVDEC at high counts, because the SMs occasionally pause on memory access conflicts with the encoder DMA. For pure inference workloads the ceiling is closer to 85 percent. The video-AI mix is the right pattern when you need to do both, not when you can split them across hosts.
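The rule of thumb is simple enough to encode as a pre-flight check. A hypothetical helper (`fits_on_card` is illustrative; the 0.60 and 0.85 ceilings are the figures stated above):

```python
# Sanity-check a workload mix against the tensor-utilisation ceiling:
# 0.60 when NVENC/NVDEC run hot alongside the SMs, 0.85 for pure inference.
def fits_on_card(utilisations: list[float], sharing_video_engines: bool) -> bool:
    ceiling = 0.60 if sharing_video_engines else 0.85
    return sum(utilisations) <= ceiling

# ANPR alone (35% from the table) fits under the shared ceiling:
print(fits_on_card([0.35], sharing_video_engines=True))        # True
# ANPR + Whisper together (35% + 50%) does not:
print(fits_on_card([0.35, 0.50], sharing_video_engines=True))  # False
```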
FFmpeg + CUDA + DeepStream recipes
The FFmpeg recipe that keeps everything on the GPU, pinned to GPU 0:
```bash
# Decode + scale + AV1 encode on GPU, no host round-trip
ffmpeg -hwaccel cuda -hwaccel_output_format cuda \
  -i rtsp://camera1/stream \
  -vf "scale_cuda=1920:1080:format=yuv420p" \
  -c:v av1_nvenc \
  -preset p4 \
  -tune ll \
  -b:v 5M -maxrate 6M -bufsize 10M \
  -gpu 0 \
  -f rtsp rtsp://server/encoded1
# -preset p4: quality preset (p1 fastest, p7 best); -tune ll: low latency for live
# -gpu 0: pin the job to GPU 0
```
```bash
# Two parallel streams, one per NVENC instance (driver auto-balances)
ffmpeg -hwaccel cuda -i in1.mp4 -c:v hevc_nvenc -gpu 0 out1.mp4 &
ffmpeg -hwaccel cuda -i in2.mp4 -c:v hevc_nvenc -gpu 0 out2.mp4 &
# The driver spreads the sessions across NVENC 0 and NVENC 1 automatically
```
For PyTorch-side ingest, torchaudio.io.StreamReader with hw_accel="cuda:0" binds NVDEC frames straight into a CUDA tensor without going through host memory:
```python
# PyTorch zero-copy ingest from RTSP, frames straight into a CUDA tensor
import torch
import torchaudio

reader = torchaudio.io.StreamReader("rtsp://camera1/stream")
reader.add_video_stream(
    frames_per_chunk=1,
    decoder="h264_cuvid",  # NVDEC H.264
    hw_accel="cuda:0",
)
for chunk in reader.stream():
    frame = chunk[0]  # video chunk, already on CUDA
    # tensor is (1, 3, H, W) uint8 on cuda:0
    with torch.no_grad():
        detections = yolo_model(frame.float() / 255.0)
```
For DeepStream the equivalent is the standard reference pipeline (deepstream-app with a config file) that wires nvurisrcbin -> nvinfer -> nvtracker -> nvmsgconv -> nvmsgbroker. DeepStream 7.0 ships with TensorRT 10 and supports FP8 inference on Ada out of the box, which gets the YOLOv8m count up from 32 to 48 streams.
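As an orientation aid, a minimal deepstream-app source/mux/inference fragment might look like the following. This is an illustrative sketch, not a complete config: the group and key names follow the deepstream-app reference configs, but the inference config filename is a placeholder you would replace with your own.

```ini
[source0]
enable=1
type=4                      ; 4 = RTSP source
uri=rtsp://camera1/stream
num-sources=1
gpu-id=0

[streammux]
gpu-id=0
batch-size=32               ; match your stream count
width=1920
height=1080

[primary-gie]
enable=1
gpu-id=0
config-file=config_infer_primary_yolov8.txt   ; placeholder path
```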
Real workload mixes
The 4090 NVENC + NVDEC + tensor mix that works in production:
| Use case | Decode | Inference | Encode | Card budget |
|---|---|---|---|---|
| City-scale ANPR | 64x 1080p30 H.265 | LPRNet INT8 | none (events only) | 1 card |
| Live sports overlay | 4x 4K60 HEVC | YOLOv8m FP16 + tracker | 4x 4K60 AV1 | 1 card |
| VOD AI tagging | 20x 1080p HEVC | VLM caption (LLaVA 7B FP8) | none (metadata only) | 1 card |
| Conference recording | 8x 1080p30 + 8x audio | Whisper Turbo INT8 | 8x AV1 archive | 1 card |
| Camera grid retail | 32x 1080p30 | YOLOv8m FP16 + ArcFace | 32x 720p H.264 | 1 card |
For a 200-MAU SaaS video platform doing live transcode with AI tagging on the side, a single 4090 handles 32 input streams at 1080p30 with YOLO detection and AV1 transcode, all on one card, at a UK hosted cost around £329/month. The same workload on cloud (AWS Elemental + SageMaker) lands at roughly 5x the monthly cost. See the cloud H100 comparison for the broader picture.
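The per-stream economics are simple arithmetic. The £329/month figure and the roughly-5x cloud multiple are the ones quoted above; everything else follows:

```python
# Per-stream monthly cost on a hosted 4090 vs the quoted cloud multiple.
CARD_MONTHLY_GBP = 329.0
STREAMS = 32
CLOUD_MULTIPLE = 5.0

per_stream = CARD_MONTHLY_GBP / STREAMS
print(f"hosted 4090: £{per_stream:.2f}/stream/month")              # £10.28
print(f"cloud equivalent: £{per_stream * CLOUD_MULTIPLE:.2f}")     # £51.41
```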
Production gotchas
- Consumer GeForce drivers cap NVENC sessions at 8 by default. Linux datacentre drivers and the patched session policy on gigagpu.com images remove the cap. Verify with `nvidia-smi -q -d UTILIZATION`, which shows active session counts.
- AV1 preset above p5 collapses encode throughput. Preset p7 is offline-only; for live use stick to p3-p5. The quality difference at 5 Mbps is small.
- NVDEC saturates before the SMs do. Decoding 250+ 1080p streams maxes the single NVDEC engine; if you need more decode capacity you need a second card, not a faster one.
- RTSP timeout defaults are wrong. Set `-rtsp_transport tcp -timeout 30000000` for production reliability against flaky cameras.
- FFmpeg's `-hwaccel_output_format cuda` is mandatory for zero-copy. Without it, FFmpeg silently copies frames back to host between filters, killing throughput.
- HEVC encode at low bitrate shows noticeable banding. Below 3 Mbps for 1080p, switch to AV1 or accept the artefacts.
- DeepStream's RGBA conversion eats VRAM. Each 1080p RGBA frame is 8 MB; 32 streams × 4 frames in the pipeline is 1 GB. Account for it in your `--gpu-memory-utilization` ceiling.
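The RGBA accounting generalises to any stream count and resolution. A quick budgeting sketch (frames-in-flight is pipeline-dependent; 4 is the assumption used above):

```python
# VRAM held by in-flight RGBA frames in a DeepStream-style pipeline.
def rgba_pipeline_bytes(streams: int, width: int = 1920, height: int = 1080,
                        frames_in_flight: int = 4) -> int:
    frame = width * height * 4  # 4 bytes per pixel for RGBA
    return streams * frames_in_flight * frame

gb = rgba_pipeline_bytes(32) / 1e9
print(f"{gb:.2f} GB")  # 1.06 GB for 32x 1080p streams
```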
Verdict and when to pick the 4090 for video-AI
The 4090 24GB is the cheapest single card that does AV1 encode and tensor inference simultaneously at production scale. Pick it for live transcode plus AI overlay (sports, retail, security), VOD tagging at scale (VLM frame captioning, search index generation), conference-grade recording with Whisper transcription, and any workload that needs 30+ concurrent decode streams plus an AI model on the same frames. Skip it if you need more than 32 high-quality AV1 1080p encodes per card (move to L40S or two 4090s in data parallel), if your inference workload demands more than 24 GB VRAM (move to a 5090 32GB or RTX 6000 Pro), or if you need ECC video memory for broadcast-grade reliability (RTX 6000 Ada).
For a 12-engineer video product team building a live AI overlay product, a single 4090 is the right starting point: it handles the realistic workload mix (8-16 streams at 1080p30 with YOLOv8 detection and AV1 encode) at a fixed monthly cost rather than per-second cloud billing surprises. See the monthly hosting cost piece for the full breakdown and spec breakdown for the surrounding silicon detail.
Hardware AV1 + AI on one card, hosted in the UK
NVENC, NVDEC and tensor cores all hot, DeepStream 7.0 and CUDA 12.4 pre-built. UK dedicated hosting.
Order the RTX 4090 24GB
See also: spec breakdown, TFLOPS class, power draw efficiency, vLLM setup, thermal performance, monthly hosting cost, vs cloud H100.