
Stable Diffusion: Concurrent Image Generation by GPU

How many images can each GPU generate concurrently with Stable Diffusion? Batch throughput benchmarks for SDXL, SD 1.5, and FLUX.1 across six GPUs.

Concurrent Generation Overview

Image generation APIs need to handle multiple requests simultaneously. Unlike autoregressive LLM inference, where output lengths vary per request, diffusion models run a fixed number of denoising steps, so multiple images can share every step of a single batched denoising loop, trading per-image latency for higher aggregate throughput. We tested batch image generation across six GPUs on dedicated GPU servers to measure how many images per minute each card can produce.

Tests ran on GigaGPU bare-metal servers using the diffusers library with default step counts (20 for SD 1.5, 30 for SDXL, 28 for FLUX.1). We measured images per minute at batch sizes 1, 2, 4, and 8 (where VRAM permits). For per-image latency, see the image generation latency benchmark.
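Every throughput figure below is derived from wall-clock batch latency. A minimal conversion helper makes the relationship explicit (illustrative only; the actual benchmark harness is not shown here):

```python
def images_per_minute(batch_size: int, batch_latency_s: float) -> float:
    """Aggregate throughput implied by one batch's wall-clock latency."""
    return 60.0 * batch_size / batch_latency_s

# A batch of 8 SDXL images finishing in ~8.7 s implies ~55 img/min.
print(round(images_per_minute(8, 8.7), 1))  # 55.2
```

This is why larger batches win on throughput even though each individual image waits longer: the fixed per-step overhead is amortized across the whole batch.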

SDXL Batch Throughput by GPU

SDXL at 1024×1024, 30 steps. Images per minute at each batch size.

| GPU | Batch 1 (img/min) | Batch 2 (img/min) | Batch 4 (img/min) | Batch 8 (img/min) |
| --- | --- | --- | --- | --- |
| RTX 3050 (6 GB) | 1.4 | OOM | OOM | OOM |
| RTX 4060 (8 GB) | 3.2 | 4.8 | OOM | OOM |
| RTX 4060 Ti (16 GB) | 4.7 | 7.4 | 10.8 | OOM |
| RTX 3090 (24 GB) | 7.3 | 11.6 | 17.2 | 22.0 |
| RTX 5080 (16 GB) | 10.9 | 17.4 | 25.6 | OOM |
| RTX 5090 (32 GB) | 17.6 | 28.2 | 42.0 | 55.0 |

The RTX 5090 produces 55 SDXL images per minute at batch 8 — nearly 80,000 images per day. The RTX 3090 manages 22 images per minute at batch 8, roughly 32,000 per day. The RTX 4060 is limited to batch 2 for SDXL due to VRAM constraints.
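The per-day figures are the per-minute rate scaled out, assuming sustained 24/7 utilization with no downtime:

```python
def images_per_day(img_per_min: float) -> int:
    """Scale a steady per-minute rate to daily capacity (24/7 operation)."""
    return int(img_per_min * 60 * 24)

print(images_per_day(55))  # RTX 5090, SDXL batch 8 -> 79200
print(images_per_day(22))  # RTX 3090, SDXL batch 8 -> 31680
```

Real-world capacity will be lower once you account for queue gaps, model reloads, and maintenance windows.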

SD 1.5 Batch Throughput by GPU

SD 1.5 at 512×512, 20 steps. The lighter model allows larger batch sizes.

| GPU | Batch 1 (img/min) | Batch 2 (img/min) | Batch 4 (img/min) | Batch 8 (img/min) |
| --- | --- | --- | --- | --- |
| RTX 3050 (6 GB) | 7.1 | 10.8 | 15.2 | OOM |
| RTX 4060 (8 GB) | 15.8 | 25.2 | 36.0 | 44.0 |
| RTX 4060 Ti (16 GB) | 23.1 | 37.8 | 56.0 | 72.0 |
| RTX 3090 (24 GB) | 35.3 | 58.4 | 88.0 | 114.0 |
| RTX 5080 (16 GB) | 54.5 | 90.0 | 134.0 | 172.0 |
| RTX 5090 (32 GB) | 85.7 | 142.0 | 212.0 | 280.0 |

SD 1.5 is dramatically faster — the RTX 5090 produces 280 images per minute at batch 8. Even the RTX 4060 handles 44 images per minute, making it viable for lightweight image generation APIs.

FLUX.1 Throughput by GPU

FLUX.1 (dev) at 1024×1024, 28 steps. FLUX.1’s higher VRAM requirements limit batch sizes on smaller cards.

| GPU | Batch 1 (img/min) | Batch 2 (img/min) | Batch 4 (img/min) |
| --- | --- | --- | --- |
| RTX 4060 Ti (16 GB) | 2.1 | OOM | OOM |
| RTX 3090 (24 GB) | 3.8 | 5.8 | OOM |
| RTX 5080 (16 GB) | 5.4 | OOM | OOM |
| RTX 5090 (32 GB) | 8.8 | 14.0 | 20.4 |

FLUX.1 is VRAM-hungry — only the 32 GB RTX 5090 supports batch 4, and the 16 GB cards cannot batch at all. For FLUX.1 production APIs, the 5090 is effectively the minimum card for any meaningful throughput.

Queue vs Batch: Throughput Strategy

For image generation APIs, you have two strategies: queue individual requests (batch 1, lowest latency) or accumulate requests into batches (higher throughput, higher latency). Queue mode delivers images in 3-18 seconds depending on GPU and model. Batch mode can double or triple throughput but adds wait time while the batch fills.

A common production approach is dynamic batching with a short timeout (200-500 ms). If multiple requests arrive within the timeout window, they batch together; otherwise, they process individually. This balances throughput and latency automatically. For overall API capacity planning, see the GPU capacity planning for AI SaaS guide. For more image generation analysis, explore the Benchmarks category.
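The dynamic-batching pattern above can be sketched as a small asyncio worker; the names (`DynamicBatcher`, `generate_batch`) are illustrative, not a specific framework's API:

```python
import asyncio

class DynamicBatcher:
    """Collect requests for up to `timeout` seconds (or until `max_batch`
    arrive), then run them through the model as a single batch."""

    def __init__(self, generate_batch, max_batch=8, timeout=0.3):
        self.generate_batch = generate_batch  # list[prompt] -> list[image]
        self.max_batch = max_batch
        self.timeout = timeout
        self.queue = asyncio.Queue()

    async def submit(self, prompt):
        """Called once per API request; resolves when the image is ready."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, fut))
        return await fut

    async def run(self):
        """Worker loop: block for the first request, then keep filling the
        batch until the timeout window closes or the batch is full."""
        while True:
            batch = [await self.queue.get()]
            deadline = asyncio.get_running_loop().time() + self.timeout
            while len(batch) < self.max_batch:
                remaining = deadline - asyncio.get_running_loop().time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            images = self.generate_batch([prompt for prompt, _ in batch])
            for (_, fut), image in zip(batch, images):
                fut.set_result(image)
```

In production, `generate_batch` would wrap a batched diffusion pipeline call: requests arriving within the window share one denoising loop, while a lone request still processes immediately after the timeout, capping worst-case added latency at the window length.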

Conclusion

Concurrent image generation throughput depends heavily on GPU VRAM and model size. The RTX 5090 leads with 55 SDXL images per minute at batch 8, while the RTX 3090 delivers 22 — both strong options for production image APIs. For SD 1.5 workloads, even budget GPUs offer high throughput. Match your GPU choice to your model, batch strategy, and volume requirements at GigaGPU dedicated hosting. See also the RTX 3090 vs RTX 5090 throughput per dollar comparison and the GPU comparisons category.

Size Your GPU Server

Tell us your workload — we’ll recommend the right GPU.

Browse GPU Servers


We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
