Yes, the RTX 3090 can run SDXL and a 7B LLM simultaneously. With 24GB GDDR6X VRAM, the RTX 3090 has enough capacity to load both an SDXL checkpoint for image generation and a quantised language model for text tasks. This makes it a versatile single-GPU solution for multi-modal AI workflows.
## The Short Answer
YES. SDXL (~10.5GB) plus a 7B-class LLM in INT4 (~6-7GB including KV cache) fits within 24GB with room for both to operate.
The key to running both models is VRAM budgeting. SDXL base in FP16 with a 1024×1024 generation pipeline consumes approximately 10.5GB at peak. A 7B-class LLM in INT4 (such as Mistral 7B, or the slightly larger LLaMA 3 8B) needs about 5GB for weights plus 1-2GB for KV cache. Combined, that is roughly 17-18GB, leaving about 6GB of headroom on the RTX 3090.
The constraint is that you cannot run both models at maximum settings. The LLM should be quantised to INT4, and SDXL generation should stick to batch size 1. With this configuration, both workloads perform well enough for production use.
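The budgeting above can be sketched in a few lines. This is an illustrative estimate only: the figures mirror the rough numbers in this article, and the effective ~5 bits per weight approximates a q4_K_M-style quantisation, not a measured value.

```python
# Rough VRAM budget for co-resident SDXL + INT4 LLM on a 24GB card.
# All numbers are illustrative estimates, not measurements.

def llm_vram_gb(params_b: float, effective_bits: float, kv_cache_gb: float) -> float:
    """Approximate LLM VRAM: quantised weights plus KV cache."""
    weights_gb = params_b * effective_bits / 8  # billions of params -> GB
    return weights_gb + kv_cache_gb

SDXL_PEAK_GB = 10.5                    # FP16 base + 1024x1024 pipeline, peak
CARD_GB = 24.0                         # RTX 3090

llm_gb = llm_vram_gb(8, 5.0, 2.0)      # LLaMA 3 8B, ~5 bits/weight, ~2GB KV cache
total = SDXL_PEAK_GB + llm_gb
print(f"LLM ~{llm_gb:.1f}GB, total ~{total:.1f}GB, headroom ~{CARD_GB - total:.1f}GB")
```

Plugging in a larger context window (a bigger KV cache) or a higher-precision quantisation quickly eats the headroom, which is why INT4 and batch size 1 are the recommended operating point.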
## VRAM Analysis
| Combined Configuration | SDXL VRAM | LLM VRAM | Total | RTX 3090 (24GB) |
|---|---|---|---|---|
| SDXL + LLaMA 3 8B INT4 | ~10.5GB | ~7GB | ~17.5GB | Fits well |
| SDXL + Mistral 7B INT4 | ~10.5GB | ~6.5GB | ~17GB | Fits well |
| SDXL + LLaMA 3 8B INT8 | ~10.5GB | ~10.5GB | ~21GB | Tight |
| SDXL + LLaMA 3 8B FP16 | ~10.5GB | ~18GB | ~28.5GB | No |
| SDXL + DeepSeek R1 7B INT4 | ~10.5GB | ~6.5GB | ~17GB | Fits well |
The sweet spot is SDXL plus a 7B model in INT4. Both models stay fully in VRAM without offloading, which means switching between image generation and text inference is instantaneous, with no model loading delays. For the full picture on VRAM allocation, see our SDXL VRAM guide and LLaMA 3 VRAM requirements.
## Performance Benchmarks
| Workload | RTX 3090 (Solo) | RTX 3090 (Combined) | Impact |
|---|---|---|---|
| SDXL 1024×1024 (20 steps) | ~2.9s / image | ~3.2s / image | ~10% slower |
| LLaMA 3 8B INT4 output | ~55 tok/s | ~48 tok/s | ~13% slower |
| Mistral 7B INT4 output | ~50 tok/s | ~43 tok/s | ~14% slower |
Keeping both models resident simultaneously incurs roughly a 10-15% performance penalty compared to running each alone, due to memory bandwidth sharing and reduced VRAM available for caching. Both workloads remain well within production-acceptable speeds. The penalty grows further if image generation and LLM inference actually execute at the same moment, rather than alternating. See detailed throughput on our benchmarks page.
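The percentage figures in the table follow directly from the solo and combined throughput numbers. A quick sanity check (using the table's own values):

```python
# Recompute the "Impact" column from the benchmark table.

def slowdown_pct(solo: float, combined: float) -> float:
    """Percent slowdown for a higher-is-better metric like tok/s."""
    return (solo - combined) / solo * 100

print(round(slowdown_pct(55, 48)))  # LLaMA 3 8B INT4: ~13% slower
print(round(slowdown_pct(50, 43)))  # Mistral 7B INT4: ~14% slower

# Image latency is lower-is-better, so the slowdown is measured
# against the solo time instead.
print(round((3.2 - 2.9) / 2.9 * 100))  # SDXL: ~10% slower
```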
## Setup Guide
Run ComfyUI for SDXL and Ollama for the LLM as separate services:
```bash
# Terminal 1: Start Ollama with LLaMA 3 8B INT4
ollama run llama3:8b-instruct-q4_K_M
```

```bash
# Terminal 2: Start ComfyUI for SDXL
cd ComfyUI
python main.py --listen 0.0.0.0 --port 8188
```
Do NOT use the `--lowvram` flag for ComfyUI in this configuration, as it enables CPU offloading, which is unnecessary here and slows things down. Both models should stay fully resident in VRAM.
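With both services up, a single script can drive them over their HTTP APIs. A minimal sketch, assuming Ollama's default port 11434 and the ComfyUI port used above; `my_workflow` is a placeholder for a workflow graph you would export from ComfyUI yourself (via "Save (API Format)"):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"
COMFYUI_URL = "http://localhost:8188/prompt"

def ollama_payload(prompt: str, model: str = "llama3:8b-instruct-q4_K_M") -> dict:
    # Non-streaming generation request for Ollama's /api/generate endpoint.
    return {"model": model, "prompt": prompt, "stream": False}

def comfyui_payload(workflow: dict) -> dict:
    # ComfyUI's /prompt endpoint queues an API-format workflow graph.
    return {"prompt": workflow}

def post_json(url: str, payload: dict) -> dict:
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example usage (requires both services running):
# reply = post_json(OLLAMA_URL, ollama_payload("Describe a neon-lit city."))
# post_json(COMFYUI_URL, comfyui_payload(my_workflow))
```

Because both models stay resident, the script can alternate between text and image requests with no load-time penalty between calls.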
For a more integrated approach using an API layer:
```bash
# vLLM for the LLM with limited VRAM allocation
vllm serve meta-llama/Meta-Llama-3-8B-Instruct \
  --quantization awq \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.30 \
  --host 0.0.0.0 --port 8000
```
Setting `--gpu-memory-utilization 0.30` caps vLLM's VRAM usage at roughly 7.2GB (30% of 24GB), leaving the rest for ComfyUI.
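vLLM's server speaks the OpenAI-compatible chat API, so any OpenAI client works against it. A minimal stdlib-only sketch (the model name mirrors the serve command above):

```python
import json
import urllib.request

# vLLM exposes an OpenAI-compatible endpoint at /v1/chat/completions.
VLLM_URL = "http://localhost:8000/v1/chat/completions"

def chat_payload(user_msg: str, max_tokens: int = 256) -> dict:
    return {
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [{"role": "user", "content": user_msg}],
        "max_tokens": max_tokens,
    }

def chat(user_msg: str) -> str:
    req = urllib.request.Request(
        VLLM_URL,
        data=json.dumps(chat_payload(user_msg)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]

# Example usage (requires the vLLM server above to be running):
# print(chat("Write an SDXL prompt for a misty forest."))
```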
## Recommended Alternative
If you want both models at higher precision or need to add more components (ControlNet, refiner, larger LLM), the RTX 5090 with 32GB provides the extra headroom. For the ultimate multi-model setup, see whether the RTX 5090 can run multiple LLMs at once.
For dedicated image generation, check the RTX 4060 Ti SDXL guide. For dedicated LLM work on the 3090, see the LLaMA 3 8B FP16 guide or CodeLlama 34B guide. Browse all multi-model configurations on our dedicated GPU servers page or compare in the best GPU for inference guide.
## Deploy This Model Now
Dedicated GPU servers with the VRAM you need. UK datacenter, full root access.
Browse GPU Servers