The AI Content Creation Stack
Content teams increasingly need both AI text generation and image generation. Blog posts, social media, marketing materials, and product descriptions all benefit from combining LLMs for text with diffusion models for visuals. Running both on a single dedicated GPU server is cost-effective and simplifies your infrastructure.
The challenge is fitting both models into one GPU’s VRAM while maintaining acceptable generation speeds. With the right model selection and memory management, a single 24 GB GPU handles both workloads efficiently. Explore more content-focused setups in our use cases section.
Model Selection: Image and Text
Choose models that balance quality with VRAM efficiency.
| Task | Model | VRAM (loaded) | Generation Speed |
|---|---|---|---|
| Text generation | Llama 3 8B (AWQ 4-bit) | ~4.5 GB | ~90 tok/s |
| Text generation | Mistral 7B (AWQ 4-bit) | ~4 GB | ~95 tok/s |
| Image generation | SDXL 1.0 (FP16) | ~6.5 GB | ~8s per 1024×1024 |
| Image generation | FLUX.1 schnell (FP8) | ~12 GB | ~4s per 1024×1024 |
| Image generation | SD 1.5 (FP16) | ~3.5 GB | ~3s per 512×512 |
For content creation, SDXL strikes the best balance between image quality and VRAM usage. Paired with a 4-bit 7-8B text model, both fit comfortably on a 24 GB GPU. Check text model performance on our tokens per second benchmark.
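The throughput figures above translate directly into wall-clock time for a piece of content. A quick helper makes the arithmetic concrete (the ~1.3 tokens-per-word ratio for English text is a rule-of-thumb assumption, not a measured value):

```python
def generation_time_seconds(words: int, tok_per_s: float, tokens_per_word: float = 1.3) -> float:
    """Estimate wall-clock time to generate `words` words at a given decode speed."""
    return words * tokens_per_word / tok_per_s

# An 800-word blog post at Llama 3 8B's ~90 tok/s:
print(round(generation_time_seconds(800, 90), 1))  # → 11.6
```

Roughly ten seconds per blog post means the text model is rarely the bottleneck; image generation dominates total pipeline time.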
VRAM Planning for Dual Workloads
Running both models requires careful VRAM budgeting on an RTX 3090 (24 GB).
| Component | VRAM Used | Running Total |
|---|---|---|
| Llama 3 8B (AWQ 4-bit) weights | 4.5 GB | 4.5 GB |
| KV cache (batch 4, 2K context) | 1.5 GB | 6.0 GB |
| SDXL weights (FP16) | 6.5 GB | 12.5 GB |
| SDXL working memory (1 image) | 3.0 GB | 15.5 GB |
| PyTorch overhead + CUDA context | 2.0 GB | 17.5 GB |
| Available headroom | 6.5 GB | 24 GB total |
With 6.5 GB headroom, you have room for larger batch sizes or higher resolution image generation. If VRAM is tight, consider loading models on demand rather than simultaneously. For memory management techniques, see our vLLM memory optimisation guide.
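The budget above is simple addition, but it is easy to lose track when tweaking batch sizes or resolutions. A small helper (component names are illustrative; the sizes come from the table) keeps the arithmetic honest:

```python
def vram_headroom(total_gb: float, components: dict[str, float]) -> float:
    """Return remaining VRAM after loading the listed components; negative means OOM."""
    return total_gb - sum(components.values())

budget = {
    "llama3_8b_awq_weights": 4.5,
    "kv_cache_b4_2k": 1.5,
    "sdxl_fp16_weights": 6.5,
    "sdxl_working_1img": 3.0,
    "torch_cuda_overhead": 2.0,
}
print(vram_headroom(24.0, budget))  # → 6.5
```

Re-running the check with a 16 GB card shows the same stack going 1.5 GB over budget, which is why 24 GB is the practical floor for dual workloads.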
Single-GPU Architecture
Option A: Both models loaded simultaneously. Keep both the LLM and diffusion model in VRAM at all times. Requests route to the appropriate model based on type. This approach has zero model loading latency but uses more VRAM. Best for workflows that alternate rapidly between text and image generation.
Option B: Dynamic model swapping. Load only the active model into VRAM. When switching from text to image generation, offload the LLM weights to CPU RAM and load the diffusion model. Swap time is 5-15 seconds on NVMe storage. Best for batch workflows (generate all text first, then all images).
For most content creation pipelines, Option A is preferred. The always-ready architecture supports interactive workflows where a content creator generates text, requests an image, refines text, and generates another image. Run the text model via vLLM or Ollama, and the image model via ComfyUI or a custom Diffusers API.
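With Option A, routing reduces to a thin dispatch layer that forwards each request to the always-loaded backend that handles it. A minimal sketch (the endpoint ports and request shape are assumptions, not a fixed API):

```python
TEXT_ENDPOINT = "http://localhost:8000/v1/completions"   # vLLM (OpenAI-compatible)
IMAGE_ENDPOINT = "http://localhost:8001/generate"        # image generation service

def route(request: dict) -> str:
    """Pick the backend endpoint based on the request type."""
    kind = request.get("type")
    if kind == "text":
        return TEXT_ENDPOINT
    if kind == "image":
        return IMAGE_ENDPOINT
    raise ValueError(f"unknown request type: {kind!r}")

print(route({"type": "image", "prompt": "product hero shot"}))
```

Because both models stay resident, the router never waits on a model load; the worst case is queueing behind another generation on the shared GPU.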
Workflow Setup and Configuration
Here is a practical setup for running both models on a single server.
```bash
# Terminal 1: Start text generation (vLLM)
python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-3-8B-AWQ \
  --quantization awq \
  --gpu-memory-utilization 0.30 \
  --max-model-len 2048 \
  --port 8000

# Terminal 2: Start image generation (Diffusers API)
python image_server.py \
  --model stabilityai/stable-diffusion-xl-base-1.0 \
  --port 8001

# Both services share the same GPU:
# vLLM reserves ~7 GB (0.30 × 24 GB), SDXL uses ~10 GB, leaving ~7 GB headroom
```
Set `--gpu-memory-utilization 0.30` on vLLM to cap its allocation at roughly 7 GB (0.30 × 24 GB), leaving room for the image model. This is far below the default of 0.90, but sufficient for content-length text generation at moderate batch sizes.
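The `image_server.py` referenced above is not a stock script. A minimal sketch of what it might look like, wrapping SDXL via Diffusers in a FastAPI endpoint (the endpoint path, request fields, and defaults are all assumptions for illustration):

```python
# Hypothetical image_server.py: SDXL behind a small HTTP endpoint.
# Requires: pip install torch diffusers fastapi uvicorn
import io

import torch
import uvicorn
from diffusers import StableDiffusionXLPipeline
from fastapi import FastAPI
from fastapi.responses import Response

app = FastAPI()
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,  # FP16 weights: ~6.5 GB, as budgeted above
).to("cuda")

@app.post("/generate")
def generate(body: dict) -> Response:
    image = pipe(
        body["prompt"],
        height=body.get("height", 1024),
        width=body.get("width", 1024),
        num_inference_steps=body.get("steps", 30),
    ).images[0]
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    return Response(content=buf.getvalue(), media_type="image/png")

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8001)
```

A production version would add request queueing and an output-size limit, but this is enough to serve one image at a time alongside vLLM.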
For content teams needing an all-in-one interface, tools like Open WebUI connect to both endpoints and provide a unified chat and image generation experience. For API-first setups, see our API hosting options.
When to Scale Beyond One GPU
A single GPU handles content creation workflows for small to medium teams (1-5 concurrent users). Scale to a second GPU when:
- Image generation queues exceed 30 seconds during peak usage
- You need to generate images and text simultaneously for multiple users
- You want to upgrade to FLUX.1 (12 GB) alongside a larger text model
- Production SLAs require sub-5-second image generation consistently
With a second GPU, dedicate one to text and one to images. This eliminates resource contention and lets each model use full VRAM. Explore multi-GPU clusters for this setup.
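Pinning each service to its own card is a one-line change per process via `CUDA_VISIBLE_DEVICES` (a sketch; the flags mirror the single-GPU launch commands, and with a dedicated card vLLM can reclaim the default memory fraction):

```bash
# GPU 0: text generation only — vLLM can now take most of the card
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-3-8B-AWQ --quantization awq \
  --gpu-memory-utilization 0.90 --port 8000 &

# GPU 1: image generation only
CUDA_VISIBLE_DEVICES=1 python image_server.py \
  --model stabilityai/stable-diffusion-xl-base-1.0 --port 8001 &
```

Each process sees only its assigned GPU as device 0, so no application-level changes are needed.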
Compare the cost of self-hosted dual-model serving against using separate API services (OpenAI for text, Stability for images) with the GPU vs API cost comparison. At even moderate usage, a single dedicated GPU server is dramatically cheaper. Use the LLM cost calculator for precise estimates.
One Server for All Your AI Content Needs
Run text and image generation on a single GigaGPU dedicated server. UK-hosted, 24 GB VRAM, ready for production content workflows.
Browse GPU Servers