6 GB cards (RTX 3050 6 GB, GTX 1660, laptop RTX 4050, or even an 8 GB 4060 once OS and framework overhead eat into effective VRAM) still power a surprising amount of real AI work: edge devices, classification-at-scale boxes, hobbyist workstations, and low-traffic inference servers. This guide lists what fits in 6 GB, what definitively does not, and where the upgrade ramp to a 16 GB RTX 5060 Ti pays off, all in the context of our UK dedicated GPU hosting.
LLMs that fit in 6 GB
The 6 GB line places a hard ceiling at roughly 3B parameters in FP16, 7B in AWQ INT4, or small models in FP8 on cards that support it. Practical recommendations:
| Model | Params | Precision | Weights | Max context (6 GB) |
|---|---|---|---|---|
| TinyLlama 1.1B | 1.1B | FP16 | 2.2 GB | 16k |
| Llama 3.2 1B | 1.2B | FP16 | 2.5 GB | 128k (KV dominant) |
| Llama 3.2 3B | 3.2B | FP16 | 6.4 GB | Doesn't fit (weights alone exceed 6 GB); use FP8/AWQ |
| Llama 3.2 3B | 3.2B | AWQ INT4 | 2.2 GB | 32k |
| Phi-3 mini 3.8B | 3.8B | FP8 | 3.8 GB | 8k |
| Phi-3 mini 3.8B | 3.8B | AWQ INT4 | 2.6 GB | 128k |
| Gemma 2 2B | 2.6B | FP16 | 5.2 GB | 2k-4k |
| Gemma 2 2B | 2.6B | AWQ INT4 | 1.8 GB | 8k |
| Qwen 2.5 1.5B | 1.5B | FP16 | 3.0 GB | 32k |
| Qwen 2.5 3B | 3.1B | AWQ INT4 | 2.1 GB | 32k |
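The context limits in the table come down to weights plus KV cache fitting under 6 GB. A rough planner can be sketched as below; the architecture numbers (layer count, KV heads, head dim) are assumptions taken from published model configs, so verify against your model's `config.json` before relying on them.

```python
# Rough VRAM planner: FP16 KV cache grows linearly with context length.
# K and V each store n_kv_heads * head_dim values per layer per token.

def kv_cache_gb(context_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """KV cache size in GiB; bytes_per_elem=2 for FP16, 1 for FP8/INT8 KV."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return context_len * per_token / 1024**3

def fits_in_6gb(weights_gb, context_len, n_layers, n_kv_heads, head_dim,
                overhead_gb=0.6):
    """Total footprint vs the 6 GB line. overhead_gb is an assumed
    allowance for the CUDA context and activations."""
    total = (weights_gb
             + kv_cache_gb(context_len, n_layers, n_kv_heads, head_dim)
             + overhead_gb)
    return total, total <= 6.0

# Llama 3.2 3B at AWQ INT4 (2.2 GB weights), 16k context, FP16 KV cache.
# 28 layers, 8 KV heads, head_dim 128 -- assumed from the published config.
total, ok = fits_in_6gb(2.2, 16_384, 28, 8, 128)
print(f"{total:.2f} GB, fits: {ok}")
```

At full 32k context the FP16 KV cache alone reaches about 3.5 GB on these assumptions, so long-context deployments on 6 GB typically also quantize the KV cache (`bytes_per_elem=1`) to keep headroom.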
Image models that fit
| Model | VRAM | Typical speed (6 GB card) | Fits? |
|---|---|---|---|
| Stable Diffusion 1.5 | ~3.5 GB | 1.5 s/image at 512px, 20 steps (RTX 3050) | Yes |
| SD 1.5 + ControlNet | ~4.5 GB | 2.5 s/image | Yes |
| SDXL base 1.0 | ~9 GB | n/a | No (CPU offload only) |
| SDXL Turbo (1-step) | ~7 GB | n/a | Marginal, not recommended |
| SD 3 Medium | ~11 GB | n/a | No |
| FLUX.1 schnell | ~23 GB FP16 | n/a | No |
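For capacity planning, the per-image latencies above convert directly into hourly and daily throughput. A minimal helper, using the table's RTX 3050 figure as the example input:

```python
# Convert seconds-per-image into throughput for a single card.

def images_per_hour(seconds_per_image):
    """Images one card produces per hour at full utilisation."""
    return int(3600 / seconds_per_image)

# SD 1.5 at 1.5 s/image (512px, 20 steps, per the table above):
hourly = images_per_hour(1.5)
daily = hourly * 24
print(hourly, daily)  # 2400 images/hour, 57600 images/day
```

Real pipelines lose some of this to model loading, queueing, and prompt variance, so treat these as upper bounds.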
Other useful models
| Model | Purpose | VRAM | Throughput on 6 GB |
|---|---|---|---|
| Whisper large-v3 | Speech-to-text | ~3.2 GB (FP16) | ~8x real time |
| Whisper medium | Speech-to-text | ~1.5 GB | ~20x real time |
| Silero VAD | Voice activity detection | ~150 MB | Real-time on CPU or GPU |
| BGE-M3 embeddings | Dense + sparse embeddings | ~2.2 GB | ~300 docs/sec |
| E5-large-v2 | Dense embeddings | ~1.4 GB | ~500 docs/sec |
| YOLOv8n / v8s TRT | Object detection | ~300 MB | 500-700 FPS |
| NLLB-200-distilled-600M | Translation | ~2.5 GB | ~1,200 tokens/s bs=32 |
| DistilBERT-base | Classification | ~300 MB | ~4,000 samples/s |
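The real-time factors in the speech rows size a transcription box directly: an N× real-time model clears N hours of audio per wall-clock hour. A quick sizing sketch, where the utilisation factor is an assumed planning allowance for queueing and I/O:

```python
# Audio-hours a single card clears per day from its real-time factor.

def audio_hours_per_day(realtime_factor, utilisation=0.8):
    """utilisation < 1.0 leaves headroom for queueing and I/O (assumed)."""
    return 24 * realtime_factor * utilisation

# Whisper medium at ~20x real time (per the table):
print(audio_hours_per_day(20))  # ~384 audio-hours/day
# Whisper large-v3 at ~8x real time:
print(audio_hours_per_day(8))   # ~153.6 audio-hours/day
```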
What definitively does not fit
- Llama 3.1 8B at anything above AWQ INT4, and even INT4 only with ~2k context (5.4 GB of weights plus KV cache overflows 6 GB past ~4k).
- Mistral 7B, Qwen 2.5 7B: same story, AWQ only and with painful context limits.
- Gemma 2 9B: even AWQ INT4 weights land close to 6 GB, leaving no room for the KV cache.
- Qwen 2.5 14B, Gemma 2 27B, Llama 3.1 70B: nowhere near.
- SDXL, SD3, FLUX: even SDXL with offload is painful (12+ seconds per image).
- Multimodal LLMs: LLaVA-1.6 13B, Qwen2-VL 7B, InternVL 8B all need 10 GB+.
Use cases where 6 GB still wins
- Edge inference: on-device classification, YOLO for retail cameras, Whisper medium for call centre transcription.
- Hobbyist SD 1.5 workflows: ControlNet pipelines and LoRA training at 512px are both feasible.
- Classification at scale: DistilBERT, E5 embeddings, BGE-M3 at 500+ docs/s.
- Small LLM tasks: Phi-3 mini or Llama 3.2 3B for routing, summarisation, and tool-call dispatch.
When to step up to 16 GB
If any of the following apply, a 16 GB card such as the RTX 5060 Ti is the right next rung:
- You want 8B LLMs at FP8 (see 8B VRAM requirements).
- You need SDXL or FLUX schnell.
- You want to run Gemma 2 9B at FP8 (see Gemma 2 guide).
- You need 30+ simultaneous YOLO streams (see YOLO guide).
- You need 14B coder models like Qwen 2.5 Coder 14B (see Qwen Coder 14B guide).
Outgrown 6 GB? Step up to 16 GB without overpaying.
Blackwell, FP8, 16 GB GDDR7, 180 W. UK dedicated hosting.
Order the RTX 5060 Ti 16GB
See also: 8B LLM VRAM requirements, max model size on 5060 Ti, Gemma 2 guide, YOLO guide, computer vision hosting.