Most teams over-spec the GPU for embeddings. BGE-large is ~335M parameters, nomic-embed is 137M, and even ColBERT-style rerankers come in under 500M. None of these need a flagship GPU; they need throughput-per-pound. The right card is meaningfully smaller and cheaper than your LLM host.
For embedding-only deployments, the RTX 3060 12 GB at £99/mo is the cost leader — ~50K embeddings/sec, plenty of VRAM headroom. Step up to a 5060 Ti 16 GB only if you also want a reranker hot-loaded. Anything bigger is over-spec.
## Why embeddings need different sizing
Embedding workloads have a different bottleneck than LLM inference:
- Tiny models — BGE-large is ~0.7 GB in FP16 (≈335M params × 2 bytes). Fits anywhere.
- High throughput per pass — embeddings batch much better than autoregressive generation.
- Memory-bandwidth bound — weights and activations are streamed through once per pass with no reuse, so bandwidth matters more than raw compute.
- No KV cache — single forward pass per input.
This means a card with modest VRAM but solid memory bandwidth (3060 12 GB at 360 GB/s) outperforms a flagship card on cost-per-embedding.
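A back-of-envelope roofline makes the bandwidth point concrete. The sketch below is my own illustration, not a benchmark: it assumes the full FP16 weight set is streamed from VRAM once per batch and ignores activations and caching, so it is a crude bound rather than a prediction. Continuous batching in a real server can beat it by running larger effective batches.

```python
def bandwidth_bound_embeds_per_sec(bandwidth_gb_s: float,
                                   model_gb: float,
                                   batch_size: int) -> float:
    """Crude roofline: one forward pass streams the full weight set
    from VRAM once and embeds `batch_size` inputs."""
    batches_per_sec = bandwidth_gb_s / model_gb
    return batches_per_sec * batch_size

# RTX 3060: ~360 GB/s; BGE-large ~0.7 GB in FP16; batch 64
print(round(bandwidth_bound_embeds_per_sec(360, 0.7, 64)))  # ~33K/s, same order as the measured ~48K/s
```

The estimate lands within 2× of the measured figure, which is as much as a one-line roofline can promise; the point is that the answer scales with bandwidth, not with FLOPS.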
## GPU ranking for embeddings
| Rank | GPU | Throughput (BGE-large) | Cost per 1M embeds (60% util) | Notes |
|---|---|---|---|---|
| #1 | RTX 3060 12 GB | ~48K/s | £0.0011 | Cost leader |
| #2 | RTX 5060 Ti 16 GB | ~62K/s | £0.0014 | Bandwidth uplift over the 3060 |
| #3 | RTX 3050 6 GB | ~28K/s | £0.0014 | Cheapest hardware |
| #4 | RTX 4060 8 GB | ~38K/s | £0.0014 | Newer arch |
| #5 | RTX 5080 16 GB | ~95K/s | £0.0014 | Higher absolute capacity |
| #6 | RTX 5090 32 GB | ~135K/s | £0.0014 | Wasted capacity for embeds-only |
| #7 | RTX 6000 Pro | ~145K/s | £0.0040 | Dramatically over-spec |
Cost-per-embedding is roughly flat across the bottom of the catalogue — pick by throughput required.
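The cost column can be reproduced from the throughput figures. A minimal sketch, assuming a 30-day month, 60% utilisation, and the £99/mo 3060 price quoted above (monthly prices for the other cards aren't listed here); small deviations from the table come down to these assumptions.

```python
def cost_per_million_embeds(monthly_price_gbp: float,
                            embeds_per_sec: float,
                            utilisation: float = 0.6,
                            seconds_per_month: int = 30 * 24 * 3600) -> float:
    """£ per 1M embeddings for a card held at a sustained utilisation."""
    embeds_per_month = embeds_per_sec * utilisation * seconds_per_month
    return monthly_price_gbp / (embeds_per_month / 1_000_000)

# RTX 3060: £99/mo at ~48K embeds/sec
print(f"£{cost_per_million_embeds(99, 48_000):.4f}")  # £0.0013
```

Note the shape of the formula: price and throughput both move roughly in step as you go up the catalogue, which is why cost-per-embedding stays nearly flat.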
## Real throughput numbers
vLLM 0.6.3 with BGE-large-en-v1.5, batch size 64, sequence length 512:
| GPU | Embeds/sec (BGE-large) | Embeds/sec (nomic-embed) | Latency p99 |
|---|---|---|---|
| RTX 3050 6 GB | ~28K | ~52K | 24 ms |
| RTX 3060 12 GB | ~48K | ~85K | 14 ms |
| RTX 4060 8 GB | ~38K | ~68K | 17 ms |
| RTX 5060 Ti 16 GB | ~62K | ~108K | 11 ms |
| RTX 5080 16 GB | ~95K | ~165K | 8 ms |
| RTX 5090 32 GB | ~135K | ~230K | 6 ms |
## Verdict
- Cost-anchored embedding-only deployment: RTX 3060 12 GB at £99/mo. Hosts BGE-large + reranker comfortably.
- Multi-model + embedding (LLM on same card): RTX 5060 Ti 16 GB or 5090 32 GB. Don’t put a 7B LLM and embeddings on a 12 GB card.
- High-throughput embedding farm: RTX 5090 — roughly the same cost-per-embedding as the 3060 but nearly 3× the absolute throughput.
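The multi-model warning is easy to sanity-check with arithmetic: FP16 weights cost 2 bytes per parameter, so a 7B model overflows a 12 GB card on weights alone, before any KV cache, activations, or embedding model is loaded. A quick sketch using the model sizes assumed above:

```python
def fp16_weight_gb(params_billion: float) -> float:
    """FP16 weight footprint only: 2 bytes/param (no KV cache or activations)."""
    return params_billion * 2  # 2 GB per billion parameters

print(fp16_weight_gb(7))      # 14.0 GB: already over a 12 GB card
print(fp16_weight_gb(0.335))  # 0.67 GB: BGE-large fits almost anywhere
```

Quantising the LLM to 4-bit changes the arithmetic, but then you are trading LLM quality to avoid renting a second, cheaper card.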
## Bottom line
Embeddings are the rare workload where the cheapest GPU genuinely is the best. Don't waste a 5090 on it unless the same card is also serving an LLM. For RAG architectures, run embeddings on a small GPU and LLM on a separate bigger card.