Embedding workloads scale with the card, but not the way you might expect from LLM benchmarks. At small embedding-model sizes, even budget cards keep up. Here are measured throughput numbers for common embedders on our dedicated GPU hosting.
## Setup
Text Embeddings Inference (TEI) v1.5 Docker container, FP16, 200-token input per document, batch size tuned per card to saturate VRAM.
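The throughput figures below come from timing batches against the server. A minimal sketch of that measurement, assuming a TEI container listening on a hypothetical `localhost:8080` and its `/embed` endpoint accepting a JSON `inputs` list:

```python
import json
import time
import urllib.request

TEI_URL = "http://localhost:8080/embed"  # hypothetical: wherever your TEI container listens


def embed_batch(texts, url=TEI_URL):
    """POST one batch of documents to the TEI /embed endpoint and return the vectors."""
    body = json.dumps({"inputs": texts}).encode()
    req = urllib.request.Request(url, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


def docs_per_sec(n_docs, elapsed_s):
    """Throughput in the same units as the tables below."""
    return n_docs / elapsed_s


# Example run (requires a live TEI container):
#   batch = ["a roughly 200-token document ..."] * 256
#   t0 = time.perf_counter()
#   embed_batch(batch)
#   print(docs_per_sec(len(batch), time.perf_counter() - t0))
```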
## BGE-M3 (568M)
| GPU | Batch | Docs/sec (dense) |
|---|---|---|
| RTX 3050 6GB | 64 | ~1,400 |
| RTX 4060 8GB | 96 | ~2,100 |
| RTX 4060 Ti 16GB | 256 | ~3,400 |
| RTX 3090 | 512 | ~7,200 |
| RTX 5080 | 384 | ~9,800 |
| RTX 5090 | 768 | ~16,000 |
| RTX 6000 Pro | 2048 | ~28,000 |
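The batch sizes in the table were tuned per card, not fixed. A generic way to find that ceiling is to double the batch until the backend errors out and keep the last size that worked. A minimal sketch, assuming failures surface as `RuntimeError` (which torch's CUDA OOM error subclasses):

```python
def max_stable_batch(try_batch, start=16, limit=4096):
    """Double the batch size until try_batch(size) raises,
    then return the largest size that succeeded (None if even `start` fails)."""
    best = None
    size = start
    while size <= limit:
        try:
            try_batch(size)  # e.g. run one embedding pass at this batch size
            best = size
            size *= 2
        except RuntimeError:  # e.g. CUDA out-of-memory
            break
    return best
```

In practice you would point `try_batch` at a real embedding pass and re-run the search once per card, since the VRAM ceiling differs between a 6 GB 3050 and a 6000 Pro.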
## BGE-large (335M)
| GPU | Docs/sec |
|---|---|
| RTX 3050 | ~2,000 |
| RTX 4060 Ti | ~5,500 |
| RTX 3090 | ~10,500 |
| RTX 5080 | ~14,000 |
| RTX 5090 | ~22,000 |
| RTX 6000 Pro | ~42,000 |
## Verdict
For embedding workloads under 10k docs/sec, the 4060 Ti 16GB is usually the right economic choice. For high-volume indexing (>20k docs/sec) a 5090 or 6000 Pro pays for itself. Do not provision a 6000 Pro for embedding-only workloads unless you are pushing 100M+ documents; use the freed budget for a second card or a larger LLM.
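The document-count thresholds above translate directly into wall-clock time. A back-of-envelope helper, using the measured throughputs as an ideal sustained rate (real jobs add I/O and tokenization overhead):

```python
def indexing_hours(n_docs, docs_per_sec):
    """Wall-clock hours to embed a corpus at a sustained throughput."""
    return n_docs / docs_per_sec / 3600


# 100M documents through BGE-M3, using the table's figures:
#   4060 Ti 16GB (~3,400 docs/sec) -> roughly 8 hours
#   6000 Pro    (~28,000 docs/sec) -> roughly 1 hour
```

For a one-off backfill, an overnight run on a mid-tier card is often acceptable; the big card only pays off when re-indexing at that scale is routine.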
For the broader GPU selection see the 2026 tier ladder and VRAM per pound.
## Right-Sized Embedding Hosting
We match GPU tier to your expected document throughput.
Browse GPU Servers · See batch tuning.