Why Embedding Generation Needs GPU Acceleration
Embedding models convert text into dense vector representations used by search engines, RAG pipelines, and recommendation systems. While individual embedding calls are fast, production workloads involve millions of documents. Generating embeddings for a 10-million-document corpus on CPU can take days; on a dedicated GPU server the same job finishes in hours.
GigaGPU’s infrastructure supports running embedding models alongside LLMs on the same GPU. Whether you are building indexes for FAISS, Qdrant, Weaviate, or ChromaDB, this guide helps you pick the right GPU for your embedding throughput and budget requirements.
Embedding Model Overview: BERT, E5, BGE
The three most popular embedding model families differ in size, quality, and GPU requirements. All are significantly faster on GPU than CPU.
| Model | Parameters | Dimensions | VRAM (FP16) | Best For |
|---|---|---|---|---|
| BERT-base | 110M | 768 | ~0.5 GB | Legacy pipelines, fine-tuned models |
| E5-large-v2 | 335M | 1024 | ~1.2 GB | High-quality retrieval |
| BGE-large-en-v1.5 | 335M | 1024 | ~1.2 GB | RAG pipelines, LlamaIndex/LangChain |
| BGE-small-en-v1.5 | 33M | 384 | ~0.2 GB | Low-latency, edge deployment |
| E5-mistral-7b-instruct | 7B | 4096 | ~14 GB | Highest quality, compute-heavy |
Most production RAG deployments use BGE-large or E5-large, which offer the best balance of quality and speed. For the full RAG stack, see our best GPU for RAG pipelines guide.
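Getting started takes only a few lines with sentence-transformers. A minimal sketch (the model ID is the public Hugging Face checkpoint; the batch size is illustrative, not tuned):

```python
# Minimal sketch: GPU-accelerated embedding with sentence-transformers.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")

passages = [
    "GPU servers accelerate embedding generation for RAG pipelines.",
    "Vector databases store embeddings for similarity search.",
]

# BGE models are trained for cosine similarity on unit-length vectors,
# so normalizing at encode time is the usual practice.
embeddings = model.encode(passages, batch_size=64, normalize_embeddings=True)
print(embeddings.shape)  # (2, 1024): BGE-large outputs 1024-dim vectors
```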
Embedding Throughput Benchmarks by GPU
We benchmarked three popular models encoding 256-token passages at batch sizes of 64 and 256. Throughput is measured in passages per second.
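If you want to reproduce these numbers on your own hardware, a simplified timing sketch along these lines works (not our exact harness, and the synthetic passages only approximate 256 tokens):

```python
# Simplified throughput measurement: passages encoded per second.
import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")
passages = ["lorem ipsum dolor sit amet " * 48] * 4096  # roughly 256 tokens each

# Warm-up so one-time CUDA initialization doesn't skew the timing.
model.encode(passages[:256], batch_size=256)

start = time.perf_counter()
model.encode(passages, batch_size=256)
elapsed = time.perf_counter() - start
print(f"{len(passages) / elapsed:,.0f} passages/sec")
```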
BGE-large-en-v1.5 (335M params)
| GPU | VRAM | Passages/sec (bs=64) | Passages/sec (bs=256) | Server $/hr |
|---|---|---|---|---|
| RTX 5090 | 32 GB | 2,840 | 3,460 | $1.80 |
| RTX 5080 | 16 GB | 1,920 | 2,310 | $0.85 |
| RTX 3090 | 24 GB | 1,380 | 1,720 | $0.45 |
| RTX 4060 Ti | 16 GB | 980 | 1,180 | $0.35 |
| RTX 4060 | 8 GB | 620 | 740 | $0.20 |
| RTX 3050 | 8 GB | 310 | 370 | $0.10 |
E5-large-v2 (335M params)
| GPU | Passages/sec (bs=64) | Passages/sec (bs=256) |
|---|---|---|
| RTX 5090 | 2,780 | 3,390 |
| RTX 5080 | 1,870 | 2,260 |
| RTX 3090 | 1,350 | 1,680 |
| RTX 4060 Ti | 960 | 1,150 |
| RTX 4060 | 605 | 720 |
| RTX 3050 | 300 | 360 |
BERT-base (110M params)
| GPU | Passages/sec (bs=64) | Passages/sec (bs=256) |
|---|---|---|
| RTX 5090 | 5,200 | 6,850 |
| RTX 5080 | 3,510 | 4,620 |
| RTX 3090 | 2,540 | 3,350 |
| RTX 4060 Ti | 1,810 | 2,380 |
| RTX 4060 | 1,140 | 1,500 |
| RTX 3050 | 570 | 750 |
Cost per Million Embeddings
We calculated the cost to embed one million 256-token passages using BGE-large at batch size 256, assuming sustained throughput on a dedicated server.
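The arithmetic is simple: divide one million by the sustained throughput to get the runtime, then multiply by the hourly server price. A quick sketch using the RTX 3090 figures from the tables above:

```python
# Cost to embed 1M passages at a sustained rate on an hourly server.
def cost_per_million(passages_per_sec: float, dollars_per_hour: float) -> float:
    hours = 1_000_000 / passages_per_sec / 3600
    return hours * dollars_per_hour

# RTX 3090 at bs=256: 1,720 passages/sec at $0.45/hr
print(f"${cost_per_million(1720, 0.45):.3f}")  # ~$0.073
```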
| GPU | Time for 1M Passages | Cost (1M Embeddings) | OpenAI Equivalent* |
|---|---|---|---|
| RTX 5090 | 4.8 min | $0.144 | $5.12 |
| RTX 5080 | 7.2 min | $0.102 | $5.12 |
| RTX 3090 | 9.7 min | $0.073 | $5.12 |
| RTX 4060 Ti | 14.1 min | $0.082 | $5.12 |
| RTX 4060 | 22.5 min | $0.075 | $5.12 |
| RTX 3050 | 45.0 min | $0.075 | $5.12 |
*OpenAI text-embedding-3-small at $0.02/1M tokens; 1M passages × 256 tokens = 256M tokens ≈ $5.12 per 1M passages.
Self-hosting is dramatically cheaper than API pricing even at modest scale, and the gap widens as volume increases. See our cost calculator for interactive estimates.
Batch Size Scaling and VRAM Usage
Larger batch sizes improve throughput but consume more VRAM. The embedding model weights are small, so batch size is the main VRAM driver. This matters when co-locating with an LLM for RAG pipelines. A sketch for measuring the trade-off on your own hardware follows the table below.
| Batch Size | VRAM (BGE-large) | Throughput Gain vs bs=1 |
|---|---|---|
| 1 | ~1.4 GB | 1x (baseline) |
| 32 | ~2.1 GB | ~8x |
| 64 | ~2.8 GB | ~12x |
| 256 | ~5.2 GB | ~15x |
| 512 | ~8.4 GB | ~16x |
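To find the sweet spot on your own card, a sweep along these lines works (a rough sketch; the peak-memory readout only tracks PyTorch allocations, so real usage runs slightly higher):

```python
# Sketch: sweep batch sizes and record throughput plus peak VRAM.
import time
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")
passages = ["sample passage text " * 60] * 2048

for bs in (1, 32, 64, 256, 512):
    torch.cuda.reset_peak_memory_stats()
    start = time.perf_counter()
    model.encode(passages, batch_size=bs)
    elapsed = time.perf_counter() - start
    peak_gb = torch.cuda.max_memory_allocated() / 1e9
    print(f"bs={bs:>3}: {len(passages) / elapsed:>6.0f} passages/sec, "
          f"peak VRAM {peak_gb:.1f} GB")
```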
Integrating with RAG Pipelines and Vector Databases
Embeddings feed into vector databases for similarity search. The GPU handles embedding generation while the vector store handles indexing and retrieval. Popular pairings include BGE-large with Qdrant for filtered search, FAISS for raw speed, and ChromaDB for simplicity. See our vector database comparison for detailed trade-offs.
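As a concrete example, here is a minimal sketch of the FAISS pairing: encode on the GPU, then index the normalized vectors for cosine-similarity search (the document texts are placeholders, and FAISS itself runs on CPU here):

```python
# Sketch: GPU embeddings feeding a FAISS exact-search index.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5", device="cuda")
docs = ["GPU servers cut embedding costs.", "Vector search powers RAG."]

# Inner product over normalized vectors is cosine similarity.
vecs = model.encode(docs, normalize_embeddings=True).astype(np.float32)
index = faiss.IndexFlatIP(vecs.shape[1])  # 1024 dims for BGE-large
index.add(vecs)

query = model.encode(
    ["how do I lower embedding costs?"], normalize_embeddings=True
).astype(np.float32)
scores, ids = index.search(query, 1)
print(docs[ids[0][0]], scores[0][0])
```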
For the complete RAG orchestration layer, pair your embedding GPU with LangChain or LlamaIndex. Both frameworks support local embedding endpoints natively.
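With LangChain, for instance, a local model drops in where a hosted API would otherwise go. A sketch, assuming the langchain-huggingface integration package is installed (parameter values are illustrative):

```python
# Sketch: LangChain using a local GPU embedding model instead of a paid API.
from langchain_huggingface import HuggingFaceEmbeddings

embedder = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",
    model_kwargs={"device": "cuda"},
    encode_kwargs={"normalize_embeddings": True, "batch_size": 64},
)

vector = embedder.embed_query("Which GPU should I use for embeddings?")
print(len(vector))  # 1024
```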
GPU Recommendations
Best overall: RTX 3090. At $0.45/hr the RTX 3090 embeds one million passages for $0.073, a small fraction of the OpenAI equivalent. The 24 GB of VRAM handles large batch sizes and still leaves room for an LLM on the same card.
Best for high-volume indexing: RTX 5090. If you regularly re-index millions of documents, the 5090’s 3,460 passages/sec at large batch sizes cuts indexing time significantly. Worth the premium for production RAG pipelines with frequent updates.
Best budget: RTX 4060. Embeds one million passages in 22.5 minutes for $0.075. Good for development and moderate-scale production workloads.
Best for co-located stacks: RTX 5080. The 16 GB VRAM supports running BGE-large alongside a quantised 7B LLM, keeping your entire LlamaIndex or LangChain stack on a single GPU.
Run Embedding Models on Dedicated GPUs
GigaGPU servers support sentence-transformers, TEI, and custom embedding endpoints. Generate millions of embeddings without API rate limits or per-token fees.
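Once a TEI (Text Embeddings Inference) container is serving your model, for example, any HTTP client can request embeddings. A sketch; the launch command, image tag, and port mapping are illustrative:

```python
# Sketch: requesting embeddings from a running TEI endpoint.
# Assumes TEI is already serving the model on port 8080, e.g. launched with:
#   docker run --gpus all -p 8080:80 \
#     ghcr.io/huggingface/text-embeddings-inference:latest \
#     --model-id BAAI/bge-large-en-v1.5
import requests

resp = requests.post(
    "http://localhost:8080/embed",
    json={"inputs": ["Embed this passage on the GPU."]},
)
resp.raise_for_status()
print(len(resp.json()[0]))  # 1024-dim vector
```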
Browse GPU Servers