The sentence-transformers library defaults to batch_size=32. On dedicated GPU hosting that is far too low for almost every modern embedding model. Raising it typically yields 5-10x throughput at zero cost.
Why the Default Is Low
sentence-transformers targets users on laptops and older GPUs, and a default batch size of 32 works everywhere. On a modern dedicated GPU with 16-96 GB of VRAM, though, 32 leaves most of the compute idle.
Tune
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-m3", device="cuda")
embeddings = model.encode(
    documents,
    batch_size=256,  # default is 32
    show_progress_bar=True,
    convert_to_numpy=True,
)
```
Start at 128 and double until you hit OOM or throughput plateaus. On a 24 GB card with BGE-M3, 512 is often the sweet spot; on a 16 GB 4060 Ti, 256 is a safer ceiling.
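That doubling search is easy to automate. A minimal sketch, assuming a hypothetical `try_encode` callable that runs one batch and lets PyTorch's CUDA OOM `RuntimeError` propagate (neither helper is part of sentence-transformers):

```python
def find_max_batch_size(try_encode, start=128, ceiling=4096):
    """Double the batch size until try_encode raises (PyTorch signals
    CUDA OOM as a RuntimeError), then return the last size that worked."""
    best = None
    size = start
    while size <= ceiling:
        try:
            try_encode(size)
        except RuntimeError:  # CUDA out of memory
            break
        best = size
        size *= 2
    return best

# Example wiring (hypothetical): probe with one representative batch.
# find_max_batch_size(lambda n: model.encode(documents[:n], batch_size=n))
```

Between probes it can help to call `torch.cuda.empty_cache()` so a failed allocation doesn't poison the next, smaller attempt.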
Numbers
BGE-M3 on an RTX 3090 (24 GB):
| Batch | Docs/sec |
|---|---|
| 32 (default) | ~800 |
| 128 | ~2,800 |
| 256 | ~4,400 |
| 512 | ~5,800 |
| 1024 | OOM |
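Numbers like these are easy to reproduce on your own hardware with a wall-clock helper. A sketch (`docs_per_sec` is not part of sentence-transformers; `model.encode` and `documents` come from the snippet above):

```python
import time

def docs_per_sec(encode, docs, batch_size):
    """Time one full pass over docs and return throughput in docs/sec."""
    start = time.perf_counter()
    encode(docs, batch_size=batch_size)
    elapsed = time.perf_counter() - start
    return len(docs) / elapsed

# Usage (hypothetical): docs_per_sec(model.encode, documents, 256)
```

Run one untimed warm-up pass first: the first CUDA batch includes kernel compilation and allocator overhead and will understate steady-state throughput.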
When to Switch to TEI
For one-shot batch indexing, sentence-transformers with a big batch is fine. For an HTTP embedding service serving production queries, Text Embeddings Inference (TEI) is faster and has built-in dynamic batching for heterogeneous request sizes. See BGE-M3 self-hosted for the TEI setup.