
Sentence Transformers GPU Batch Tuning

sentence-transformers defaults to small batches. On a dedicated GPU, larger batches deliver 5-10x the throughput, and the right number depends on the model and the card's VRAM.

The sentence-transformers library defaults to batch_size=32. On dedicated GPU hosting that is far too low for almost every modern embedding model; raising it typically yields a 5-10x throughput gain at no extra cost.


Why the Default Is Low

sentence-transformers targets users on laptops and older GPUs, so the default batch size of 32 works everywhere. On a modern dedicated GPU with 16-96 GB of VRAM, a batch of 32 leaves most of the compute idle.

Tune

from sentence_transformers import SentenceTransformer

# Load the model onto the GPU explicitly
model = SentenceTransformer("BAAI/bge-m3", device="cuda")

embeddings = model.encode(
    documents,
    batch_size=256,          # default is 32; raise until OOM or a throughput plateau
    show_progress_bar=True,
    convert_to_numpy=True,   # return a NumPy array instead of torch tensors
)

Start at 128 and keep doubling until you hit an out-of-memory error or throughput plateaus. On a 24 GB card running BGE-M3, 512 is often the sweet spot; on a 16 GB 4060 Ti, 256.
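The doubling search above can be sketched as a small helper. The `fits` probe here is a stand-in: in practice it would run a trial `model.encode` on a sample of documents and return False on `torch.cuda.OutOfMemoryError`.

```python
def largest_fitting_batch(fits, start=128, ceiling=4096):
    """Double the batch size until the probe fails or the ceiling is passed.

    `fits` is a callable taking a batch size and returning True if a trial
    encode at that size succeeds (in practice: call model.encode on a
    sample and catch torch.cuda.OutOfMemoryError).
    """
    best = None
    batch = start
    while batch <= ceiling:
        if not fits(batch):
            break
        best = batch
        batch *= 2
    return best

# Fake probe mimicking a card that OOMs above 512:
print(largest_fitting_batch(lambda b: b <= 512))  # -> 512
```

Stepping back one notch from the first failing size, as this does, is usually safer than running right at the OOM boundary, since real document lengths vary.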

Numbers

BGE-M3 on a 3090 24 GB:

Batch size | Docs/sec
32 (default) | ~800
128 | ~2,800
256 | ~4,400
512 | ~5,800
1024 | OOM
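A minimal harness for producing docs/sec figures like these might look as follows; `encode_fn` stands in for `model.encode`, and the warm-up pass (shown as a comment) matters on a real GPU because the first batch pays kernel-compilation and allocation costs.

```python
import time

def docs_per_sec(encode_fn, docs, batch_size):
    """Time one full pass over `docs` and return documents per second.

    On a real GPU, run one warm-up call first, e.g.:
        encode_fn(docs[:batch_size], batch_size=batch_size)
    """
    start = time.perf_counter()
    encode_fn(docs, batch_size=batch_size)
    elapsed = time.perf_counter() - start
    return len(docs) / elapsed
```

Measure each candidate batch size over the same corpus and keep the one where the curve flattens out.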

When to Switch to TEI

For one-shot batch indexing, sentence-transformers with a big batch is fine. For an HTTP embedding service serving production queries, Text Embeddings Inference (TEI) is faster and has built-in dynamic batching for heterogeneous request sizes. See BGE-M3 self-hosted for the TEI setup.
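As a rough sketch of such a deployment (the image tag and port mapping here are assumptions; check the TEI releases for the variant matching your CUDA version):

```shell
# Launch TEI serving BGE-M3 on port 8080 (image tag assumed)
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-embeddings-inference:latest \
  --model-id BAAI/bge-m3

# Request an embedding over HTTP
curl -s http://localhost:8080/embed \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "example document"}'
```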



We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
