Embedding generation is the backbone of RAG and semantic search. The RTX 5060 Ti 16GB available in our hosting lineup is a workhorse for this workload: high parallelism on small models plus FP8 support.
Setup
- Text Embeddings Inference (TEI) 1.5 (a minimal client sketch follows this list)
- Input: 256-token sentences, truncation left at defaults
- Metric: texts per second (texts/s)
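As a reference point, this is roughly how a client talks to TEI; a minimal sketch, assuming a local instance on port 8080. The docker command, port, and model id are illustrative and may differ in your deployment; the `/embed` route with `inputs`/`truncate` fields follows TEI's HTTP API.

```python
# Minimal client for a local TEI instance (assumed to be on port 8080).
# The launch command is an assumption and may need different flags/tags, e.g.:
#   docker run --gpus all -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:1.5 \
#       --model-id BAAI/bge-base-en-v1.5
import requests

TEI_URL = "http://localhost:8080"  # assumed host/port

def embed(texts: list[str]) -> list[list[float]]:
    """Return one embedding vector per input text via TEI's /embed route."""
    resp = requests.post(f"{TEI_URL}/embed", json={"inputs": texts, "truncate": True})
    resp.raise_for_status()
    return resp.json()

vectors = embed(["retrieval-augmented generation", "semantic search on GPUs"])
print(len(vectors), len(vectors[0]))  # 2 vectors, 768 dimensions for BGE-base
```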
Models
| Model | Params | Dim | Context (tokens) | FP16 VRAM |
|---|---|---|---|---|
| BGE-small-en-v1.5 | 33M | 384 | 512 | 0.3 GB |
| BGE-base-en-v1.5 | 109M | 768 | 512 | 0.7 GB |
| BGE-large-en-v1.5 | 335M | 1024 | 512 | 1.3 GB |
| E5-large-v2 | 335M | 1024 | 512 | 1.3 GB |
| Nomic-embed-text-v1.5 | 137M | 768 | 8192 | 1.0 GB |
| Snowflake-arctic-embed-l | 335M | 1024 | 512 | 1.3 GB |
Throughput by Batch
BGE-base, 256-token sentences, FP16, TEI:
| Batch | texts/s |
|---|---|
| 1 | 420 |
| 8 | 2,800 |
| 32 | 7,200 |
| 64 | 9,100 |
| 128 | 9,800 |
| 256 | 10,200 |
Throughput plateaus around 10,200 texts/s; past batch 128 the workload is memory-bandwidth bound rather than compute bound, so larger batches add little.
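A sketch of how a batch-size sweep like the one above can be run against the same assumed TEI endpoint. A single synchronous client understates the peak; the table numbers assume the server is kept saturated (e.g. with concurrent clients), so treat this as a sanity check rather than a faithful reproduction.

```python
# Rough batch-size sweep against the assumed local TEI endpoint.
import time
import requests

TEI_URL = "http://localhost:8080"
TEXT = "benchmark sentence " * 85  # stand-in for a roughly 256-token input

def texts_per_second(batch_size: int, rounds: int = 20) -> float:
    batch = [TEXT] * batch_size
    payload = {"inputs": batch, "truncate": True}
    requests.post(f"{TEI_URL}/embed", json=payload).raise_for_status()  # warm-up
    start = time.perf_counter()
    for _ in range(rounds):
        requests.post(f"{TEI_URL}/embed", json=payload).raise_for_status()
    return batch_size * rounds / (time.perf_counter() - start)

for bs in (1, 8, 32, 64, 128, 256):
    print(f"batch {bs:4d}: {texts_per_second(bs):,.0f} texts/s")
```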
TEI Per-Model Peak
| Model | Peak texts/s |
|---|---|
| BGE-small | 28,000 |
| BGE-base | 10,200 |
| BGE-large | 3,400 |
| Nomic-embed-v1.5 | 7,800 |
| Snowflake-arctic-l | 3,200 |
For reference: at 10,200 texts/s, a 10-million-document corpus (one embedding per document) takes about 16 minutes of pure embedding time, comfortably under 20 minutes.
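The arithmetic behind that claim, extended to the other models in the peak table (pure embedding time only; chunking, I/O and vector-store writes are extra):

```python
# Pure embedding time for a 10M-document corpus at the peak rates above.
peak_texts_per_s = {
    "BGE-small": 28_000,
    "BGE-base": 10_200,
    "BGE-large": 3_400,
    "Nomic-embed-v1.5": 7_800,
    "Snowflake-arctic-l": 3_200,
}
corpus_size = 10_000_000  # one embedding per document

for model, tps in peak_texts_per_s.items():
    print(f"{model:<20} ~{corpus_size / tps / 60:,.0f} min")
```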
Recommendation
- Default for RAG: BGE-base, or Nomic-embed-v1.5 when you need long context (8k tokens)
- Accuracy priority: BGE-large
- Bulk throughput: BGE-small; at 28k texts/s, re-indexing often is cheap (see the batching sketch below)
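For bulk indexing, a simple batched loop against TEI is usually enough. A minimal sketch, assuming the same local endpoint as earlier; it ignores retries and error handling, and batch size 128 is a reasonable default given the batch table above.

```python
# Batched bulk-indexing loop against the assumed local TEI endpoint.
# No retries or backpressure handling; illustrative only.
import requests

TEI_URL = "http://localhost:8080"

def embed_corpus(docs: list[str], batch_size: int = 128) -> list[list[float]]:
    """Embed docs in fixed-size batches; larger batches amortize request overhead."""
    vectors: list[list[float]] = []
    for i in range(0, len(docs), batch_size):
        resp = requests.post(
            f"{TEI_URL}/embed",
            json={"inputs": docs[i:i + batch_size], "truncate": True},
        )
        resp.raise_for_status()
        vectors.extend(resp.json())
    return vectors
```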
Embedding Throughput on Blackwell 16GB
10k texts/s on BGE-base scales easily to millions of documents. UK dedicated hosting.
Order the RTX 5060 Ti 16GB. See also: TEI server setup, reranker throughput, SaaS RAG, RAG stack install, RAG pipeline.