ColBERT v2 is a retrieval architecture that embeds each token of a document separately rather than producing one vector per document. At query time it scores by taking the maximum similarity per query token ("late interaction"), which is more accurate than single-vector search but carries higher storage and compute costs. On dedicated GPU hosting it is a legitimate alternative to dense retrieval plus reranking for some workloads.
Architecture
Each document becomes N 128-dim vectors (one per token). At query time, each query token's vector finds its most similar document-token vector; these per-token maxima are summed across query tokens to produce the score (MaxSim). Because query–document interaction happens at retrieval time, rather than being compressed away into a single pre-computed vector, the approach is called "late interaction".
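The MaxSim scoring described above can be sketched in plain NumPy. This is a toy illustration with random vectors, not a real ColBERT checkpoint; shapes and normalization match the description in the text:

```python
import numpy as np

def maxsim_score(query_vecs, doc_vecs):
    """MaxSim: for each query token, take the best-matching document
    token, then sum those maxima across query tokens."""
    # query_vecs: (Q, 128), doc_vecs: (N, 128), both rows L2-normalized
    sims = query_vecs @ doc_vecs.T        # (Q, N) cosine similarities
    return float(sims.max(axis=1).sum()) # best doc token per query token, summed

# Toy example: 4 query tokens vs. a 10-token document
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 128));  q /= np.linalg.norm(q, axis=1, keepdims=True)
d = rng.normal(size=(10, 128)); d /= np.linalg.norm(d, axis=1, keepdims=True)
score = maxsim_score(q, d)
```

Note that the document-side `max` is what makes this cheap to accelerate: each query token's search over document tokens is an independent nearest-neighbor lookup.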
Deployment
Via the ragatouille library:

from ragatouille import RAGPretrainedModel

# Load the pretrained ColBERT v2 checkpoint
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

your_documents = ["First document text...", "Second document text..."]  # list of strings

RAG.index(
    collection=your_documents,
    index_name="my_index",
    split_documents=True,  # chunk long documents before indexing
)

results = RAG.search("query text", k=10)
Indexing takes longer than single-vector embedding: every token embedding must be stored, and ColBERT v2 additionally runs k-means clustering to build its compressed index. It still runs on any reasonably sized GPU.
Storage
A document of 500 tokens stores 500 × 128-dim × 1 byte (compressed) = 64 KB. A 10M-document corpus: 640 GB of index storage. The dense-embedding equivalent (1024-dim × 4 bytes × 10M) ≈ 41 GB.
ColBERT therefore needs roughly 15x the storage of a dense index. Plan disk accordingly. ColBERT v2's residual compression (on by default) is what brings embeddings down to about 1 byte per dimension, as assumed above; the PLAID engine then accelerates search over the compressed index rather than shrinking it further.
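The storage comparison above is simple arithmetic; a back-of-envelope sketch using the same figures from the text:

```python
# Back-of-envelope storage estimate using the figures from the text
tokens_per_doc = 500
dims = 128
bytes_per_dim = 1          # ColBERT v2 residual-compressed
docs = 10_000_000

colbert_bytes = tokens_per_doc * dims * bytes_per_dim * docs  # one vector per token
dense_bytes = 1024 * 4 * docs                                 # one fp32 vector per doc

print(colbert_bytes / 1e9)              # 640.0  (GB)
print(dense_bytes / 1e9)                # 40.96  (GB)
print(colbert_bytes / dense_bytes)      # 15.625 (x)
```

The ratio scales linearly with average document length, so corpora of short passages narrow the gap considerably.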
When to use it
Pick ColBERT when:
- Retrieval accuracy is paramount and you have budget for storage
- Your corpus is medium-size (1-10M documents, not 100M+)
- Queries are complex and benefit from multi-aspect matching
Skip ColBERT for very large corpora or latency-sensitive workloads where every millisecond counts. See late interaction retrieval for the broader discussion.
For simpler retrieval see BGE-M3.