
ColBERT v2 on a GPU Server – Late Interaction Retrieval

ColBERT stores a vector per token rather than per document, enabling late-interaction scoring that beats single-vector embeddings on many retrieval tasks.

ColBERT v2 is a retrieval architecture that embeds each token of a document separately rather than producing one vector per document. At query time it scores via maximum similarity per query token (“late interaction”) – more accurate than single-vector search but with higher storage and compute costs. On dedicated GPU hosting it is a legitimate alternative to dense + rerank for some workloads.


Architecture

Each document becomes N 128-dimensional vectors (one per token). At query time, each query token's vector is compared against every document-token vector and keeps only its best match; those per-token maxima are summed across query tokens (MaxSim). Because the query–document interaction happens at retrieval time, rather than being compressed into a single pre-computed vector, the approach is called "late interaction".
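The MaxSim scoring above can be sketched in a few lines of numpy. This is an illustration of the scoring rule only, not ColBERT's actual implementation; the function name and toy shapes are ours.

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """MaxSim: for each query token, take its best-matching document
    token's similarity; sum those maxima over all query tokens."""
    # Similarity matrix: (num_query_tokens, num_doc_tokens)
    sims = query_vecs @ doc_vecs.T
    # Max over document tokens (axis=1), then sum over query tokens
    return float(sims.max(axis=1).sum())

# Toy example: 3 query tokens, 5 document tokens, 128-dim vectors
rng = np.random.default_rng(0)
q = rng.normal(size=(3, 128))
d = rng.normal(size=(5, 128))
score = maxsim_score(q, d)
```

In practice the token vectors are L2-normalised, so the dot products are cosine similarities; the index's job is to make the "max over document tokens" step fast across millions of documents.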

Deployment

Via the ragatouille library:

from ragatouille import RAGPretrainedModel

# Load the pretrained ColBERT v2 checkpoint
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")

# Build an index; split_documents chunks long documents into passages
RAG.index(
    collection=your_documents,
    index_name="my_index",
    split_documents=True,
)

# Retrieve the top-10 passages by MaxSim score
results = RAG.search("query text", k=10)

Indexing takes longer than single-vector approaches (there are far more vectors to encode, cluster, and compress), but it runs on any GPU.

Storage

A 500-token document stores 500 × 128 dims × 1 byte (compressed) = 64 KB. A 10M-document corpus therefore needs about 640 GB of index storage. The dense-embedding equivalent (1024 dims × 4 bytes × 10M documents) is roughly 41 GB.
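The arithmetic above is easy to adapt to your own corpus. A minimal sketch (function names and defaults are ours; adjust tokens-per-document and bytes-per-dimension to your compression settings):

```python
def colbert_index_bytes(num_docs: int, tokens_per_doc: int,
                        dim: int = 128, bytes_per_dim: float = 1.0) -> float:
    """Rough ColBERT index size: one compressed vector per token."""
    return num_docs * tokens_per_doc * dim * bytes_per_dim

def dense_index_bytes(num_docs: int, dim: int = 1024,
                      bytes_per_dim: float = 4.0) -> float:
    """Single float32 vector per document."""
    return num_docs * dim * bytes_per_dim

GB = 1e9
print(colbert_index_bytes(10_000_000, 500) / GB)  # 640.0
print(dense_index_bytes(10_000_000) / GB)         # 40.96
```

Average tokens per document dominates the estimate, so measure it on a sample of your real data before provisioning disk.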

ColBERT needs roughly 16x more storage per document in this example. Plan disk accordingly. Use PLAID compression (the default in ColBERT v2) to cut storage in half.

When to Use ColBERT

Pick ColBERT when:

  • Retrieval accuracy is paramount and you have budget for storage
  • Your corpus is medium-sized (1-10M documents, not 100M+)
  • Queries are complex and benefit from multi-aspect matching

Skip ColBERT for huge corpora or latency-sensitive workloads where every ms counts. See late interaction retrieval for the broader discussion.

ColBERT Indexing and Serving

Multi-vector retrieval on UK dedicated GPUs with large NVMe for ColBERT indexes.

Browse GPU Servers

For simpler retrieval see BGE-M3.



We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
