The Challenge: Describing Jewellery in Words Is Nearly Impossible
A London-based online marketplace specialising in vintage and antique jewellery lists over 150,000 unique pieces from 400 independent dealers. Their customers face a persistent frustration: describing what they want in a search box is almost impossible. How do you type a query for “Art Deco emerald cocktail ring with geometric platinum mounting”? Most shoppers arrive with a screenshot from Instagram or a photo taken at an estate sale, wanting to find something similar. The marketplace’s text-based search returns irrelevant results, and their browse-by-category approach cannot surface visually similar pieces across categories. Customer surveys reveal that 42% of visitors leave after failing to find items matching their visual inspiration.
Third-party visual search APIs exist but charge per query and require uploading customer images — and the jewellery they photograph — to external servers. For a platform handling images of pieces worth thousands of pounds, this introduces both data protection concerns and competitive intelligence risks.
AI Solution: CLIP-Based Visual Similarity Search
Visual search uses contrastive vision-language models like CLIP, SigLIP, or OpenCLIP to encode images into dense vectors that capture visual semantics. A product image of an Art Deco ring and a customer’s blurry photo of a similar ring produce vectors that are close in embedding space — even though the images look nothing alike pixel-by-pixel.
The pipeline works in two phases. Offline, every product listing image is encoded into a 512- or 768-dimensional vector and stored in a vector index. Online, a customer uploads a photo, the vision model encodes it in real time, and the system retrieves the nearest product vectors. Results appear in under 500 milliseconds, ranked by visual similarity. Hosting the entire pipeline on a dedicated GPU server keeps response times consistent and customer images private.
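The two phases can be sketched in a few lines of NumPy, with a random stub standing in for the CLIP encoder. Vectors are unit-normalised so cosine similarity reduces to a dot product:

```python
import numpy as np

# Offline phase: encode the catalogue (random vectors stand in for CLIP embeddings).
rng = np.random.default_rng(0)
catalogue = rng.standard_normal((150, 512)).astype(np.float32)
catalogue /= np.linalg.norm(catalogue, axis=1, keepdims=True)  # unit-normalise rows

def top_k(query: np.ndarray, index: np.ndarray, k: int = 10) -> np.ndarray:
    """Online phase: cosine similarity of unit vectors is just a dot product."""
    scores = index @ (query / np.linalg.norm(query))
    return np.argsort(-scores)[:k]  # indices of the k most similar items

# A query identical to catalogue item 42 should rank it first.
assert top_k(catalogue[42], catalogue)[0] == 42
```

In production the stub is replaced by a CLIP forward pass and the brute-force dot product by a vector database, but the retrieval logic is exactly this.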
GPU Requirements
CLIP models are relatively lightweight — ViT-L/14 has 428 million parameters and consumes approximately 3 GB of VRAM. The bottleneck is throughput: encoding a high-resolution product image takes 15-30ms per image on GPU, and the initial catalogue encoding of 150,000 images needs to complete in a reasonable timeframe.
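As a back-of-envelope check, single-stream encoding at 15-30ms per image would take 37.5-75 minutes for the full catalogue; the shorter times in the table below assume batched inference:

```python
# Single-stream estimate: 150,000 images at 15-30 ms each.
images = 150_000
for ms_per_image in (15, 30):
    minutes = images * ms_per_image / 1000 / 60
    print(f"{ms_per_image} ms/image -> {minutes:.1f} minutes")  # 37.5 and 75.0
# Batching (e.g. 64 images per forward pass) amortises per-image overhead
# and brings the wall-clock time down several-fold.
```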
| GPU Model | VRAM | Query Encoding (ms) | Catalogue Encode (150K images) |
|---|---|---|---|
| NVIDIA RTX 5090 | 32 GB | ~12ms | ~35 minutes |
| NVIDIA RTX 6000 Ada | 48 GB | ~15ms | ~40 minutes |
| NVIDIA RTX 6000 Pro | 96 GB | ~8ms | ~20 minutes |
For real-time visual search with sub-second responses, any GPU in the range handles the query encoding comfortably. The RTX 5090 offers the best cost-to-performance ratio for this workload. Pair with private AI hosting to keep all image data within UK infrastructure.
Recommended Stack
- OpenCLIP or SigLIP for image encoding, running on PyTorch with CUDA acceleration.
- Qdrant or FAISS for vector similarity search, optimised for sub-10ms retrieval at 150K scale.
- FastAPI for the search microservice, accepting image uploads and returning ranked product IDs.
- Pillow + torchvision for image preprocessing (resize, normalise, centre-crop).
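To illustrate the preprocessing step, here is a minimal NumPy sketch of the centre-crop and channel-wise normalisation that torchvision's CLIP transforms perform. The resize step is omitted for brevity; the mean/std values are CLIP's published normalisation constants:

```python
import numpy as np

# CLIP's published per-channel RGB normalisation constants.
MEAN = np.array([0.48145466, 0.4578275, 0.40821073])
STD = np.array([0.26862954, 0.26130258, 0.27577711])

def preprocess(img: np.ndarray) -> np.ndarray:
    """Centre-crop an HxWx3 uint8 image to a square, then normalise.

    (A real pipeline resizes the crop to the model's input size, e.g. 224x224,
    before normalising; that step is skipped here.)
    """
    h, w, _ = img.shape
    s = min(h, w)
    top, left = (h - s) // 2, (w - s) // 2
    crop = img[top:top + s, left:left + s]
    x = crop.astype(np.float32) / 255.0   # scale 0-255 to 0-1
    return (x - MEAN) / STD               # channel-wise normalise
```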
For marketplaces wanting to combine visual and text search, CLIP natively supports cross-modal queries — a customer can type “gold bracelet with leaf pattern” and get results ranked by visual similarity to that description. Adding Stable Diffusion or ComfyUI enables generating AI mockups of custom pieces based on customer descriptions.
Cost Analysis
Third-party visual search APIs charge £0.003–£0.01 per query. At 800,000 monthly searches, that ranges from £2,400 to £8,000 per month. Self-hosting on a dedicated GPU eliminates per-query costs entirely, with a predictable monthly server fee and unlimited queries. The platform also gains the ability to fine-tune the model on their specific jewellery domain, improving accuracy for niche categories like Georgian mourning jewellery or Scandinavian modernist silver.
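The arithmetic behind those figures:

```python
monthly_queries = 800_000
per_query_low, per_query_high = 0.003, 0.01  # £ per query

monthly_low = monthly_queries * per_query_low    # £2,400
monthly_high = monthly_queries * per_query_high  # £8,000
print(f"£{monthly_low:,.0f} - £{monthly_high:,.0f} per month")
```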
Fine-tuning CLIP on domain-specific data typically improves retrieval accuracy by 20-35% compared to the generic pre-trained model. This customisation is impossible with most third-party APIs.
Getting Started
Begin by encoding your existing product image catalogue with a pre-trained CLIP model and loading vectors into Qdrant. Test retrieval quality by running your 100 most common customer-uploaded images through the system and evaluating whether the top 10 results contain visually relevant items. Fine-tune on your product domain if accuracy falls below target.
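Retrieval quality on that test set can be scored with a simple hit-rate@10. A minimal sketch, where `results` maps each test image to its ranked product IDs and `relevant` maps it to the IDs judged visually relevant (both names are hypothetical):

```python
def hit_rate_at_k(results: dict, relevant: dict, k: int = 10) -> float:
    """Fraction of test queries whose top-k results contain >=1 relevant item."""
    hits = sum(
        1
        for query, ranked in results.items()
        if any(pid in relevant[query] for pid in ranked[:k])
    )
    return hits / len(results)

# Toy example: query "a" hits within the top 10, query "b" does not.
results = {"a": ["p1", "p2", "p3"], "b": ["p4", "p5"]}
relevant = {"a": {"p2"}, "b": {"p9"}}
print(hit_rate_at_k(results, relevant))  # 0.5
```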
GigaGPU provides UK-based dedicated GPU servers configured for vision AI workloads, with full GDPR compliance. Add an AI chatbot for conversational product discovery, or connect an image generator to create product visualisations on demand. Deploy CLIP-based image search on private infrastructure today.
View Dedicated GPU Plans