Cohere’s Command R, at 35B parameters, is tuned specifically for retrieval-augmented generation (RAG) and tool use. It handles long contexts well, and its grounded, citation-formatted responses often beat generic LLMs on accuracy. On our dedicated GPU hosting it fits a 32 GB or 96 GB card, depending on precision.
VRAM
| Precision | Weights |
|---|---|
| FP16 | ~70 GB |
| FP8 | ~35 GB |
| AWQ INT4 | ~20 GB |
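The table follows from simple arithmetic: parameter count times bytes per parameter. A quick sketch (weights only; real usage adds KV cache and runtime overhead, and AWQ keeps some layers at higher precision, which is why the table shows ~20 GB rather than the raw ~18 GB):

```python
# Back-of-envelope weight memory for a 35B-parameter model.
def weight_gb(params_b: float, bits_per_param: float) -> float:
    """Weight footprint in decimal GB for a given precision."""
    return params_b * 1e9 * bits_per_param / 8 / 1e9

for name, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]:
    print(f"{name}: ~{weight_gb(35, bits):.0f} GB")
```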
GPU Options
- RTX 3090 24GB: AWQ INT4 fits, but tight
- RTX 5090 32GB: AWQ INT4 with comfortable headroom
- RTX 6000 Pro 96GB: FP16 with real headroom
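Whether a card is "tight" or "comfortable" comes down to how much KV cache fits after the weights. A hedged estimate; the layer and head counts below are assumptions for illustration, so read the real values from the model's `config.json`:

```python
# KV-cache size per context length. layers/kv_heads/head_dim are assumed
# values for illustration — check the model's config.json for the real ones.
def kv_cache_gib(tokens: int, layers: int = 40, kv_heads: int = 64,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
    return tokens * per_token / 2**30

print(f"{kv_cache_gib(32768):.1f} GiB for a full 32k context")
```

Under these assumptions a full 32k context costs on the order of tens of GiB of cache, which is why the 24 GB card is tight and why capping `--max-model-len` matters.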
Deployment
```
python -m vllm.entrypoints.openai.api_server \
  --model CohereForAI/c4ai-command-r-v01 \
  --quantization awq \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --trust-remote-code
```
Command R supports up to 128k context. For RAG workloads, 32k is usually plenty and saves KV cache.
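The server exposes an OpenAI-compatible API (on port 8000 by default in vLLM). A minimal request body, assuming the launch command above:

```python
import json

# POST this to http://localhost:8000/v1/chat/completions — for example with
# curl or any OpenAI-compatible client. Endpoint and port are vLLM defaults.
payload = {
    "model": "CohereForAI/c4ai-command-r-v01",
    "messages": [{"role": "user", "content": "Summarise the retrieved documents."}],
    "max_tokens": 256,
    "temperature": 0.3,
}
print(json.dumps(payload, indent=2))
```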
RAG Format
Command R expects retrieved documents in a specific format with citation tokens. Use Cohere’s prompt template:
```
## Task and Context
You are an assistant...
## Documents
Document: 0
title: ...
text: ...
Document: 1
...
## Question
...
```
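A small helper to assemble retrieved documents into that layout. This is a sketch of the template shown above, not Cohere's official tokenizer-level formatting; adapt the field names to your pipeline:

```python
def build_rag_prompt(question: str, docs: list[dict]) -> str:
    """Format retrieved docs into the Command R RAG layout shown above."""
    parts = [
        "## Task and Context",
        "You are an assistant that answers using the documents below.",
        "## Documents",
    ]
    for i, doc in enumerate(docs):
        parts += [f"Document: {i}", f"title: {doc['title']}", f"text: {doc['text']}"]
    parts += ["## Question", question]
    return "\n".join(parts)

prompt = build_rag_prompt(
    "What port does the server use?",
    [{"title": "vLLM docs", "text": "The API server listens on port 8000 by default."}],
)
print(prompt)
```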
The model emits `<co: doc_id>` citation tokens inline. Strip them or render them as UI chips, depending on your product.
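Stripping can be a one-line regex. The exact token shape can vary between releases, so the pattern below assumes `<co: N>...</co: N>` pairs; inspect raw output from your deployment first:

```python
import re

# Remove inline <co: N>...</co: N> citation markers. The token format is an
# assumption — verify it against raw output from your Command R release.
CO_TAG = re.compile(r"</?co:\s*\d+>")

def strip_citations(text: str) -> str:
    return CO_TAG.sub("", text)

print(strip_citations("The port is <co: 0>8000</co: 0> by default."))
```

To render chips instead, capture the span between the opening and closing tags along with the document id, rather than deleting the tags.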
Self-Hosted RAG-Tuned LLM
Command R on UK dedicated GPUs preconfigured for RAG pipelines.
Browse GPU Servers
For the larger variant see Command R+ 104B. For alternative RAG-oriented models see Qwen 2.5 14B.