Command R+ is Cohere’s 104B-parameter flagship, the larger sibling of the 35B Command R. It demands substantial hardware, usually multiple GPUs, but delivers top-tier RAG quality with built-in tool use. On our dedicated GPU hosting it is the serious-RAG-backbone choice.
VRAM
| Precision | Weights |
|---|---|
| FP16 | ~208 GB |
| FP8 | ~104 GB |
| INT4 (GPTQ/AWQ) | ~62 GB |
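The weight figures follow directly from parameter count times bytes per parameter; a quick back-of-the-envelope check (the 104B count is from this page, the INT4 overhead explanation is an assumption about typical quantized checkpoints):

```python
def weight_gb(params_b: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB: billions of params x bytes each."""
    return params_b * bytes_per_param

PARAMS_B = 104  # Command R+ parameter count, in billions

print(f"FP16: ~{weight_gb(PARAMS_B, 2.0):.0f} GB")  # ~208 GB
print(f"FP8:  ~{weight_gb(PARAMS_B, 1.0):.0f} GB")  # ~104 GB
# INT4 is 0.5 bytes/param, ~52 GB raw; published INT4 checkpoints land
# closer to ~62 GB once quantization scales, zero-points, and layers
# usually left unquantized (embeddings, lm_head) are counted.
print(f"INT4: ~{weight_gb(PARAMS_B, 0.5):.0f} GB raw")
```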
Hardware Options
- Single RTX 6000 Pro 96GB: an INT4 quant fits with a modest KV cache. The cheapest path.
- Two 6000 Pros: FP8 runs comfortably with high concurrency. The best production option.
- Two 5090s in tensor parallel: an INT4 quant fits on paper (62 GB in 64 GB aggregate) but leaves very little room for KV cache. Tight.
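What "modest KV cache" means is easy to estimate: whatever VRAM remains after the weights divides by the per-token cache size. A sketch for the single-card case; the layer/head/dim values are illustrative placeholders, not confirmed Command R+ figures, so read the real numbers from the model's config.json:

```python
def kv_tokens(vram_gb, util, weights_gb, layers, kv_heads, head_dim, bytes_per_el):
    """How many tokens of KV cache fit after the weights are loaded."""
    headroom_bytes = (vram_gb * util - weights_gb) * 1e9
    # Per token: 2 tensors (K and V) per layer, kv_heads x head_dim each.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_el
    return int(headroom_bytes // per_token)

# Single RTX 6000 Pro (96 GB) with the ~62 GB INT4 quant, FP16 KV cache.
# layers/kv_heads/head_dim below are assumed values for illustration only.
budget = kv_tokens(vram_gb=96, util=0.93, weights_gb=62,
                   layers=64, kv_heads=8, head_dim=128, bytes_per_el=2)
print(budget)  # total tokens shared across all concurrent requests
```

The headroom is shared across every in-flight request, which is why the single-card option is "modest": a handful of long-context requests can exhaust it.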
Deployment
Single 6000 Pro with a 4-bit GPTQ quant:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model alpindale/c4ai-command-r-plus-GPTQ \
  --quantization gptq \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.93 \
  --trust-remote-code
```
Dual 6000 Pros with FP8:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model CohereForAI/c4ai-command-r-plus \
  --tensor-parallel-size 2 \
  --quantization fp8 \
  --max-model-len 32768 \
  --trust-remote-code
```
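Either launch exposes vLLM's OpenAI-compatible API (on port 8000 by default). A minimal stdlib-only client sketch; the localhost URL and the prompt are placeholders, and the model name must match whatever `--model` was passed at launch:

```python
import json
from urllib import request

def build_payload(model: str, question: str) -> dict:
    """Request body for the OpenAI-compatible /v1/chat/completions route."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "max_tokens": 256,
        "temperature": 0.3,
    }

def ask(base_url: str, model: str, question: str) -> str:
    req = request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(model, question)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Example (assumes the dual-card FP8 launch above is running):
# ask("http://localhost:8000", "CohereForAI/c4ai-command-r-plus",
#     "Summarise the attached passage with citations.")
```

Because the endpoint is OpenAI-compatible, any existing OpenAI SDK client also works by pointing its base URL at the server.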
When It Is Worth It
Command R+ is worth the hardware when:
- You need top-quality RAG accuracy and citation handling
- Your RAG workload is a core revenue-generating product
- You have budget for 6000 Pro class hardware
If you are building an internal tool or early-stage product, start with Command R 35B and step up if quality demands it.
Flagship RAG Hosting
Command R+ preconfigured on UK dedicated hardware matched to your workload.
Browse GPU Servers
Compare the single-card options: Llama 3.3 70B and Qwen 2.5 72B.