
Command R+ 104B Deployment

Cohere's flagship open-weights 104B RAG model needs serious hardware. Here is what it takes to host it on dedicated GPUs.

Command R+ is Cohere’s 104B-parameter flagship, the larger sibling of Command R 35B. It requires substantial hardware, usually multiple GPUs, but delivers top-tier RAG quality with tool use out of the box. On dedicated GPU hosting it is the backbone choice for serious RAG workloads.

VRAM

Precision   Weights
FP16        ~208 GB
FP8         ~104 GB
AWQ INT4    ~62 GB
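These figures follow from simple arithmetic: 104B parameters multiplied by bytes per weight. A quick sanity check (the pure INT4 figure comes out lower than the table's ~62 GB because AWQ keeps some layers at higher precision):

```shell
# Weight memory = parameter count x bytes per weight, ignoring quantisation overhead
python3 -c "p=104e9; print(f'FP16: {p*2/1e9:.0f} GB, FP8: {p*1/1e9:.0f} GB, INT4: {p*0.5/1e9:.0f} GB')"
```

Remember this is weights only; KV cache and activation memory come on top, which is why the table's precisions are paired with 96 GB-class cards below.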

Hardware Options

  • Single RTX 6000 Pro 96GB: AWQ INT4 fits with a modest KV cache. The cheapest path.
  • Two 6000 Pros: FP8 runs comfortably with headroom for high concurrency. Best production option.
  • Two 5090s tensor parallel: AWQ INT4 fits (62 GB of weights in 64 GB aggregate), leaving very little for KV cache. Tight.
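For the multi-GPU options, it is worth confirming the aggregate VRAM the server will actually see before launching. One quick check with nvidia-smi (assuming the NVIDIA driver is installed):

```shell
# Sum total VRAM (MiB) across all visible GPUs
nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits \
  | awk '{sum+=$1} END {print sum " MiB total"}'
```

Two 5090s should report roughly 65,000 MiB; if the number is lower, check CUDA_VISIBLE_DEVICES before blaming vLLM for out-of-memory errors.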

Deployment

Single 6000 Pro with a 4-bit GPTQ quant (comparable footprint to AWQ INT4):

python -m vllm.entrypoints.openai.api_server \
  --model alpindale/c4ai-command-r-plus-GPTQ \
  --quantization gptq \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.93 \
  --trust-remote-code

Dual 6000 Pros with FP8:

python -m vllm.entrypoints.openai.api_server \
  --model CohereForAI/c4ai-command-r-plus \
  --tensor-parallel-size 2 \
  --quantization fp8 \
  --max-model-len 32768 \
  --trust-remote-code
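Either launch exposes an OpenAI-compatible API, by default on port 8000. Once the weights have finished loading, a quick smoke test with curl (model name matching the dual-GPU launch above):

```shell
# Smoke-test the vLLM OpenAI-compatible endpoint on its default port
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "CohereForAI/c4ai-command-r-plus",
    "messages": [{"role": "user", "content": "In one sentence, what is retrieval-augmented generation?"}],
    "max_tokens": 128
  }'
```

A JSON response with a populated choices array confirms the server is serving; anything else usually means the model is still loading or ran out of VRAM during startup.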

When It Is Worth It

Command R+ is worth the hardware when:

  • You need top-quality RAG accuracy and citation handling
  • Your RAG workload is a core revenue-generating product
  • You have budget for 6000 Pro class hardware

If you are building an internal tool or early-stage product, start with Command R 35B and step up if quality demands it.

Flagship RAG Hosting

Command R+ preconfigured on UK dedicated hardware matched to your workload.

Browse GPU Servers

Compare the single-card options: Llama 3.3 70B and Qwen 2.5 72B.


