
Command R 35B Self-Hosted

Cohere's Command R is a 35B model tuned for RAG and tool use - self-hosting it gives you a capable RAG backbone on one dedicated GPU.

Cohere’s Command R, at 35B parameters, is tuned specifically for retrieval-augmented generation and tool use. It handles long contexts well, and its RAG-formatted responses often beat generic LLMs on citation accuracy. On our dedicated GPU hosting it fits a single 32 GB card at AWQ INT4, or a 96 GB card at FP16.


VRAM

Precision  | Weights
FP16       | ~70 GB
FP8        | ~35 GB
AWQ INT4   | ~20 GB
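The weight figures above follow a simple rule of thumb: parameter count × bytes per parameter. A quick sketch of that arithmetic (weights only; KV cache and activations add several GB on top, and real quantized checkpoints carry some format overhead):

```python
# Rough VRAM needed for model weights alone: params * bytes_per_param.
# KV cache and activation memory come on top of this figure.
PARAMS = 35e9  # Command R has ~35B parameters

BYTES_PER_PARAM = {
    "FP16": 2.0,
    "FP8": 1.0,
    "AWQ INT4": 0.5,  # 4-bit weights; real checkpoints add some overhead
}

for precision, nbytes in BYTES_PER_PARAM.items():
    gb = PARAMS * nbytes / 1e9
    print(f"{precision}: ~{gb:.0f} GB")
```

This is why FP16 needs a 96 GB card while the AWQ INT4 quant leaves headroom for KV cache on a 32 GB card.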

GPU Options

Deployment

# Note: --quantization awq requires AWQ-quantized weights. The base
# CohereForAI/c4ai-command-r-v01 checkpoint is FP16, so substitute an
# AWQ export here (or drop the flag and serve FP16 on a 96 GB card).
python -m vllm.entrypoints.openai.api_server \
  --model CohereForAI/c4ai-command-r-v01 \
  --quantization awq \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --trust-remote-code

Command R supports up to 128k context. For RAG workloads, 32k is usually plenty and keeps KV cache memory down.
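Once running, the server speaks the OpenAI-compatible chat API, so any OpenAI-style client works. A minimal stdlib-only sketch, assuming the server above is listening on localhost:8000 and the model name matches what vLLM registered:

```python
import json
import urllib.request

# vLLM's OpenAI-compatible chat endpoint (assumed local deployment).
API_URL = "http://localhost:8000/v1/chat/completions"

def build_request(question: str,
                  model: str = "CohereForAI/c4ai-command-r-v01") -> dict:
    """Build an OpenAI-style chat completion payload."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "max_tokens": 512,
        "temperature": 0.3,
    }

def ask(question: str) -> str:
    """POST a question to the server and return the assistant's reply."""
    payload = json.dumps(build_request(question)).encode()
    req = urllib.request.Request(
        API_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# Usage (with the server running):
# answer = ask("What is retrieval-augmented generation?")
```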

RAG Format

Command R expects retrieved documents in a specific format with citation tokens. Use Cohere’s prompt template:

## Task and Context
You are an assistant...

## Documents
Document: 0
title: ...
text: ...

Document: 1
...

## Question
...

The model emits <co: doc_id> citation tokens inline. Strip them or render them as UI chips depending on your product.
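Stripping the citation spans for plain-text display takes only a small regex. A sketch, assuming the tokens look like `<co: 0>…</co: 0>` (verify the exact tag format your checkpoint emits before relying on this):

```python
import re

# Matches opening and closing citation tags such as <co: 0> and </co: 0>,
# including multi-document tags like <co: 0, 2>, while keeping the cited text.
CITATION_TAG = re.compile(r"</?co:\s*\d+(?:,\s*\d+)*>")

def strip_citations(text: str) -> str:
    """Remove inline citation tags, leaving the grounded answer text."""
    return CITATION_TAG.sub("", text)

raw = "The capital is <co: 0>Paris</co: 0>, per the first document."
print(strip_citations(raw))  # The capital is Paris, per the first document.
```

To render citations as UI chips instead, use `re.finditer` on the same pattern and map each span back to its document ID rather than deleting it.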

Self-Hosted RAG-Tuned LLM

Command R on UK dedicated GPUs preconfigured for RAG pipelines.

Browse GPU Servers

For the larger variant see Command R+ 104B. For alternative RAG-oriented models see Qwen 2.5 14B.

Need a Dedicated GPU Server?

Deploy from RTX 3050 to RTX 5090. Full root access, NVMe storage, 1Gbps — UK datacenter.

Browse GPU Servers

admin

We benchmark, deploy, and optimise GPU infrastructure for AI workloads. All data in our guides comes from real-world testing on our UK-based dedicated GPU servers.
