Cohere’s Command R, at 35B parameters, is tuned specifically for retrieval-augmented generation (RAG) and tool use. It handles long contexts well, and its grounded, citation-formatted responses often beat generic LLMs on accuracy. On our dedicated GPU hosting it fits a 32 GB or 96 GB card, depending on precision.
VRAM
| Precision | Weights |
|---|---|
| FP16 | ~70 GB |
| FP8 | ~35 GB |
| AWQ INT4 | ~20 GB |
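The table follows from simple arithmetic: parameter count times bytes per parameter. A quick sketch (weights only; real usage adds KV cache and runtime overhead, and AWQ keeps some layers at higher precision, which is why the table shows ~20 GB rather than the raw ~18 GB):

```python
# Back-of-envelope weight memory for a 35B-parameter model.
def weight_gb(params_b: float, bits_per_param: float) -> float:
    """Weight footprint in decimal GB for a given precision."""
    return params_b * 1e9 * bits_per_param / 8 / 1e9

for name, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]:
    print(f"{name}: ~{weight_gb(35, bits):.0f} GB")
```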
GPU Options
- RTX 3090 24GB: AWQ INT4 fits, but tight
- RTX 5090 32GB: AWQ INT4 with comfortable headroom
- RTX 6000 Pro 96GB: FP16 with real headroom
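Whether a card is "tight" or "comfortable" comes down to how much KV cache fits after the weights. A hedged estimate; the layer and head counts below are assumptions for illustration, so read the real values from the model's `config.json`:

```python
# KV-cache size per context length. layers/kv_heads/head_dim are assumed
# values for illustration — check the model's config.json for the real ones.
def kv_cache_gib(tokens: int, layers: int = 40, kv_heads: int = 64,
                 head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
    return tokens * per_token / 2**30

print(f"{kv_cache_gib(32768):.1f} GiB for a full 32k context")
```

Under these assumptions a full 32k context costs on the order of tens of GiB of cache, which is why the 24 GB card is tight and why capping `--max-model-len` matters.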
Deployment
```
python -m vllm.entrypoints.openai.api_server \
  --model CohereForAI/c4ai-command-r-v01 \
  --quantization awq \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --trust-remote-code
```
Command R supports up to 128k context. For RAG workloads, 32k is usually plenty and saves KV cache.
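The server exposes an OpenAI-compatible API (on port 8000 by default in vLLM). A minimal request body, assuming the launch command above:

```python
import json

# POST this to http://localhost:8000/v1/chat/completions — for example with
# curl or any OpenAI-compatible client. Endpoint and port are vLLM defaults.
payload = {
    "model": "CohereForAI/c4ai-command-r-v01",
    "messages": [{"role": "user", "content": "Summarise the retrieved documents."}],
    "max_tokens": 256,
    "temperature": 0.3,
}
print(json.dumps(payload, indent=2))
```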
RAG Format
Command R expects retrieved documents in a specific format with citation tokens. Use Cohere’s prompt template:
```
## Task and Context
You are an assistant...
## Documents
Document: 0
title: ...
text: ...
Document: 1
...
## Question
...
```
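A small helper to assemble retrieved documents into that layout. This is a sketch of the template shown above, not Cohere's official tokenizer-level formatting; adapt the field names to your pipeline:

```python
def build_rag_prompt(question: str, docs: list[dict]) -> str:
    """Format retrieved docs into the Command R RAG layout shown above."""
    parts = [
        "## Task and Context",
        "You are an assistant that answers using the documents below.",
        "## Documents",
    ]
    for i, doc in enumerate(docs):
        parts += [f"Document: {i}", f"title: {doc['title']}", f"text: {doc['text']}"]
    parts += ["## Question", question]
    return "\n".join(parts)

prompt = build_rag_prompt(
    "What port does the server use?",
    [{"title": "vLLM docs", "text": "The API server listens on port 8000 by default."}],
)
print(prompt)
```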
The model emits `<co: doc_id>` citation tokens inline. Strip them or render them as UI chips, depending on your product.
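Stripping can be a one-line regex. The exact token shape can vary between releases, so the pattern below assumes `<co: N>...</co: N>` pairs; inspect raw output from your deployment first:

```python
import re

# Remove inline <co: N>...</co: N> citation markers. The token format is an
# assumption — verify it against raw output from your Command R release.
CO_TAG = re.compile(r"</?co:\s*\d+>")

def strip_citations(text: str) -> str:
    return CO_TAG.sub("", text)

print(strip_citations("The port is <co: 0>8000</co: 0> by default."))
```

To render chips instead, capture the span between the opening and closing tags along with the document id, rather than deleting the tags.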
Self-Hosted RAG-Tuned LLM
Command R on UK dedicated GPUs preconfigured for RAG pipelines.
Browse GPU Servers
For the larger variant see Command R+ 104B. For alternative RAG-oriented models see Qwen 2.5 14B.