Qwen Coder 32B rivals closed-source coding models on most benchmarks and runs comfortably on a single 32 GB or larger GPU. On our dedicated GPU hosting it is the default recommendation for teams building self-hosted coding assistants.
## VRAM
| Precision | Weights | Fits On |
|---|---|---|
| FP16 | ~64 GB | RTX 6000 Pro, multi-GPU |
| FP8 | ~32 GB | RTX 5090 (tight), RTX 6000 Pro (comfortable) |
| AWQ INT4 | ~18 GB | 24 GB (3090), 32 GB (5090), 96 GB (6000 Pro) |
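The weight figures above are roughly parameter count × bits per parameter ÷ 8. A quick sketch (the 32.8B parameter count and the ~4.5 effective bits for AWQ, which accounts for quantization scales and zero points, are assumptions; KV cache and activations come on top):

```python
def weight_vram_gb(params_b: float, bits_per_param: float) -> float:
    """Approximate weight memory in GB: parameters * bits / 8.

    Ignores KV cache, activations, and framework overhead.
    """
    return params_b * bits_per_param / 8

# FP16 = 16 bits, FP8 = 8 bits, AWQ INT4 ≈ 4.5 effective bits
print(round(weight_vram_gb(32.8, 16)))   # ≈ 66 GB, matching the ~64 GB row
print(round(weight_vram_gb(32.8, 8)))    # ≈ 33 GB
print(round(weight_vram_gb(32.8, 4.5)))  # ≈ 18 GB
```

This is why AWQ INT4 fits a 24 GB card only with context to spare for the KV cache, while FP16 needs 96 GB or a multi-GPU split.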
## GPU Options
- RTX 3090 24GB: AWQ INT4 with decent context. Budget pick.
- RTX 5090 32GB: AWQ INT4 or FP8 native. Best single-GPU speed.
- RTX 6000 Pro 96GB: FP16 native, very high concurrency.
- Intel Arc Pro B70 32GB: AWQ INT4 via OpenVINO/IPEX-LLM. Non-CUDA option.
## Deployment
On a 5090 with AWQ:
```shell
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-Coder-32B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92 \
  --enable-prefix-caching
```
A 32k context window matters for coding: large file edits and multi-file chains consume it quickly.
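Once the server is up, any OpenAI-compatible client can talk to it. A minimal sketch using only the standard library (`localhost:8000` is vLLM's default bind; the message content and sampling parameters are illustrative):

```python
import json
import urllib.request

# Chat completion request against the vLLM OpenAI-compatible endpoint.
payload = {
    "model": "Qwen/Qwen2.5-Coder-32B-Instruct-AWQ",
    "messages": [
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": "Write a Python function that parses an ISO 8601 date."},
    ],
    "max_tokens": 512,
    "temperature": 0.2,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# response = urllib.request.urlopen(req)  # uncomment with the server running
```

Swapping in the official `openai` Python client works the same way: point `base_url` at the server and use any placeholder API key.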
## Tool Use and Fill-in-Middle
Qwen Coder supports fill-in-middle (FIM) via special tokens. For IDE autocomplete use cases, configure the client to send FIM markers:
```
<|fim_prefix|>code before cursor<|fim_suffix|>code after cursor<|fim_middle|>
```
The model generates the middle span. Tool calling also works via the model's tool-use format; see our tool use guide for Qwen Coder.
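An autocomplete request is just a plain string assembled from those markers and sent to the completions endpoint. A sketch (the helper name and sample code are illustrative):

```python
def fim_prompt(prefix: str, suffix: str) -> str:
    """Wrap the text before and after the cursor in Qwen Coder's FIM tokens."""
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

# Cursor sits after "return " inside the function body.
prompt = fim_prompt(
    "def fib(n):\n    if n < 2:\n        return n\n    return ",
    "\n",
)
# Send `prompt` to /v1/completions; the model's completion is the missing middle.
```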
## Self-Hosted Coding Assistant
Preconfigured Qwen Coder 32B on any UK dedicated GPU that fits your budget.
Browse GPU Servers

Compare against Codestral 22B and StarCoder 2 15B for smaller-footprint coding models. For the full model comparison, see Qwen 2.5 72B deployment.