
RTX 4090 24GB for Self-Hosted Coding Assistant: Qwen Coder 32B AWQ flagship for 5-20 engineers

Self-host Qwen 2.5 Coder 32B AWQ on the RTX 4090 24GB to replace Copilot or Cursor for a 5-20 engineer team - HumanEval 92.7, vLLM stack, capacity math.

Sending proprietary source code to third-party APIs is a non-starter for many teams, and Copilot Business at $19/seat or Cursor Business at $40/seat compounds quickly. The RTX 4090 24GB changes the maths. A single card hosts Qwen 2.5 Coder 32B in AWQ INT4 at HumanEval 92.7, fitting in roughly 18 GB with FP8 KV, and serves a small-to-medium engineering team from one UK GPU box for around £550/month flat. This guide covers the model lineup, the IDE integrations that replace Copilot or Cursor, capacity planning for 5/10/20-engineer teams, the vLLM launch recipes, and the production gotchas we’ve hit shipping these stacks.

Why self-host a coding assistant

Three forces push teams towards self-hosted coding assistants. First, IP exposure: many large codebases include either customer data references or trade-secret algorithms that legal will not approve for transmission to a third-party API. Second, fleet cost: Copilot Business at $19/seat for 20 engineers is $380/month and counting. Third, latency control: a UK-hosted endpoint is typically 60-90 ms TTFT versus 200-300 ms transatlantic, which dominates inline-completion feel. With Qwen 2.5 Coder 32B AWQ now matching Sonnet on most coding benchmarks, the quality argument for staying on the API has weakened.

Model lineup on a single 4090

You will deploy one or two models on the same 4090. The flagship is Qwen 2.5 Coder 32B AWQ for chat, refactor and longer reasoning. For inline FIM (fill-in-the-middle) you either run Qwen 2.5 Coder 7B alongside, or use the 14B as a single-model compromise.

| Role | Model | Quant | VRAM (weights + KV @ 8k) | Decode t/s | HumanEval |
| --- | --- | --- | --- | --- | --- |
| Flagship chat / refactor | Qwen 2.5 Coder 32B | AWQ INT4 | ~18 GB | 65 (single), 220 (batch 8) | 92.7 |
| All-rounder | Qwen 2.5 Coder 14B | AWQ INT4 | ~10 GB | 135 | ~88 |
| Inline FIM completion | Qwen 2.5 Coder 7B | FP8 | ~9 GB | 205 | ~85 |
| Embedding for code search | jina-code-embed-v2 | FP16 | ~1.4 GB | 5,000 docs/s | n/a |
| Alternative coder | DeepSeek Coder V2 16B | AWQ INT4 | ~10 GB | 120 | ~90 |

Two practical configurations: (a) Coder 32B AWQ alone for chat-led teams; (b) Coder 7B FP8 for inline plus Coder 32B on a second card for chat. The first scales to 12 engineers comfortably; the second scales past 30. See the Qwen Coder 32B model guide and Qwen Coder 14B guide for per-model deep dives.

Coding benchmarks

| Model | HumanEval | MBPP | MultiPL-E avg | LiveCodeBench |
| --- | --- | --- | --- | --- |
| Qwen 2.5 Coder 32B AWQ | 92.7 | 87.0 | 76.0 | ~37 |
| Qwen 2.5 Coder 14B AWQ | 88.0 | 82.0 | 72.0 | ~32 |
| Qwen 2.5 Coder 7B FP8 | 85.0 | 78.0 | 67.0 | ~27 |
| DeepSeek Coder V2 16B | 90.0 | 83.0 | 74.0 | ~33 |
| Claude 3.5 Sonnet (reference) | 92.0 | ~88 | ~78 | ~42 |
| GPT-4o (reference) | ~90 | ~85 | ~76 | ~38 |

Qwen 2.5 Coder 32B AWQ is statistically indistinguishable from Sonnet and GPT-4o on HumanEval, leads the self-hostable models on the multilingual coding average (Go, Rust, TypeScript, C++) thanks to Qwen’s diverse training corpus, and runs locally for £550/month flat. The only frontier model still meaningfully ahead is Sonnet on long agentic tasks (LiveCodeBench).

IDE plugins as Copilot / Cursor replacements

vLLM exposes an OpenAI-compatible endpoint, so every mainstream IDE plugin works without code changes. The dominant choices we ship:

  • Continue.dev (VS Code, JetBrains) — point Continue’s config.json at https://your-host:8000/v1 with model name qwen2.5-coder-32b-awq. Configure FIM to a separate 7B endpoint on a different port. This is the closest direct Copilot replacement.
  • Cursor (self-hosted backend mode) — Cursor’s “Custom Model” setting accepts an OpenAI base URL. Tab-completion can target the 7B endpoint, while Cmd-K refactor and chat target the 32B endpoint.
  • Cline / Roo Code agent loops — both work against the same OpenAI endpoint and support tool calling via the standard schema; Qwen 2.5 Coder is reliable at function calling.
  • Aider CLI — aider --openai-api-base https://your-host:8000/v1 --model qwen2.5-coder-32b-awq --weak-model qwen2.5-coder-7b.
  • Tabby for FIM-only — serve the 7B as the completion model with a thin Tabby wrapper for IDE distribution.
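Before wiring up any of these plugins, a one-minute smoke test against the endpoint saves debugging time in the IDE. The sketch below assumes the 32B launch recipe later in this guide (which serves the model under the name qwen2.5-coder-32b-awq) and uses the standard openai Python client; any OpenAI-compatible client behaves the same way.

from openai import OpenAI

# Same base URL and model name the IDE plugins use. The api_key only matters
# if the server (or the reverse proxy in front of it) enforces one.
client = OpenAI(base_url="https://your-host:8000/v1", api_key="change-me")

resp = client.chat.completions.create(
    model="qwen2.5-coder-32b-awq",
    messages=[{"role": "user", "content": "Write a Python function that parses an ISO 8601 timestamp."}],
    max_tokens=256,
    temperature=0.2,
)
print(resp.choices[0].message.content)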

Capacity for 5/10/20-engineer teams

Coding traffic is bursty. A typical engineer fires 2-6 requests per active hour for chat/refactor and ~30 short FIM completions per active hour. Across an 8-hour working day with 60% active utilisation, that is roughly 30 chat requests and 150 FIM completions per engineer-day. A chat request decodes ~400 tokens at 65 t/s on the 32B AWQ — about 6 seconds per request — and FIM decodes 64 tokens at 205 t/s on 7B — about 0.3 seconds per request.
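Re-running that arithmetic for the 12-engineer sweet spot shows why one card holds up: total decode time is small, and it is burst concurrency rather than daily volume that sizes the box. A back-of-envelope sketch using the illustrative figures above (decode only; prefill is extra, which is why prefix caching matters):

# Aggregate decode time for a 12-engineer day, using the per-engineer figures above.
engineers    = 12
chat_per_day = 30      # chat/refactor requests per engineer-day
fim_per_day  = 150     # inline FIM completions per engineer-day
chat_tokens  = 400     # decoded tokens per chat request (32B AWQ at ~65 t/s)
fim_tokens   = 64      # decoded tokens per FIM request (7B FP8 at ~205 t/s)

chat_seconds = engineers * chat_per_day * chat_tokens / 65
fim_seconds  = engineers * fim_per_day * fim_tokens / 205
working_day  = 8 * 3600

print(f"chat decode: {chat_seconds / 60:.0f} min/day ({chat_seconds / working_day:.0%} of an 8h day)")
print(f"FIM decode:  {fim_seconds / 60:.0f} min/day ({fim_seconds / working_day:.0%} of an 8h day)")
# Roughly 37 minutes (8%) of chat decode and 9 minutes (2%) of FIM decode per day.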

| Team size | Recommended stack | Cards | Peak concurrency | Headroom |
| --- | --- | --- | --- | --- |
| 5 engineers | Coder 32B AWQ only | 1x 4090 | ~3 active | Plenty |
| 10 engineers | Coder 32B AWQ only | 1x 4090 | ~5 active | Comfortable |
| 12 engineers (sweet spot) | Coder 32B AWQ only, chunked prefill, prefix caching | 1x 4090 | ~5-6 active | Tight at peaks |
| 20 engineers | Coder 32B AWQ + Coder 7B FIM | 2x 4090 | ~10 chat + 30 FIM | Comfortable |
| 30+ engineers | 2x Coder 32B + 1x Coder 7B FIM | 3x 4090 | ~20 chat + 60 FIM | Linear scale |

For the 12-engineer sweet spot, peak active concurrency rarely exceeds 5 because not all engineers are coding simultaneously. With --max-num-seqs 4 and prefix caching dominating at 70%+ hit rate (system prompts, repo header context), a single 4090 is sufficient. See concurrent users benchmark for the per-batch throughput curves.

vLLM deployment recipes

The flagship 32B AWQ launch:

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-Coder-32B-Instruct-AWQ \
  --served-model-name qwen2.5-coder-32b-awq \
  --quantization awq_marlin --kv-cache-dtype fp8 \
  --max-model-len 32768 --max-num-seqs 4 \
  --enable-prefix-caching --enable-chunked-prefill \
  --gpu-memory-utilization 0.95 --port 8000

The companion 7B FIM endpoint (on a second card or second port if memory allows):

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-Coder-7B-Instruct \
  --served-model-name qwen2.5-coder-7b \
  --quantization fp8 --kv-cache-dtype fp8 \
  --max-model-len 4096 --max-num-seqs 32 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.40 --port 8001

Note: dual-loading both 32B and 7B on the same card requires gpu-memory-utilization 0.55 on the 32B and 0.40 on the 7B and is typically too tight for production traffic — use a second 4090 for FIM beyond a 10-engineer team. See the vLLM setup guide, AWQ guide, and FP8 deployment for image build details.
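For completeness, an inline completion is just a plain completions call against the 7B endpoint built from Qwen's published fill-in-the-middle special tokens; plugins such as Continue and Tabby assemble this prompt for you. A minimal sketch (the prefix and suffix strings are illustrative):

from openai import OpenAI

fim = OpenAI(base_url="https://your-host:8001/v1", api_key="change-me")

prefix = "def load_config(path: str) -> dict:\n    "
suffix = "\n    return cfg\n"

resp = fim.completions.create(
    model="qwen2.5-coder-7b",
    # Qwen 2.5 Coder FIM template: give prefix and suffix, the model fills the middle.
    prompt=f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>",
    max_tokens=64,
    temperature=0.2,
)
print(resp.choices[0].text)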

Cost vs Copilot / Cursor

| Team size | Copilot Business ($19/seat) | Cursor Business ($40/seat) | Self-host on 4090 | Saving vs Cursor (1 yr) |
| --- | --- | --- | --- | --- |
| 5 engineers | $95/mo | $200/mo | £550/mo (~$700) | self-host loses |
| 10 engineers | $190/mo | $400/mo | £550/mo | break-even on Cursor |
| 12 engineers | $228/mo | $480/mo | £550/mo | self-host wins on Cursor first month |
| 20 engineers | $380/mo | $800/mo | £1,100/mo (2 cards) | ~£3,500/yr saved vs Cursor |
| 30 engineers | $570/mo | $1,200/mo | £1,650/mo (3 cards) | ~£8,000/yr saved vs Cursor |

The pure-cash break-even against Cursor lands at roughly 12 engineers; against Copilot it lands closer to 30 engineers. The non-cash factors — IP control, latency, model choice freedom — usually tip the decision earlier than that, particularly for regulated industries.

Production gotchas and verdict

  1. Prefix caching is mandatory. Most coding requests reuse the same system prompt and large repository-header context. Without prefix caching the 32B AWQ saturates the GPU at much lower team sizes.
  2. Long-file context needs chunked prefill. A request including a 10k-token file will block other tenants on a single GPU unless --enable-chunked-prefill is set. Without it, FIM latency spikes.
  3. FIM and chat in one model is a compromise. The 32B is too slow for inline; the 7B is too weak for refactor. Either accept the compromise (one 14B) or pay for two cards.
  4. Tool-calling JSON brittleness. Qwen 2.5 Coder is reliable at JSON tools, but quoting in code-heavy tool arguments can break naive parsers. Use guided decoding (the xgrammar backend) for production agents; a minimal sketch follows this list.
  5. Code-search embeddings need a separate process. Don’t co-locate the embedding model in vLLM — use a dedicated small server (text-embeddings-inference) so embedding bursts don’t preempt LLM tokens.
  6. Auth in front of the endpoint. vLLM has only basic API-key support. Front it with an authenticating reverse proxy (Caddy or Traefik) bound to your SSO.
  7. AWQ marlin kernels need CUDA 12.4+. Base images on CUDA 12.1 will silently fall back to slow paths and you will wonder why the 32B runs at 35 t/s instead of 65.
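For gotcha 4, vLLM's OpenAI-compatible server accepts guided-decoding parameters through the client's extra_body, so tool arguments can be constrained to a JSON schema instead of hoping the quoting survives. A minimal sketch, assuming the 32B endpoint above and a vLLM build with the xgrammar backend available; the write_file schema is illustrative:

from openai import OpenAI

client = OpenAI(base_url="https://your-host:8000/v1", api_key="change-me")

# Hypothetical write_file tool arguments. Code-heavy "content" strings are the
# usual culprit for broken JSON, so force the output to match this schema.
schema = {
    "type": "object",
    "properties": {
        "path": {"type": "string"},
        "content": {"type": "string"},
    },
    "required": ["path", "content"],
}

resp = client.chat.completions.create(
    model="qwen2.5-coder-32b-awq",
    messages=[{"role": "user", "content": "Call write_file to add a minimal Makefile."}],
    extra_body={"guided_json": schema},
    max_tokens=512,
)
print(resp.choices[0].message.content)  # parses as JSON matching the schema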

Verdict. For any team of 8+ engineers serious about IP control, a single 4090 with Qwen 2.5 Coder 32B AWQ is the most cost-effective coding assistant you can deploy in 2026. It matches Sonnet on HumanEval, lives in your VLAN, and pays back against Cursor in the first month at 12 seats. Pair with a Coder 7B FIM endpoint on a second card once you cross 15 active developers.

Self-host your coding assistant in the UK

Frontier-class HumanEval, never leaves your VLAN, £550/mo flat. UK dedicated hosting.

Order the RTX 4090 24GB

See also: Qwen Coder 32B on 4090, Qwen Coder 14B, Qwen 32B benchmark, AWQ guide, vLLM setup, ROI analysis, 5060 Ti coding assistant.
