
RTX 4090 24GB for Self-Hosted Coding Assistant: Qwen Coder 32B AWQ flagship for 5-20 engineers

Self-host Qwen 2.5 Coder 32B AWQ on the RTX 4090 24GB to replace Copilot or Cursor for a 5-20 engineer team - HumanEval 92.7, vLLM stack, capacity math.

Sending proprietary source code to third-party APIs is a non-starter for many teams, and Copilot Business at $19/seat or Cursor Business at $40/seat compounds quickly. The RTX 4090 24GB changes the maths. A single card hosts Qwen 2.5 Coder 32B in AWQ INT4 at HumanEval 92.7, fitting in roughly 18 GB with FP8 KV, and serves a small-to-medium engineering team from one UK GPU box for around £550/month flat. This guide covers the model lineup, the IDE integrations that replace Copilot or Cursor, capacity planning for 5/10/20-engineer teams, the vLLM launch recipes, and the production gotchas we’ve hit shipping these stacks.

Why self-host a coding assistant

Three forces push teams towards self-hosted coding assistants. First, IP exposure: many large codebases include either customer data references or trade-secret algorithms that legal will not approve for transmission to a third-party API. Second, fleet cost: Copilot Business at $19/seat for 20 engineers is $380/month and counting. Third, latency control: a UK-hosted endpoint is typically 60-90 ms TTFT versus 200-300 ms transatlantic, which dominates inline-completion feel. With Qwen 2.5 Coder 32B AWQ now matching Sonnet on most coding benchmarks, the quality argument for staying on the API has weakened.

Model lineup on a single 4090

You will deploy one or two models on the same 4090. The flagship is Qwen 2.5 Coder 32B AWQ for chat, refactor and longer reasoning. For inline FIM (fill-in-the-middle) you either run Qwen 2.5 Coder 7B alongside, or use the 14B as a single-model compromise.

| Role | Model | Quant | VRAM (weights + KV @ 8k) | Decode t/s | HumanEval |
| --- | --- | --- | --- | --- | --- |
| Flagship chat / refactor | Qwen 2.5 Coder 32B | AWQ INT4 | ~18 GB | 65 (single), 220 (batch 8) | 92.7 |
| All-rounder | Qwen 2.5 Coder 14B | AWQ INT4 | ~10 GB | 135 | ~88 |
| Inline FIM completion | Qwen 2.5 Coder 7B | FP8 | ~9 GB | 205 | ~85 |
| Embedding for code search | jina-code-embed-v2 | FP16 | ~1.4 GB | 5,000 docs/s | n/a |
| Alternative coder | DeepSeek Coder V2 16B | AWQ INT4 | ~10 GB | 120 | ~90 |

Two practical configurations: (a) Coder 32B AWQ alone for chat-led teams; (b) Coder 7B FP8 for inline plus Coder 32B on a second card for chat. The first scales to 12 engineers comfortably; the second scales past 30. See the Qwen Coder 32B model guide and Qwen Coder 14B guide for per-model deep dives.

Coding benchmarks

| Model | HumanEval | MBPP | MultiPL-E avg | LiveCodeBench |
| --- | --- | --- | --- | --- |
| Qwen 2.5 Coder 32B AWQ | 92.7 | 87.0 | 76.0 | ~37 |
| Qwen 2.5 Coder 14B AWQ | 88.0 | 82.0 | 72.0 | ~32 |
| Qwen 2.5 Coder 7B FP8 | 85.0 | 78.0 | 67.0 | ~27 |
| DeepSeek Coder V2 16B | 90.0 | 83.0 | 74.0 | ~33 |
| Claude 3.5 Sonnet (reference) | 92.0 | ~88 | ~78 | ~42 |
| GPT-4o (reference) | ~90 | ~85 | ~76 | ~38 |

Qwen 2.5 Coder 32B AWQ is statistically indistinguishable from Sonnet and GPT-4o on HumanEval, leads the self-hostable models on the multilingual coding average (Go, Rust, TypeScript, C++) thanks to Qwen’s diverse training corpus, and runs locally for £550/month flat. The only frontier model still meaningfully ahead is Sonnet on long agentic tasks (LiveCodeBench).

IDE plugins as Copilot / Cursor replacements

vLLM exposes an OpenAI-compatible endpoint, so every mainstream IDE plugin works without code changes. The dominant choices we ship:

  • Continue.dev (VS Code, JetBrains) — point Continue’s config.json at https://your-host:8000/v1 with model name qwen2.5-coder-32b-awq. Configure FIM to a separate 7B endpoint on a different port. This is the closest direct Copilot replacement.
  • Cursor (self-hosted backend mode) — Cursor’s “Custom Model” setting accepts an OpenAI base URL. Tab-completion can target the 7B endpoint, while Cmd-K refactor and chat target the 32B endpoint.
  • Cline / Roo Code agent loops — both work against the same OpenAI endpoint and support tool calling via the standard schema; Qwen 2.5 Coder is reliable at function calling.
  • Aider CLI — aider --openai-api-base https://your-host:8000/v1 --model qwen2.5-coder-32b-awq --weak-model qwen2.5-coder-7b.
  • Tabby for FIM-only — serve the 7B as the completion model with a thin Tabby wrapper for IDE distribution.
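Before wiring up any of these plugins, a one-minute smoke test against the endpoint saves debugging time in the IDE. The sketch below assumes the 32B launch recipe later in this guide (which serves the model under the name qwen2.5-coder-32b-awq) and uses the standard openai Python client; any OpenAI-compatible client behaves the same way.

from openai import OpenAI

# Same base URL and model name the IDE plugins use. The api_key only matters
# if the server (or the reverse proxy in front of it) enforces one.
client = OpenAI(base_url="https://your-host:8000/v1", api_key="change-me")

resp = client.chat.completions.create(
    model="qwen2.5-coder-32b-awq",
    messages=[{"role": "user", "content": "Write a Python function that parses an ISO 8601 timestamp."}],
    max_tokens=256,
    temperature=0.2,
)
print(resp.choices[0].message.content)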

Capacity for 5/10/20-engineer teams

Coding traffic is bursty. A typical engineer fires 2-6 requests per active hour for chat/refactor and ~30 short FIM completions per active hour. Across an 8-hour working day with 60% active utilisation, that is roughly 30 chat requests and 150 FIM completions per engineer-day. A chat request decodes ~400 tokens at 65 t/s on the 32B AWQ — about 6 seconds per request — and FIM decodes 64 tokens at 205 t/s on 7B — about 0.3 seconds per request.
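Re-running that arithmetic for the 12-engineer sweet spot shows why one card holds up: total decode time is small, and it is burst concurrency rather than daily volume that sizes the box. A back-of-envelope sketch using the illustrative figures above (decode only; prefill is extra, which is why prefix caching matters):

# Aggregate decode time for a 12-engineer day, using the per-engineer figures above.
engineers    = 12
chat_per_day = 30      # chat/refactor requests per engineer-day
fim_per_day  = 150     # inline FIM completions per engineer-day
chat_tokens  = 400     # decoded tokens per chat request (32B AWQ at ~65 t/s)
fim_tokens   = 64      # decoded tokens per FIM request (7B FP8 at ~205 t/s)

chat_seconds = engineers * chat_per_day * chat_tokens / 65
fim_seconds  = engineers * fim_per_day * fim_tokens / 205
working_day  = 8 * 3600

print(f"chat decode: {chat_seconds / 60:.0f} min/day ({chat_seconds / working_day:.0%} of an 8h day)")
print(f"FIM decode:  {fim_seconds / 60:.0f} min/day ({fim_seconds / working_day:.0%} of an 8h day)")
# Roughly 37 minutes (8%) of chat decode and 9 minutes (2%) of FIM decode per day.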

| Team size | Recommended stack | Cards | Peak concurrency | Headroom |
| --- | --- | --- | --- | --- |
| 5 engineers | Coder 32B AWQ only | 1x 4090 | ~3 active | Plenty |
| 10 engineers | Coder 32B AWQ only | 1x 4090 | ~5 active | Comfortable |
| 12 engineers (sweet spot) | Coder 32B AWQ only, chunked prefill, prefix caching | 1x 4090 | ~5-6 active | Tight at peaks |
| 20 engineers | Coder 32B AWQ + Coder 7B FIM | 2x 4090 | ~10 chat + 30 FIM | Comfortable |
| 30+ engineers | 2x Coder 32B + 1x Coder 7B FIM | 3x 4090 | ~20 chat + 60 FIM | Linear scale |

For the 12-engineer sweet spot, peak active concurrency rarely exceeds 5 because not all engineers are coding simultaneously. With --max-num-seqs 4 and prefix caching dominating at 70%+ hit rate (system prompts, repo header context), a single 4090 is sufficient. See concurrent users benchmark for the per-batch throughput curves.

vLLM deployment recipes

The flagship 32B AWQ launch:

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-Coder-32B-Instruct-AWQ \
  --served-model-name qwen2.5-coder-32b-awq \
  --quantization awq_marlin --kv-cache-dtype fp8 \
  --max-model-len 32768 --max-num-seqs 4 \
  --enable-prefix-caching --enable-chunked-prefill \
  --gpu-memory-utilization 0.95 --port 8000

The companion 7B FIM endpoint (on a second card or second port if memory allows):

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-Coder-7B-Instruct \
  --served-model-name qwen2.5-coder-7b \
  --quantization fp8 --kv-cache-dtype fp8 \
  --max-model-len 4096 --max-num-seqs 32 \
  --enable-prefix-caching \
  --gpu-memory-utilization 0.40 --port 8001

Note: dual-loading both 32B and 7B on the same card requires gpu-memory-utilization 0.55 on the 32B and 0.40 on the 7B and is typically too tight for production traffic — use a second 4090 for FIM beyond a 10-engineer team. See the vLLM setup guide, AWQ guide, and FP8 deployment for image build details.
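For completeness, an inline completion is just a plain completions call against the 7B endpoint built from Qwen's published fill-in-the-middle special tokens; plugins such as Continue and Tabby assemble this prompt for you. A minimal sketch (the prefix and suffix strings are illustrative):

from openai import OpenAI

fim = OpenAI(base_url="https://your-host:8001/v1", api_key="change-me")

prefix = "def load_config(path: str) -> dict:\n    "
suffix = "\n    return cfg\n"

resp = fim.completions.create(
    model="qwen2.5-coder-7b",
    # Qwen 2.5 Coder FIM template: give prefix and suffix, the model fills the middle.
    prompt=f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>",
    max_tokens=64,
    temperature=0.2,
)
print(resp.choices[0].text)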

Cost vs Copilot / Cursor

| Team size | Copilot Business ($19/seat) | Cursor Business ($40/seat) | Self-host on 4090 | Saving vs Cursor (1 yr) |
| --- | --- | --- | --- | --- |
| 5 engineers | $95/mo | $200/mo | £550/mo (~$700) | self-host loses |
| 10 engineers | $190/mo | $400/mo | £550/mo | break-even on Cursor |
| 12 engineers | $228/mo | $480/mo | £550/mo | self-host wins on Cursor first month |
| 20 engineers | $380/mo | $800/mo | £1,100/mo (2 cards) | ~£3,500/yr saved vs Cursor |
| 30 engineers | $570/mo | $1,200/mo | £1,650/mo (3 cards) | ~£8,000/yr saved vs Cursor |

The pure-cash break-even against Cursor lands at roughly 12 engineers; against Copilot it lands closer to 30 engineers. The non-cash factors — IP control, latency, model choice freedom — usually tip the decision earlier than that, particularly for regulated industries.

Production gotchas and verdict

  1. Prefix caching is mandatory. Most coding requests reuse the same system prompt and large repository-header context. Without prefix caching the 32B AWQ saturates the GPU at much lower team sizes.
  2. Long-file context needs chunked prefill. A request including a 10k-token file will block other tenants on a single GPU unless --enable-chunked-prefill is set. Without it, FIM latency spikes.
  3. FIM and chat in one model is a compromise. The 32B is too slow for inline; the 7B is too weak for refactor. Either accept the compromise (one 14B) or pay for two cards.
  4. Tool-calling JSON brittleness. Qwen 2.5 Coder is reliable at JSON tools, but quoting in code-heavy tool arguments can break naive parsers. Use guided decoding (the xgrammar backend) for production agents; a minimal sketch follows this list.
  5. Code-search embeddings need a separate process. Don’t co-locate the embedding model in vLLM — use a dedicated small server (text-embeddings-inference) so embedding bursts don’t preempt LLM tokens.
  6. Auth in front of the endpoint. vLLM has only basic API-key support. Front it with an authenticating reverse proxy (Caddy or Traefik) bound to your SSO.
  7. AWQ marlin kernels need CUDA 12.4+. Base images on CUDA 12.1 will silently fall back to slow paths and you will wonder why the 32B runs at 35 t/s instead of 65.
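For gotcha 4, vLLM's OpenAI-compatible server accepts guided-decoding parameters through the client's extra_body, so tool arguments can be constrained to a JSON schema instead of hoping the quoting survives. A minimal sketch, assuming the 32B endpoint above and a vLLM build with the xgrammar backend available; the write_file schema is illustrative:

from openai import OpenAI

client = OpenAI(base_url="https://your-host:8000/v1", api_key="change-me")

# Hypothetical write_file tool arguments. Code-heavy "content" strings are the
# usual culprit for broken JSON, so force the output to match this schema.
schema = {
    "type": "object",
    "properties": {
        "path": {"type": "string"},
        "content": {"type": "string"},
    },
    "required": ["path", "content"],
}

resp = client.chat.completions.create(
    model="qwen2.5-coder-32b-awq",
    messages=[{"role": "user", "content": "Call write_file to add a minimal Makefile."}],
    extra_body={"guided_json": schema},
    max_tokens=512,
)
print(resp.choices[0].message.content)  # parses as JSON matching the schema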

Verdict. For any team of 8+ engineers serious about IP control, a single 4090 with Qwen 2.5 Coder 32B AWQ is the most cost-effective coding assistant you can deploy in 2026. It matches Sonnet on HumanEval, lives in your VLAN, and pays back against Cursor in the first month at 12 seats. Pair with a Coder 7B FIM endpoint on a second card once you cross 15 active developers.

Self-host your coding assistant in the UK

Frontier-class HumanEval, never leaves your VLAN, £550/mo flat. UK dedicated hosting.

Order the RTX 4090 24GB

See also: Qwen Coder 32B on 4090, Qwen Coder 14B, Qwen 32B benchmark, AWQ guide, vLLM setup, ROI analysis, 5060 Ti coding assistant.
