Benchmarks

Phi-3 Mini on RTX 4060: Performance Benchmark & Cost

Phi-3 Mini benchmarked on RTX 4060: 18 tok/s at 4-bit GGUF Q4_K_M, VRAM usage, cost per 1M tokens, and deployment configuration.

At just 3.8 billion parameters, Microsoft’s Phi-3 Mini punches well above its weight in reasoning tasks. But does the budget-friendly RTX 4060 give it enough room to stretch? We benchmarked the pairing on a GigaGPU dedicated server to find out.

Benchmark Results

Metric                      Value
Tokens/sec (single stream)  18 tok/s
Tokens/sec (batched, bs=8)  23.4 tok/s
Per-token latency           55.6 ms
Precision                   INT4
Quantisation                4-bit GGUF Q4_K_M
Max context length          4K
Performance rating          Good

Testing used single-stream generation with a 512-token prompt and 256-token completion via llama.cpp, running the Q4_K_M GGUF quantisation.
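If you want to reproduce the single-stream figure, you can read the decode speed straight out of the server's own timing data. A minimal sketch, assuming the llama.cpp server from the deployment command below is listening on localhost:8080 and your build includes the timings block in its /completion response (the prompt here is an arbitrary stand-in):

# Ask for 256 tokens and print the decode speed llama.cpp measured itself
curl -s http://localhost:8080/completion \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "Explain 4-bit quantisation in one paragraph.", "n_predict": 256}' \
  | python3 -c 'import json, sys; t = json.load(sys.stdin)["timings"]; print(round(t["predicted_per_second"], 1), "tok/s")'

Run it a few times and average; the first request is slower because it includes prompt processing from a cold start.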

Why 4-bit Quantisation Matters Here

The RTX 4060 ships with 8 GB of VRAM. Phi-3 Mini’s full FP16 weights occupy roughly 8 GB on their own, leaving nothing for KV cache or runtime overhead. Dropping to Q4_K_M cuts the weight footprint to about 3.1 GB, freeing up nearly 4.9 GB for context handling and concurrent sessions. That trade-off barely dents output quality for most inference tasks.

Component                          VRAM
Model weights (4-bit GGUF Q4_K_M)  3.1 GB
KV cache + runtime                 ~0.5 GB
Total RTX 4060 VRAM                8 GB
Free headroom                      ~4.4 GB
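You can verify these numbers against your own deployment: once the model is resident, nvidia-smi reports used and total memory per GPU.

# Show VRAM usage while the model is loaded
nvidia-smi --query-gpu=memory.used,memory.total --format=csv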

What Does It Cost?

Cost Metric             Value
Server cost             £0.35/hr (£69/mo)
Cost per 1M tokens      £5.40
Tokens per £1           185,151
Break-even vs £2/M API  ~1.15M tokens/day

Single-stream, the arithmetic is £0.35/hr ÷ (18 tok/s × 3,600 s/hr ÷ 1,000,000) ≈ £5.40 per million tokens. Batch eight requests together and the 23.4 tok/s aggregate throughput brings that down to roughly £4.15/M. Hosted API endpoints for models in this class charge £0.50–2.00+ per million tokens, so the raw per-token price only favours self-hosting at sustained volume; what the flat £69/mo buys is predictable spend, unmetered usage, and a dedicated GPU you can also use for fine-tuning and experiments. Check our tokens-per-second benchmark tool to compare across GPUs.
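The same arithmetic in one line, if you want to plug in your own hourly rate or a throughput figure from a different GPU (the values here are the ones from the tables above):

# Cost per million tokens = hourly rate / millions of tokens generated per hour
python3 -c "rate, tps = 0.35, 18; print(f'£{rate * 1e6 / (tps * 3600):.2f} per 1M tokens')"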

Who Should Use This Setup?

Eighteen tokens per second is fast enough for interactive chat during development, internal tools, or low-traffic customer-facing bots. It is not the right pick for high-concurrency production APIs — for that, step up to the RTX 3090. But for prototyping, fine-tuning experiments, or a staging environment, the 4060 keeps costs low without starving the model.

Get started in one command:

docker run --gpus all -v /path/to/models:/models -p 8080:8080 ghcr.io/ggerganov/llama.cpp:server -m /models/phi-3-mini.Q4_K_M.gguf --host 0.0.0.0 --port 8080 -c 4096 -ngl 99
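Once the container reports the model is loaded, a one-line smoke test confirms it answers. llama.cpp's server also exposes an OpenAI-compatible chat endpoint, so existing client code pointed at /v1/chat/completions works unchanged:

# Quick smoke test against the freshly started server
curl -s http://localhost:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages": [{"role": "user", "content": "Say hello in five words."}], "max_tokens": 32}'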

More configuration details live in our Phi-3 hosting guide. You might also want to read the best GPU for LLM inference roundup, browse all benchmarks, or see how the cheapest GPU options stack up.

Run Phi-3 Mini on an RTX 4060 Today

Flat-rate dedicated GPU server. UK datacentre, full root access, no metered billing surprises.

Configure Your Server
