
Run Phi-3 on a Dedicated GPU Server

Complete guide to running Microsoft Phi-3 on a dedicated GPU server. Covers GPU selection for all Phi-3 sizes, vLLM and Ollama setup, benchmarks, and deployment tips.

GPU Selection for Phi-3

Microsoft’s Phi-3 family includes Mini (3.8B), Small (7B), and Medium (14B) variants that punch well above their weight on reasoning benchmarks. Their compact size makes them an excellent fit for budget dedicated GPU servers. Here is the GPU mapping for Phi-3 hosting:

Phi-3 Variant        FP16 VRAM   INT4 VRAM   Recommended GPU
Phi-3 Mini (3.8B)    ~7.6 GB     ~2.8 GB     RTX 4060 (8 GB)
Phi-3 Small (7B)     ~14 GB      ~5.2 GB     RTX 4060 Ti (16 GB)
Phi-3 Medium (14B)   ~28 GB      ~9 GB       RTX 3090 (24 GB)
Phi-3 Mini (3.8B)    ~7.6 GB     ~2.8 GB     RTX 3050 (INT4 only)

Phi-3 Mini at FP16 fits entirely within an RTX 4060’s 8 GB, making it one of the cheapest models to self-host with strong benchmark performance. For a direct comparison with LLaMA 3, read our Phi-3 vs LLaMA 3 8B analysis.
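The FP16 figures in the table follow directly from parameter count: roughly 2 bytes per weight. A minimal sketch of that rule of thumb (the `estimate_vram_gb` helper and its `overhead` factor are illustrative, not part of any library; real usage also needs headroom for KV cache and activations, which is why the INT4 column sits above a bare weights-only estimate):

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: float = 16,
                     overhead: float = 1.0) -> float:
    """Weights-only VRAM estimate: params x bits / 8, times an
    optional multiplicative overhead for KV cache and activations."""
    weight_gb = params_billions * bits_per_weight / 8
    return round(weight_gb * overhead, 1)

print(estimate_vram_gb(3.8, 16))  # 7.6 -> Phi-3 Mini FP16, matches the table
print(estimate_vram_gb(14, 16))   # 28.0 -> Phi-3 Medium FP16
```

For quantised deployments, pass `bits_per_weight=4` and a larger `overhead` to account for quantisation scales and runtime buffers.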

Install and Serve with vLLM

# Install vLLM
pip install vllm

# Serve Phi-3 Mini
python -m vllm.entrypoints.openai.api_server \
  --model microsoft/Phi-3-mini-4k-instruct \
  --dtype float16 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90 \
  --port 8000

# Test the endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/Phi-3-mini-4k-instruct",
    "messages": [{"role": "user", "content": "Write a Python function to merge two sorted lists."}],
    "max_tokens": 512
  }'

For a comparison of serving frameworks, see our vLLM vs Ollama guide.
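Because vLLM exposes an OpenAI-compatible API, the same endpoint can be called from Python with nothing but the standard library. A sketch, assuming the server above is running on localhost:8000 (the `build_chat_request` helper is illustrative):

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 512) -> urllib.request.Request:
    """Assemble the OpenAI-style chat completion request that
    vLLM's api_server expects."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode(),
        headers={"Content-Type": "application/json"},
    )

req = build_chat_request("http://localhost:8000",
                         "microsoft/Phi-3-mini-4k-instruct",
                         "Write a Python function to merge two sorted lists.")

# With the server running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Swapping in the official `openai` client works the same way; point its `base_url` at the vLLM server and use any placeholder API key.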

Quick Start with Ollama

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Phi-3 Mini
ollama run phi3:mini

# Serve as API
ollama serve &
curl http://localhost:11434/api/generate \
  -d '{"model": "phi3:mini", "prompt": "Explain gradient descent in simple terms."}'
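Note that Ollama's `/api/generate` endpoint streams its answer as newline-delimited JSON: one object per line, each carrying a `response` fragment, with `"done": true` on the last line. A small sketch of reassembling the stream (the `join_stream` helper is illustrative; the sample lines are made up):

```python
import json

def join_stream(ndjson_lines):
    """Concatenate the 'response' fragments from Ollama's
    newline-delimited JSON stream into the full answer."""
    parts = []
    for line in ndjson_lines:
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

sample = [
    '{"response": "Gradient descent ", "done": false}',
    '{"response": "minimises a loss function.", "done": true}',
]
print(join_stream(sample))  # Gradient descent minimises a loss function.
```

Passing `"stream": false` in the request body returns a single JSON object instead, which is simpler for scripting.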

Performance Benchmarks

Benchmarked with vLLM, 512-token input, 256-token output. See the tokens-per-second benchmark tool for current data.

Model          GPU           Precision   Gen tok/s   TTFT
Phi-3 Mini     RTX 4060      FP16        118         92 ms
Phi-3 Mini     RTX 4060      AWQ 4-bit   162         68 ms
Phi-3 Mini     RTX 3090      FP16        135         78 ms
Phi-3 Small    RTX 4060 Ti   FP16        82          155 ms
Phi-3 Medium   RTX 3090      AWQ 4-bit   68          195 ms

Phi-3 Mini at FP16 delivers 118 tok/s on the RTX 4060, faster than most 7B models. With only 3.8B parameters there are fewer weights to stream from VRAM for each generated token, which translates directly into higher per-token speed.
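To reproduce these figures on your own server, separate the two metrics the table reports: TTFT is the delay before the first streamed token, and generation speed is output tokens divided by the remaining time. A sketch of that calculation (the `gen_tok_per_s` helper is a hypothetical name, not part of vLLM):

```python
def gen_tok_per_s(output_tokens: int, total_s: float, ttft_s: float) -> float:
    """Generation throughput, excluding time-to-first-token
    (prefill) from the measurement window."""
    return round(output_tokens / (total_s - ttft_s), 1)

# 256 output tokens in ~2.26 s total with a 92 ms TTFT
print(gen_tok_per_s(256, 2.2615, 0.092))  # 118.0
```

Measure `ttft_s` by timing the arrival of the first chunk of a streaming response, and `total_s` by timing until the stream closes.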

Optimisation Tips

  • Run at FP16 on 8 GB cards. Phi-3 Mini is small enough that quantisation is unnecessary on the RTX 4060.
  • Use the 128K context variant (Phi-3-mini-128k-instruct) for document analysis tasks, but keep actual context under 16K on 8 GB.
  • Pair with Whisper for voice-to-text-to-response pipelines. Both models fit on a single RTX 4060.
  • Use speculative decoding with Phi-3 Mini as a draft model for Phi-3 Medium to boost the larger model’s throughput.
  • Enable continuous batching to serve 4+ concurrent users at production quality on budget GPUs.
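The continuous-batching tip above only pays off if requests actually arrive concurrently. A minimal sketch of fanning out prompts from a client (the `fan_out` helper and the stub `send` callable are illustrative; in practice `send` would POST to the serving endpoint):

```python
from concurrent.futures import ThreadPoolExecutor

def fan_out(send, prompts, workers=4):
    """Issue prompts concurrently so the server's continuous
    batching can merge them into shared GPU batches."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order in the returned results
        return list(pool.map(send, prompts))

# Stub client showing the shape; replace with a real HTTP call.
replies = fan_out(lambda p: f"echo: {p}", ["a", "b", "c", "d"])
print(replies)
```

With vLLM, four in-flight requests like this typically cost far less than 4x the latency of one, since decoding steps are shared across the batch.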

Estimate running costs with the cost calculator. For the cheapest GPU options, see our budget GPU for AI inference guide.

Next Steps

Phi-3 is one of the most cost-effective models for self-hosting. If you need stronger multilingual support, consider Qwen 2.5. For larger-scale deployments, see our LLaMA hosting options. Browse all models in the model guides section.

Deploy Phi-3 Now

Run Phi-3 on a dedicated GPU server starting from just an RTX 4060. Full root access and UK data centre hosting.

Browse GPU Servers
