GPU Selection for Phi-3
Microsoft’s Phi-3 family includes Mini (3.8B), Small (7B), and Medium (14B) variants that punch well above their weight on reasoning benchmarks. Their compact size makes them an excellent fit for budget dedicated GPU servers. Here is the GPU mapping for Phi-3 hosting:
| Phi-3 Variant | FP16 VRAM | INT4 VRAM | Recommended GPU |
|---|---|---|---|
| Phi-3 Mini (3.8B) | ~7.6 GB | ~2.8 GB | RTX 4060 (8 GB) |
| Phi-3 Mini (3.8B) | ~7.6 GB | ~2.8 GB | RTX 3050 (8 GB, INT4 only) |
| Phi-3 Small (7B) | ~14 GB | ~5.2 GB | RTX 4060 Ti (16 GB) |
| Phi-3 Medium (14B) | ~28 GB | ~9 GB | RTX 3090 (24 GB; INT4, since FP16 needs ~28 GB) |
Phi-3 Mini at FP16 fits entirely within an RTX 4060’s 8 GB, making it one of the cheapest models to self-host with strong benchmark performance. For a direct comparison with LLaMA 3, read our Phi-3 vs LLaMA 3 8B analysis.
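The FP16 column above follows directly from parameter count: FP16 stores two bytes per parameter, so weights alone cost roughly 2 GB per billion parameters, before KV cache and CUDA context overhead. A quick sketch to reproduce the table's figures:

```bash
# FP16 weights take 2 bytes per parameter, so weight-only VRAM
# is roughly (params in billions) x 2 GB; KV cache and CUDA
# context need extra headroom on top of this.
fp16_weights_gb() { awk -v p="$1" 'BEGIN { printf "%.1f\n", p * 2 }'; }

fp16_weights_gb 3.8   # Phi-3 Mini  → 7.6
fp16_weights_gb 14    # Phi-3 Medium → 28.0
```

The INT4 column comes out higher than a naive 0.5 bytes per parameter because quantised checkpoints also store scales and typically keep some layers at higher precision.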
Install and Serve with vLLM
```bash
# Install vLLM
pip install vllm

# Serve Phi-3 Mini
python -m vllm.entrypoints.openai.api_server \
    --model microsoft/Phi-3-mini-4k-instruct \
    --dtype float16 \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.90 \
    --port 8000

# Test the endpoint
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "microsoft/Phi-3-mini-4k-instruct",
        "messages": [{"role": "user", "content": "Write a Python function to merge two sorted lists."}],
        "max_tokens": 512
    }'
```
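The endpoint returns standard OpenAI-style JSON, so the reply sits under `choices[0].message.content`. A small helper to pull it out of a response body (a sketch; `jq` would work equally well):

```bash
# Extract the assistant's reply from an OpenAI-compatible
# chat/completions response body on stdin.
extract_reply() { python3 -c '
import json, sys
print(json.load(sys.stdin)["choices"][0]["message"]["content"])
'; }

# Example with a captured response body:
echo '{"choices":[{"message":{"role":"assistant","content":"def merge(a, b): ..."}}]}' | extract_reply
```

In practice you would pipe the `curl` output above straight into `extract_reply`.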
For a comparison of serving frameworks, see our vLLM vs Ollama guide.
Quick Start with Ollama
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run Phi-3 Mini
ollama run phi3:mini

# Serve as API
ollama serve &
curl http://localhost:11434/api/generate \
    -d '{"model": "phi3:mini", "prompt": "Explain gradient descent in simple terms."}'
```
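Note that `/api/generate` streams its answer as newline-delimited JSON chunks, each carrying a `response` fragment and a `done` flag; pass `"stream": false` for a single JSON object, or stitch the chunks back together. A sketch of the latter:

```bash
# Reassemble an Ollama NDJSON stream by concatenating the
# "response" fragment from each chunk.
collect_stream() { python3 -c '
import json, sys
print("".join(json.loads(l)["response"] for l in sys.stdin if l.strip()), end="")
'; }

# Example with two captured chunks:
printf '%s\n%s\n' \
    '{"response":"Gradient descent ","done":false}' \
    '{"response":"minimises a loss.","done":true}' | collect_stream
```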
Performance Benchmarks
Benchmarked with vLLM, 512-token input, 256-token output. TTFT is time to first token. See the tokens-per-second benchmark tool for current data.
| Model | GPU | Precision | Gen tok/s | TTFT |
|---|---|---|---|---|
| Phi-3 Mini | RTX 4060 | FP16 | 118 | 92 ms |
| Phi-3 Mini | RTX 4060 | AWQ 4-bit | 162 | 68 ms |
| Phi-3 Mini | RTX 3090 | FP16 | 135 | 78 ms |
| Phi-3 Small | RTX 4060 Ti | FP16 | 82 | 155 ms |
| Phi-3 Medium | RTX 3090 | AWQ 4-bit | 68 | 195 ms |
Phi-3 Mini at FP16 delivers 118 tok/s on the RTX 4060, faster than most 7B models. With only 3.8B parameters, fewer bytes of weights must be read per generated token, which translates directly into higher per-token speed.
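TTFT and generation speed combine into a back-of-envelope request latency: total ≈ TTFT + output tokens ÷ tok/s. Using the RTX 4060 FP16 row above:

```bash
# End-to-end latency (seconds) = TTFT + output_tokens / generation speed
request_latency_s() { awk -v tok="$1" -v tps="$2" -v ttft_ms="$3" \
    'BEGIN { printf "%.2f\n", ttft_ms / 1000 + tok / tps }'; }

# 256 output tokens at 118 tok/s with 92 ms TTFT
request_latency_s 256 118 92   # → 2.26
```

Roughly 2.3 seconds for a full 256-token answer, comfortably interactive for chat workloads.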
Optimisation Tips
- Run at FP16 on 8 GB cards. Phi-3 Mini is small enough that quantisation is unnecessary on the RTX 4060.
- Use the 128K context variant (Phi-3-mini-128k-instruct) for document analysis tasks, but keep actual context under 16K on 8 GB cards.
- Pair with Whisper for voice-to-text-to-response pipelines. Both models fit on a single RTX 4060.
- Use speculative decoding with Phi-3 Mini as a draft model for Phi-3 Medium to boost the larger model’s throughput.
- Enable continuous batching to serve 4+ concurrent users at production quality on budget GPUs.
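The sub-16K guidance above is easy to sanity-check: at FP16 the KV cache costs 2 (K and V) × layers × hidden size × 2 bytes per token. Assuming Phi-3 Mini's 32 layers and hidden size of 3072 (check the model card for the exact shape), that is about 0.4 MB per token:

```bash
# FP16 KV cache in GB: 2 (K+V) x layers x hidden_dim x 2 bytes per token.
# Assumed Phi-3 Mini shape: 32 layers, hidden size 3072.
kv_cache_gb() { awk -v tokens="$1" \
    'BEGIN { printf "%.1f\n", 2 * 32 * 3072 * 2 * tokens / 1e9 }'; }

kv_cache_gb 4096    # default 4K window → 1.6
kv_cache_gb 16384   # 16K window → 6.4
```

At 16K tokens the cache alone nears 6.4 GB, which is why the 128K variant needs quantised weights or a bigger card before its full window becomes usable.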
Estimate running costs with the cost calculator. For the cheapest GPU options, see our budget GPU for AI inference guide.
Next Steps
Phi-3 is one of the most cost-effective models for self-hosting. If you need stronger multilingual support, consider Qwen 2.5. For larger-scale deployments, see our LLaMA hosting options. Browse all models in the model guides section.
Deploy Phi-3 Now
Run Phi-3 on a dedicated GPU server starting from just an RTX 4060. Full root access and UK data centre hosting.
Browse GPU Servers