Hermes 3 from Nous Research is a fine-tune of the Llama 3 base models, aimed at role-play, agent workflows, and less-restrictive general-purpose use. The 8B variant fits well on the RTX 5060 Ti 16GB on our hosting.
Fit
Hermes 3 8B is a fine-tune of Llama 3.1 8B, so its VRAM requirements match the base model:
- FP16: ~16 GB
- FP8: ~8 GB
- AWQ INT4: ~5 GB
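The figures above are roughly the parameter count times the bytes per weight. A minimal weights-only sketch (the parameter count is the published Llama 3.1 8B figure; real usage adds KV cache, activations, and runtime overhead, which is why the INT4 figure above is higher than the raw weight size):

```python
# Weights-only VRAM estimate for Llama 3.1 8B at different precisions.
# Excludes KV cache, activations, and CUDA runtime overhead.
PARAMS = 8.03e9  # published Llama 3.1 8B parameter count

def weight_gb(bytes_per_param: float) -> float:
    """Gigabytes of weight storage at the given precision."""
    return PARAMS * bytes_per_param / 1e9

for name, bpp in [("FP16", 2.0), ("FP8", 1.0), ("AWQ INT4", 0.5)]:
    print(f"{name}: ~{weight_gb(bpp):.1f} GB weights")
```

FP16 lands at ~16 GB of weights alone, which is why the 16GB card needs FP8 or INT4 to leave room for KV cache.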
Deployment
python -m vllm.entrypoints.openai.api_server \
--model NousResearch/Hermes-3-Llama-3.1-8B \
--quantization fp8 \
--max-model-len 16384 \
--gpu-memory-utilization 0.92
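Once the server is up it exposes an OpenAI-compatible API. A minimal request sketch, assuming the default vLLM port of 8000 on localhost (the request line is commented out so the snippet stands alone):

```python
# Sketch of a chat completion request to the vLLM server started above.
# Assumes the server listens on localhost:8000, vLLM's default.
import json
import urllib.request

payload = {
    "model": "NousResearch/Hermes-3-Llama-3.1-8B",
    "messages": [
        {"role": "system", "content": "You are a terse assistant."},
        {"role": "user", "content": "Name one agent use case."},
    ],
    "max_tokens": 64,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment with a running server:
# with urllib.request.urlopen(req) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
```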
ChatML Template
Hermes uses ChatML role markers rather than the base Llama chat template. vLLM auto-detects the template from the tokeniser config, so no extra flag is needed. For tool use, Hermes follows the specific format documented in the model card; test tool calls during integration.
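For reference, ChatML wraps each message in `<|im_start|>role ... <|im_end|>` markers. A sketch of the rendering (illustrative only; in practice vLLM applies the tokeniser's own template, so you never build this string by hand):

```python
# Illustrative ChatML rendering, to make the wire format concrete.
def to_chatml(messages):
    """Render a message list into a ChatML prompt string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    # Trailing open assistant turn cues the model to respond.
    return "".join(parts) + "<|im_start|>assistant\n"

prompt = to_chatml([
    {"role": "system", "content": "You are Hermes."},
    {"role": "user", "content": "Hello"},
])
print(prompt)
```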
Strengths
Hermes 3 tends to:
- Follow system prompts more faithfully (less refusal drift)
- Handle role-play and character persona prompts better
- Produce agent-style structured outputs more reliably
- Be less restrictive on edge-case topics where stock Llama refuses
Tradeoffs
Fine-tunes can drift from the base model’s calibration on factual questions. For pure factual Q&A, stock Llama 3 is often safer. Pick Hermes when:
- System prompt adherence matters (creative writing, role-play, branded AI personas)
- You need agent-style tool use reliability
- You want less “as an AI language model…” hedging
Decode speed on the 5060 Ti matches base Llama 3 8B – ~100-110 t/s at FP8.
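At that rate, response latency is easy to budget. A back-of-envelope sketch using the midpoint of the quoted range:

```python
# Decode-time estimate from the ~100-110 t/s FP8 figure quoted above.
def decode_seconds(tokens: int, tok_per_s: float = 105.0) -> float:
    """Seconds to generate `tokens` output tokens at a steady decode rate."""
    return tokens / tok_per_s

print(f"{decode_seconds(500):.1f} s for a 500-token reply")
```

A typical 500-token agent turn decodes in under five seconds, before time-to-first-token.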
Agent-Tuned LLM Hosting
Hermes 3 on Blackwell 16GB. UK dedicated hosting.
Order the RTX 5060 Ti 16GB. See also: the full Hermes 3 guide and the agent backend use case.