Hermes 3 from Nous Research is a series of fine-tunes of the Llama 3.1 base models (8B, 70B, 405B), tuned for agent workflows, role-play, and less-restrictive general-purpose use. On our dedicated GPU hosting, hardware requirements match stock Llama 3.1 exactly, so you can swap the model ID into any existing Llama setup.
Variants
| Variant | Base | VRAM (INT4) |
|---|---|---|
| Hermes 3 8B | Llama 3.1 8B | ~5 GB |
| Hermes 3 70B | Llama 3.1 70B | ~40 GB |
| Hermes 3 405B | Llama 3.1 405B | Multi-GPU only |
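The VRAM figures above follow from a simple rule of thumb: INT4 weights take roughly half a byte per parameter, plus a few GB of overhead for the KV cache and activations. A rough sketch (the helper name and overhead constant are illustrative, not from any library):

```python
def int4_vram_gb(params_billion: float, overhead_gb: float = 1.5) -> float:
    """Back-of-envelope VRAM estimate for INT4-quantised weights.

    ~0.5 bytes per parameter, plus a flat allowance for KV cache and
    activations (real overhead grows with context length and batch size).
    """
    return params_billion * 0.5 + overhead_gb

print(round(int4_vram_gb(8), 1))   # 8B model: roughly in the ~5 GB range
print(round(int4_vram_gb(70), 1))  # 70B model: roughly in the ~40 GB range
```

The 405B variant lands well beyond any single card at INT4, which is why the table lists it as multi-GPU only.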
Deployment
```bash
python -m vllm.entrypoints.openai.api_server \
  --model NousResearch/Hermes-3-Llama-3.1-70B \
  --quantization awq \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.93
```
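Once the server is up, it speaks the OpenAI-compatible chat API on vLLM's default port 8000. A minimal request payload looks like this (the message contents are placeholders; the model ID must match the `--model` flag above):

```python
import json

# Chat request for the OpenAI-compatible endpoint started above.
payload = {
    "model": "NousResearch/Hermes-3-Llama-3.1-70B",
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarise Hermes 3 in one sentence."},
    ],
    "max_tokens": 128,
}
print(json.dumps(payload, indent=2))
# POST this to http://localhost:8000/v1/chat/completions with any HTTP client,
# or point the official openai Python SDK at that base URL.
```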
Hermes 3 uses the ChatML chat template, which differs slightly from Llama's default format. vLLM auto-detects it from the tokeniser config, so no extra flags are needed.
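For reference, ChatML wraps each turn in `<|im_start|>` / `<|im_end|>` markers. A minimal rendering sketch (illustrative only; in practice vLLM applies the template from the tokeniser config for you):

```python
def to_chatml(messages):
    """Render a message list into the ChatML prompt format."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages
    ]
    # Open an assistant turn so the model generates the reply.
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

prompt = to_chatml([
    {"role": "system", "content": "You are Hermes."},
    {"role": "user", "content": "Hello"},
])
print(prompt)
```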
Strengths
Hermes 3 tends to:
- Follow system prompts more faithfully (less refusal drift)
- Handle complex role-play and character persona prompts better
- Produce agent-style structured outputs (tool calls) more reliably
- Be less restrictive on edge-case topics where stock Llama refuses
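On the structured-output point: Hermes emits tool calls as JSON wrapped in `<tool_call>` tags. A minimal extraction sketch, assuming well-formed, non-nested tags (the example reply and tool name are made up):

```python
import json
import re

def parse_tool_calls(text: str) -> list[dict]:
    """Extract JSON tool calls from <tool_call>...</tool_call> spans."""
    return [
        json.loads(m)
        for m in re.findall(r"<tool_call>\s*(.*?)\s*</tool_call>", text, re.DOTALL)
    ]

reply = (
    'Sure. <tool_call>'
    '{"name": "get_weather", "arguments": {"city": "London"}}'
    '</tool_call>'
)
calls = parse_tool_calls(reply)
print(calls[0]["name"])  # get_weather
```

A production agent loop would validate each call against its tool schema before executing it, rather than trusting the model's JSON blindly.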
Trade-off: fine-tunes can drift from the base model’s calibration on factual questions. For pure factual Q&A stock Llama 3.3 is often safer.
Agent-Tuned LLM Hosting
Hermes 3 variants on UK dedicated GPUs, matched to the size you need.
Browse GPU Servers

For base Llama see Llama 3.3 70B; for the coding-tuned track see Qwen Coder 32B.