Unsloth ships custom Triton kernels for LoRA forward/backward, optimised attention, and rewritten MLP blocks. On the RTX 5060 Ti 16GB we host, it's 1.7-2x faster than vanilla Transformers for the same config.
Install
```bash
pip install "unsloth[cu121-ampere-torch250] @ git+https://github.com/unslothai/unsloth.git"
```
(Use the matching CUDA build; check Unsloth docs for current flags. Blackwell is supported via the Ampere+ build path.)
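A quick post-install sanity check (a minimal sketch; the assertion and print are illustrative) confirms the GPU is visible and Unsloth imports cleanly:

```python
# Sanity check after install: Unsloth should import without errors and see the GPU.
import torch
from unsloth import FastLanguageModel  # importing also sets up Unsloth's Triton kernels

assert torch.cuda.is_available(), "No CUDA device visible"
print(torch.cuda.get_device_name(0))  # expect the RTX 5060 Ti here
```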
Measured Speed Uplift
QLoRA on Llama 3.1 8B, seq 2048, bs 4:
| Framework | tokens/s | sec/step | Relative |
|---|---|---|---|
| HF Transformers | 4,900 | 1.68 | 1.0x |
| Unsloth | 8,700 | 0.94 | 1.78x |
Mistral 7B shows a similar ~1.7x uplift, and Qwen 2.5 14B QLoRA at bs 2 also reaches ~1.8x.
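For reference, a minimal sketch of a comparable QLoRA run. The checkpoint, dataset, and hyperparameters are illustrative (following Unsloth's public examples), and the `SFTTrainer` signature shown matches older TRL releases; newer TRL versions have moved towards `SFTConfig`:

```python
from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

# QLoRA setup mirroring the benchmark: Llama 3.1 8B, seq 2048, batch size 4.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",  # pre-quantised 4-bit weights
    max_seq_length=2048,
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    use_gradient_checkpointing="unsloth",  # Unsloth's low-VRAM checkpointing
)

# Example dataset; flatten instruction/output pairs into a single "text" field.
dataset = load_dataset("yahma/alpaca-cleaned", split="train")
dataset = dataset.map(lambda ex: {"text": f"{ex['instruction']}\n{ex['output']}"})

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=4,
        bf16=True,          # Blackwell supports bf16 natively
        max_steps=100,
        output_dir="outputs",
    ),
)
trainer.train()
```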
Memory Savings
Unsloth’s gradient checkpointing and fused kernels reduce peak VRAM:
| Config | HF peak | Unsloth peak |
|---|---|---|
| Llama 3 8B seq 2048 bs 4 | 11.8 GB | 9.6 GB |
| Llama 3 8B seq 4096 bs 2 | 13.2 GB | 10.4 GB |
| Llama 3 8B seq 8192 bs 1 | OOM | 11.6 GB |
These savings make seq 8192 QLoRA training possible on 16 GB, which vanilla HF cannot do at all.
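The long-context row above relies on Unsloth's own checkpointing mode. A sketch of the relevant settings (checkpoint name illustrative, values from the table above):

```python
from unsloth import FastLanguageModel

# Long-context QLoRA on 16 GB: versus the seq-2048 run, only the sequence
# length and batch size change.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=8192,  # peaked around 11.6 GB with bs 1 in our runs
    load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    use_gradient_checkpointing="unsloth",  # the flag that keeps seq 8192 under 16 GB
)
# Then train with per_device_train_batch_size=1 at this length.
```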
Caveats
- Supports Llama, Mistral, Gemma, Qwen, Phi, CodeLlama – a narrower model list than HF
- Custom `FastLanguageModel.from_pretrained()` API (slightly different from HF)
- Chat templates auto-applied via Unsloth's `get_chat_template()` (see the sketch after this list)
- Multi-GPU requires Unsloth Pro (paid tier)
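A minimal sketch of the chat-template step; `tokenizer` is the one returned by `FastLanguageModel.from_pretrained()` above, and the template name is an example (Unsloth documents the supported names):

```python
from unsloth.chat_templates import get_chat_template

# Wrap the tokenizer so apply_chat_template emits the model's expected format.
tokenizer = get_chat_template(tokenizer, chat_template="llama-3")  # e.g. "chatml"

messages = [{"role": "user", "content": "Summarise QLoRA in one sentence."}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```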
For single-GPU 7-14B QLoRA on 16 GB, Unsloth is the default choice.
Unsloth Fine-Tuning on Blackwell 16GB
1.78x faster, lower VRAM, 8192-seq capable. UK dedicated hosting.
Order the RTX 5060 Ti 16GB

See also: QLoRA speed, LoRA speed, QLoRA guide, LoRA guide, fine-tune throughput.