When Should You Fine-Tune?
Fine-tuning is the right choice when prompt engineering hits its ceiling. If your domain has specialized terminology, a specific output format, or strict latency requirements that rule out large frontier models, fine-tuning a smaller base model is often more accurate on the target task and roughly 10x cheaper to serve.
Modern Fine-Tuning Techniques
- LoRA (Low-Rank Adaptation) — freezes the base weights and trains only small rank-decomposition matrices injected alongside them, reducing trainable parameters by up to 10,000x with minimal quality loss (first sketch after this list).
- QLoRA — combines LoRA with 4-bit quantization of the frozen base weights, enabling 70B model fine-tuning on a single 80 GB A100 (second sketch below).
- DPO (Direct Preference Optimization) — aligns model behavior with human preferences without training a separate reward model, simplifying RLHF pipelines (third sketch below).
- Full fine-tuning — still the gold standard for maximum quality when you have sufficient data and GPU budget.
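To make the LoRA item concrete, here is a minimal configuration sketch using Hugging Face's peft library. The model name, rank, scaling factor, and target module names are illustrative assumptions that vary by architecture and task, not recommendations.

```python
# Minimal LoRA sketch with Hugging Face peft (hypothetical hyperparameters).
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # assumed model

lora_config = LoraConfig(
    r=8,                                   # rank of the update matrices A and B
    lora_alpha=16,                         # scaling factor applied to the low-rank update
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt (Llama naming)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)  # wraps the frozen base with trainable adapters
model.print_trainable_parameters()         # prints trainable vs. total parameter counts
```

Because only the small A and B matrices receive gradients, the optimizer state shrinks proportionally, which is where most of the memory savings come from.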
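The QLoRA recipe layers those same adapters on top of a base model whose frozen weights are quantized to 4 bits. A sketch assuming transformers with bitsandbytes installed; the 70B checkpoint name is illustrative, so pick whatever fits your GPU.

```python
# Minimal QLoRA sketch: 4-bit NF4 base weights + LoRA adapters on top.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4, as in the QLoRA paper
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16, # dtype used for the actual matmuls
)

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",           # assumed model name
    quantization_config=bnb_config,
    device_map="auto",
)

# Upcasts norms for stability and enables gradient checkpointing by default.
base = prepare_model_for_kbit_training(base)
model = get_peft_model(base, LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"))
```

Only the bf16 adapter weights are trained; the 4-bit base is dequantized on the fly during each forward pass.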
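For DPO, the core idea fits in a few lines: the loss compares how much more the policy prefers the chosen completion over the rejected one than a frozen reference model does. A minimal sketch in plain PyTorch, assuming per-sequence summed log-probabilities are precomputed; in practice a trainer library handles batching and the reference forward pass.

```python
# DPO loss sketch: inputs are per-sequence summed log-probs of the chosen (w)
# and rejected (l) completions under the policy and the frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    # Log-ratio of policy to reference for each completion.
    chosen_ratio = policy_logp_w - ref_logp_w
    rejected_ratio = policy_logp_l - ref_logp_l
    # Maximize the log-sigmoid of the scaled margin (Rafailov et al., 2023).
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy usage with a batch of two preference pairs (made-up numbers):
loss = dpo_loss(
    torch.tensor([-12.3, -9.8]), torch.tensor([-15.1, -11.2]),
    torch.tensor([-13.0, -10.0]), torch.tensor([-14.2, -11.0]),
)
```

The beta parameter controls how far the policy is allowed to drift from the reference model; it plays the role of the KL penalty in classic RLHF.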
In our own work, QLoRA let us fine-tune a 13B Llama model on our proprietary legal corpus in under 8 hours on two A100s, and the resulting model outperformed GPT-4 on our internal contract-review benchmarks.