Fine-tune a language model with LoRA
LoRA (Low-Rank Adaptation) fine-tunes a model by training tiny adapter matrices while the full weight matrices stay frozen. You get most of full fine-tuning's quality at a fraction of the compute.
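Concretely, for a frozen weight matrix W, LoRA learns two small matrices B and A and uses W + (alpha/r) * (B @ A) as the effective weight. A rough sketch of the parameter math for a single Llama-3-8B attention projection (shapes chosen for illustration; counts cover only that one matrix, not the whole model):

```python
import torch

d, r = 4096, 16                       # hidden size of one projection; LoRA rank
W = torch.randn(d, d)                 # frozen base weight
A = torch.randn(r, d)                 # trainable, r x d
B = torch.zeros(d, r)                 # trainable, d x r (zero-init so training starts at W)
alpha = 32

W_adapted = W + (alpha / r) * (B @ A)  # effective weight at inference time

full = W.numel()                      # ~16.8M params in the full matrix
lora = A.numel() + B.numel()          # ~131K params in the adapter
print(f'{lora / full:.2%} of the full matrix is trainable')  # ~0.78%
```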
Prerequisites
- GPU with 24 GB+ VRAM (e.g. A100, RTX 4090), or a rented cloud GPU such as Colab Pro
- Python 3.10+
- A JSONL dataset of instruction-response pairs
Step-by-Step
1. Install the stack
PEFT provides the LoRA implementation. transformers + accelerate handle the training loop. bitsandbytes enables 4-bit quantization.
```bash
pip install transformers peft accelerate bitsandbytes datasets trl
```
2. Load the model in 4-bit
Quantizing the base model to 4-bit frees VRAM for activations, gradients, and the adapter's optimizer state.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Meta-Llama-3-8B',
    quantization_config=bnb,
    device_map='auto',  # place the quantized weights on the GPU
)
tok = AutoTokenizer.from_pretrained('meta-llama/Meta-Llama-3-8B')
tok.pad_token = tok.eos_token  # Llama 3 ships without a pad token; training needs one
```
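As a quick sanity check that quantization actually took effect, transformers can report the footprint of the loaded weights (rough numbers, weights only; activations are not counted):

```python
# Expect roughly 5-6 GB for an 8B model in 4-bit, versus ~16 GB in bf16.
print(f'{model.get_memory_footprint() / 1e9:.1f} GB')
```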
3. Configure LoRA
The rank r controls adapter capacity. Start at 16 and increase it if quality plateaus.
```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)  # standard prep for training on a 4-bit base

cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=['q_proj', 'v_proj'],
    lora_dropout=0.05,
    task_type='CAUSAL_LM',
)
model = get_peft_model(model, cfg)
model.print_trainable_parameters()
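Targeting only q_proj and v_proj is the cheapest setup. If quality plateaus, a common next step (sketched here with the usual Llama-style module names, which are an assumption) is to raise r and cover all attention and MLP projections:

```python
# Bigger adapter: more capacity, more VRAM, slower training.
cfg_wide = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=['q_proj', 'k_proj', 'v_proj', 'o_proj',
                    'gate_proj', 'up_proj', 'down_proj'],
    lora_dropout=0.05,
    task_type='CAUSAL_LM',
)
```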
4. Format the dataset
Use the model's chat template. Mismatched formatting is the #1 cause of useless fine-tunes.
```python
from datasets import load_dataset

ds = load_dataset('json', data_files='train.jsonl')

def fmt(ex):
    return {'text': tok.apply_chat_template(ex['messages'], tokenize=False)}

ds = ds.map(fmt)
```
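Before training, eyeball one rendered example to catch template mismatches early. This assumes each JSONL record carries a 'messages' list in the usual role/content form:

```python
# Assumed record shape:
# {"messages": [{"role": "user", "content": "..."},
#               {"role": "assistant", "content": "..."}]}
print(ds['train'][0]['text'])  # should show the model's special tokens, e.g. <|start_header_id|>
```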
5. Train with SFTTrainer
TRL's SFTTrainer wraps the training loop. Three epochs at a learning rate of 2e-4 is a reasonable default; scale the batch size to your VRAM (a gradient-accumulation variant follows the code).
```python
from trl import SFTTrainer, SFTConfig

train_cfg = SFTConfig(
    output_dir='out',
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-4,
    bf16=True,
    dataset_text_field='text',  # recent TRL expects this on SFTConfig rather than SFTTrainer
)
trainer = SFTTrainer(model=model, args=train_cfg, train_dataset=ds['train'])
trainer.train()
```
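If a batch of 4 overflows your VRAM, gradient accumulation trades throughput for memory while keeping a healthy effective batch size. A sketch; the accumulation factor of 16 is an illustrative assumption:

```python
# Effective batch = per_device_train_batch_size * gradient_accumulation_steps = 16 here.
low_mem_cfg = SFTConfig(
    output_dir='out',
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # illustrative; tune to your VRAM
    learning_rate=2e-4,
    bf16=True,
    dataset_text_field='text',
)
```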
6. Merge and serve
Merge the adapter into the base model for standalone inference, or keep it separate so you can swap adapters at load time.
```python
# Note: merging into a 4-bit base can lose precision; many recipes reload the base in bf16 first.
model = model.merge_and_unload()  # folds the LoRA weights into the base
model.save_pretrained('llama3-finetuned')
tok.save_pretrained('llama3-finetuned')  # ship the tokenizer alongside the weights
```
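If you skip the merge, load the saved adapter on top of a fresh base model instead. A sketch, assuming the adapter was saved to a hypothetical 'llama3-adapter' directory (e.g. via trainer.save_model('llama3-adapter')):

```python
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Meta-Llama-3-8B',
    torch_dtype=torch.bfloat16,
)
model = PeftModel.from_pretrained(base, 'llama3-adapter')  # hypothetical adapter dir
```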
Common Pitfalls
- Wrong chat template silently destroys quality. Print one example before training (step 4 shows the one-liner).
- A tiny dataset (<500 examples) overfits within the first epoch; hold out a validation split to catch it (see the sketch below).
- Saving without merging produces a tiny adapter that needs the base model at load time.
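For the overfitting pitfall, a sketch of a validation split on top of the step 5 setup (the 5% split and the seed are arbitrary assumptions):

```python
# Continues from step 5. Rising eval loss across epochs flags overfitting.
split = ds['train'].train_test_split(test_size=0.05, seed=42)

eval_cfg = SFTConfig(
    output_dir='out',
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-4,
    bf16=True,
    dataset_text_field='text',
    eval_strategy='epoch',  # named evaluation_strategy on older transformers releases
)
trainer = SFTTrainer(
    model=model,
    args=eval_cfg,
    train_dataset=split['train'],
    eval_dataset=split['test'],
)
trainer.train()
```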
What's Next
- Run evals (MMLU, HellaSwag, your task-specific benchmark) before and after fine-tuning.
- Try DPO or ORPO for preference tuning on top of SFT.
