How to Fine-Tune an LLM with LoRA: A Practical Step-by-Step Guide

What Is LoRA and Why Should You Use It?

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that adapts a large pre-trained language model to a specific task without updating all of its billions of parameters. Instead, LoRA freezes the original weights and adds pairs of small trainable low-rank matrices alongside selected layers, which sharply reduces the number of trainable parameters and, with it, the memory needed for gradients and optimizer state, and often training time as well.
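
To make the savings concrete, the following sketch counts trainable parameters for a single weight matrix. It assumes the standard LoRA factorization, in which a d_out x d_in weight receives an update built from two matrices of shapes d_out x r and r x d_in. The function names and the 4096 x 4096 example size are illustrative and not taken from any particular model.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # B has shape (d_out, r) and A has shape (r, d_in).
    return r * (d_in + d_out)


def full_param_count(d_in: int, d_out: int) -> int:
    """Parameters updated when the whole matrix is fine-tuned."""
    return d_in * d_out


# Illustrative size only: a square 4096 x 4096 projection with rank 16.
full = full_param_count(4096, 4096)
lora = lora_param_count(4096, 4096, r=16)
print(f"full: {full:,}  lora: {lora:,}  fraction trained: {lora / full:.2%}")
```

For this example the adapter trains 131,072 parameters instead of 16,777,216, which is under 1% of the matrix.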

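The result: combined with 4-bit quantization of the base model, you can fine-tune a 7B-parameter model on a single consumer GPU, which is generally out of reach for full fine-tuning because weights, gradients, and optimizer state together need far more memory than such a card provides.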
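
A back-of-the-envelope estimate shows why. The sketch below assumes mixed-precision training with the Adam optimizer and uses a common rule of thumb of about 16 bytes per parameter. Activations and framework overhead are ignored, so treat the figure as a rough lower bound under those assumptions rather than a measurement.

```python
# Rule-of-thumb bytes per parameter for mixed-precision Adam training.
# These values are an assumption, not a property of any specific library.
BYTES_PER_PARAM = {
    "half-precision weights": 2,
    "half-precision gradients": 2,
    "float32 master weights": 4,
    "Adam first moment": 4,
    "Adam second moment": 4,
}


def full_finetune_memory_gb(n_params: float) -> float:
    """Approximate training-state memory in decimal gigabytes."""
    return n_params * sum(BYTES_PER_PARAM.values()) / 1e9


print(f"{full_finetune_memory_gb(7e9):.0f} GB")  # 112 GB for a nominal 7B model
```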

Prerequisites

Python 3.9+ installed
A GPU with at least 8GB VRAM (16GB recommended for 7B models)
Basic familiarity with PyTorch and HuggingFace Transformers
A labeled dataset relevant to your task

Step 1: Install Required Libraries

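You'll need HuggingFace's transformers, peft (Parameter-Efficient Fine-Tuning), and datasets, plus bitsandbytes for quantization, accelerate for device placement, and trl for the supervised fine-tuning trainer used in Step 5: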

pip install transformers peft datasets bitsandbytes accelerate trl
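
Because these libraries change quickly, it can help to record which versions were actually installed before moving on. The following is a minimal sketch using only the Python standard library; the helper name installed_versions is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

PACKAGES = ["transformers", "peft", "datasets", "bitsandbytes", "accelerate", "trl"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(PACKAGES).items():
    print(f"{name:15} {ver or 'NOT INSTALLED'}")
```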

Step 2: Load Your Base Model with Quantization

Use 4-bit quantization via bitsandbytes to reduce VRAM usage:

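import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 4-bit weights; compute runs in bfloat16 (use torch.float16 on GPUs without bfloat16 support)
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.pad_token = tokenizer.eos_token  # this tokenizer has no padding token by default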
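
To see roughly what quantization buys, the sketch below estimates the memory taken by the weights alone at different precisions. It is a simplification: it assumes a nominal 7 billion parameters and ignores quantization constants, layers that stay in higher precision, activations, and CUDA overhead, so real usage will be somewhat higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Approximate memory for the model weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 7e9  # nominal size of a "7B" model; the exact count varies by model

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(N_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights drop from about 14 GB in 16-bit to about 3.5 GB in 4-bit, which is what leaves room for adapters and activations on a smaller GPU.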

Step 3: Configure LoRA

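Define the LoRA configuration. The key hyperparameters are the rank (r), which sets the inner dimension of the adapter matrices, and lora_alpha, which scales the adapter's contribution: the low-rank update is multiplied by lora_alpha / r, so the values below give a scaling factor of 2: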

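from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Ready the quantized model for training (freezes base weights, enables gradient checkpointing)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,                                 # rank of the adapter matrices
    lora_alpha=32,                        # update is scaled by lora_alpha / r
    target_modules=["q_proj", "v_proj"],  # attention projections that receive adapters
    lora_dropout=0.05,
    bias="none",                          # leave bias terms frozen
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # sanity check: only adapter weights should be trainable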
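
To illustrate what rank and alpha do inside an adapted layer, here is a minimal sketch of the LoRA forward pass, W x + (alpha / r) * B (A x), written with plain Python lists. The tiny dimensions and helper names are assumptions chosen for readability; real implementations use tensors and add dropout on the adapter path.

```python
import random


def matvec(matrix, vec):
    """Multiply a nested-list matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def lora_forward(W, A, B, x, r, alpha):
    """Compute W x + (alpha / r) * B (A x)."""
    scaling = alpha / r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + scaling * u for b, u in zip(base, update)]


random.seed(0)
d_in, d_out, r, alpha = 6, 4, 2, 4
W = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]  # frozen base weight
A = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]      # trainable, random init
B = [[0.0] * r for _ in range(d_out)]                                  # trainable, zero init
x = [1.0] * d_in

# With B initialised to zero the adapter contributes nothing, so the
# adapted layer starts out identical to the frozen base layer.
print(lora_forward(W, A, B, x, r, alpha) == matvec(W, x))  # True
```

Starting B at zero follows the original LoRA design: training begins from the unmodified base model and only gradually moves away from it.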

Step 4: Prepare Your Dataset

Format your data as instruction-response pairs. A common choice is a simplified version of the Alpaca template:

def format_prompt(example):
    return f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"

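Load your data with the HuggingFace datasets library and apply format_prompt to every record so that each one has a "text" field; the next step assumes the result is stored in a variable named dataset. SFTTrainer tokenizes that text itself, so your main job is to cap the maximum sequence length at a value your GPU memory can handle.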
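
One simple way to get from raw records to a loadable file is to write the formatted prompts as JSON Lines. The sketch below uses only the standard library; the helper name write_sft_jsonl, the file name train.jsonl, and the two sample records are hypothetical.

```python
import json


def format_prompt(example):
    return f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"


def write_sft_jsonl(examples, path):
    """Write one JSON object per line, each with a single "text" field."""
    count = 0
    with open(path, "w", encoding="utf-8") as f:
        for example in examples:
            record = {"text": format_prompt(example)}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    examples = [
        {"instruction": "Translate to French: Hello", "output": "Bonjour"},
        {"instruction": "Add 2 and 3.", "output": "5"},
    ]
    print(write_sft_jsonl(examples, "train.jsonl"), "records written")
```

A file in this shape can then be read with the datasets library, for example load_dataset("json", data_files="train.jsonl", split="train"), which yields a dataset with a "text" column.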

Step 5: Train with the SFTTrainer

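The trl library's SFTTrainer builds on the Transformers Trainer and takes care of tokenization, batching, and the training loop for text datasets. Its constructor arguments have changed between TRL releases, so check the snippet against the version you installed: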

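from trl import SFTConfig, SFTTrainer

# Recent TRL releases take SFT options such as dataset_text_field through SFTConfig
# (a TrainingArguments subclass); older releases accepted them on SFTTrainer directly.
training_args = SFTConfig(
    output_dir="./lora-output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    dataset_text_field="text",  # column that holds the formatted prompts
)
trainer = SFTTrainer(model=model, train_dataset=dataset, args=training_args)
trainer.train()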
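
When planning a run, it helps to know roughly how many optimizer steps these settings produce, for example to choose a warmup length. The sketch below is simple arithmetic, not the trainer's own logic: it assumes a single GPU, no sequence packing, and a hypothetical dataset of 1,000 examples, and the trainer's exact rounding may differ when gradient accumulation is used.

```python
import math


def optimizer_steps(n_examples, per_device_batch_size, num_epochs,
                    grad_accum_steps=1, n_devices=1):
    """Return (effective batch size, approximate total optimizer steps)."""
    effective_batch = per_device_batch_size * grad_accum_steps * n_devices
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return effective_batch, steps_per_epoch * num_epochs


# Hypothetical dataset size with the batch size and epoch count used above.
print(optimizer_steps(1000, per_device_batch_size=4, num_epochs=3))  # (4, 750)
```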

Step 6: Save and Merge the Adapter

Save just the LoRA adapter weights (typically tens of megabytes for a 7B model at this rank, versus many gigabytes for the full model) and optionally merge them back into the base model for deployment:

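from peft import PeftModel

model.save_pretrained("./my-lora-adapter")  # writes the adapter weights only
# Merging into 4-bit weights is lossy, so reload the base model in its original precision first
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1", torch_dtype="auto")
merged_model = PeftModel.from_pretrained(base, "./my-lora-adapter").merge_and_unload()
merged_model.save_pretrained("./my-merged-model")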
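
As a rough check on how large the saved adapter should be, the sketch below counts adapter parameters for the configuration from Step 3. The layer count and projection shapes are assumptions about Mistral-7B-v0.1 (32 layers, q_proj mapping 4096 to 4096, v_proj mapping 4096 to 1024); confirm them against the model's config before relying on the numbers.

```python
def adapter_params(layers, shapes, r):
    """Count LoRA parameters; shapes lists (d_in, d_out) per adapted projection in one layer."""
    return layers * sum(r * (d_in + d_out) for d_in, d_out in shapes)


# Assumed Mistral-7B-v0.1 dimensions: 32 layers, q_proj 4096 -> 4096, v_proj 4096 -> 1024.
n = adapter_params(layers=32, shapes=[(4096, 4096), (4096, 1024)], r=16)
print(f"{n:,} adapter parameters")
print(f"{n * 4 / 1e6:.1f} MB in float32, {n * 2 / 1e6:.1f} MB in 16-bit")
```

Under these assumptions the adapter has about 6.8 million parameters, so the saved file should be on the order of 14 to 27 MB depending on the storage precision.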

Tips for Better Results

Start with r=8 or r=16; higher ranks give more capacity but use more memory.
Apply LoRA to both Q and V projection matrices at minimum; adding K and output projections often improves results.
Use a cosine learning rate schedule with warmup for stable training.
Monitor validation loss to detect overfitting early — LoRA can overfit quickly on small datasets.
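
To show what a cosine schedule with warmup does to the learning rate, here is a minimal standalone sketch of the multiplier applied at each optimizer step: a linear ramp from 0 to 1 during warmup, then a half-cosine decay to 0. The function name, the peak learning rate, and the step counts are hypothetical values chosen for illustration.

```python
import math


def cosine_with_warmup(step, warmup_steps, total_steps):
    """Learning-rate multiplier in [0, 1] for a given optimizer step."""
    if step < warmup_steps:
        # Linear ramp from 0 up to 1 over the warmup period.
        return step / max(1, warmup_steps)
    # Fraction of the post-warmup schedule completed so far, capped at 1.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


peak_lr = 2e-4  # hypothetical peak learning rate
for step in (0, 50, 100, 425, 750):
    lr = peak_lr * cosine_with_warmup(step, warmup_steps=100, total_steps=750)
    print(f"step {step:4d}: lr = {lr:.2e}")
```

In Transformers, a schedule of this shape is selected by setting lr_scheduler_type="cosine" together with warmup_steps in the training arguments.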

Conclusion

LoRA has democratized LLM fine-tuning by making it accessible to anyone with a modern GPU. With the right dataset and configuration, you can adapt powerful open-weight models to specialized tasks in hours rather than weeks.