Finetuning LLaMA 3.1 8B
2x faster finetuning with Unsloth + QLoRA on a free Colab GPU
Overview
This project demonstrates finetuning Meta’s LLaMA 3.1 8B model using Unsloth — an open-source library that makes LLM finetuning 2x faster with 60% less VRAM compared to standard Hugging Face + Flash Attention 2 workflows. The entire pipeline runs on a free Google Colab T4 GPU, making state-of-the-art model customization accessible without any local hardware.
The primary goal here is to establish a working pipeline for hosting and showcasing interactive Google Colab notebooks directly within the site — proving that computational notebooks can live alongside traditional project write-ups as first-class portfolio content.
Notebook
The full finetuning notebook is embedded below via GitHub Gist. You can scroll through all the cells, inspect the code, and click “Open in Colab” to run it yourself with a free GPU runtime.
Resources
LLaMA 3.1 8B + Unsloth Finetuning Notebook
The primary Colab notebook — full finetuning pipeline with QLoRA, 4-bit quantization, and Hugging Face integration.
Unsloth — GitHub Repository
Open-source LLM finetuning library. 2x faster, 60% less memory. Supports LLaMA, Mistral, Gemma, Qwen, and more.
Unsloth Notebooks Collection
100+ finetuning tutorial notebooks on Google Colab and Kaggle covering LLaMA, Mistral, Phi, Gemma, and more.
Unsloth Blog — Finetune LLaMA 3.1
Official walkthrough and benchmarks for LLaMA 3.1 finetuning with Unsloth. Performance comparisons and configuration guide.
Pipeline Overview
The finetuning pipeline follows a standard supervised finetuning (SFT) workflow, accelerated by Unsloth’s custom CUDA kernels and memory optimizations. The key stages:
1. Model Loading with 4-bit Quantization
LLaMA 3.1 8B is loaded in 4-bit quantized format using bitsandbytes, reducing the memory footprint from ~16 GB to ~5 GB. Unsloth patches the model architecture at load time to enable its fused attention kernels.
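A minimal sketch of the loading step, following Unsloth's `FastLanguageModel` API. The pre-quantized repo name and sequence length here are illustrative choices, not necessarily the exact values used in the notebook:

```python
from unsloth import FastLanguageModel

# Load LLaMA 3.1 8B with 4-bit (NF4) quantization via bitsandbytes.
# Unsloth patches the architecture at load time with its fused kernels.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",  # pre-quantized weights (assumed repo)
    max_seq_length=2048,   # illustrative; pick to fit your VRAM budget
    dtype=None,            # auto-detect: float16 on T4, bfloat16 on Ampere+
    load_in_4bit=True,     # ~16 GB fp16 footprint drops to roughly ~5 GB
)
```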
2. LoRA Adapter Configuration
QLoRA adapters are applied to all linear projection layers (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj). This adds only ~2% trainable parameters while still adapting every major projection in the network, so the base weights stay frozen in 4-bit.
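The adapter setup can be sketched with Unsloth's `get_peft_model` wrapper. The hyperparameters below (rank, alpha, seed) are common illustrative defaults, not necessarily the notebook's exact configuration:

```python
from unsloth import FastLanguageModel

# Attach LoRA adapters to every linear projection listed above;
# the quantized base weights remain frozen.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                      # LoRA rank (illustrative)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,             # scaling factor, typically set equal to r
    lora_dropout=0,            # 0 enables Unsloth's fast path
    bias="none",
    use_gradient_checkpointing="unsloth",  # extra VRAM savings for long contexts
    random_state=3407,
)
```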
3. Training with SFTTrainer
Training uses the SFTTrainer from Hugging Face's TRL library with Unsloth's optimized backend. On a free Colab T4, a full finetuning run on the Alpaca dataset completes in roughly 30–45 minutes.
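The training step looks roughly like the sketch below. The batch size, learning rate, and step count are typical Colab T4 settings rather than the notebook's exact values, and `dataset` is assumed to be an Alpaca-style dataset already formatted into a `"text"` field:

```python
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,        # assumed: Alpaca split with a "text" column
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size of 8
        learning_rate=2e-4,
        max_steps=60,                   # illustrative short run
        fp16=True,                      # the T4 has no bfloat16 support
        optim="adamw_8bit",             # 8-bit optimizer states save VRAM
        output_dir="outputs",
    ),
)
trainer.train()
```

Gradient accumulation is what keeps the per-step VRAM low enough for a T4 while preserving a reasonable effective batch size.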
4. Export & Inference
The finetuned model can be exported to GGUF format for local inference with llama.cpp / Ollama, pushed to Hugging Face Hub, or used directly in the notebook for generation.
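The export options above map onto Unsloth's save helpers, sketched here. The output directory, Hub repo name, and token are placeholders:

```python
# GGUF export for llama.cpp / Ollama (q4_k_m is a common quantization choice).
model.save_pretrained_gguf("model_gguf", tokenizer, quantization_method="q4_k_m")

# Or merge the LoRA weights back into the base model and push to the Hub.
model.push_to_hub_merged(
    "your-username/llama-3.1-8b-finetuned",  # placeholder repo name
    tokenizer,
    save_method="merged_16bit",
    token="...",                              # your Hugging Face write token
)
```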
Performance
| Metric | Standard HF + FA2 | Unsloth |
|---|---|---|
| Training speed | 1x baseline | 2.1x faster |
| Peak VRAM usage | ~14 GB | ~5.5 GB |
| Min GPU (LoRA r=32) | 16 GB | 8 GB |
| Context length (8 GB) | 512 tokens | 2048 tokens |
| Colab T4 compatible | Barely | Comfortably |