Machine Learning • LLM Finetuning • Google Colab

Finetuning LLaMA 3.1 8B

2x faster finetuning with Unsloth + QLoRA on a free Colab GPU

Meta LLaMA 3.1 8B · Unsloth + Hugging Face · 2025
[Hero image: LLaMA 3.1 8B finetuning pipeline with training loss curve and code]

Overview

This project demonstrates finetuning Meta’s LLaMA 3.1 8B model using Unsloth — an open-source library that makes LLM finetuning 2x faster with 60% less VRAM compared to standard Hugging Face + Flash Attention 2 workflows. The entire pipeline runs on a free Google Colab T4 GPU, making state-of-the-art model customization accessible without any local hardware.

The primary goal here is to establish a working pipeline for hosting and showcasing interactive Google Colab notebooks directly within the site — proving that computational notebooks can live alongside traditional project write-ups as first-class portfolio content.

Notebook

The full finetuning notebook is embedded below via GitHub Gist. You can scroll through all the cells, inspect the code, and click “Open in Colab” to run it yourself with a free GPU runtime.

Llama_3_1_8b_Unsloth_2x_faster_finetuning.ipynb Open in Colab


Pipeline Overview

The finetuning pipeline follows a standard supervised finetuning (SFT) workflow, accelerated by Unsloth’s custom CUDA kernels and memory optimizations. The key stages:

1. Model Loading with 4-bit Quantization

LLaMA 3.1 8B is loaded in 4-bit quantized format using bitsandbytes, reducing the memory footprint from ~16GB to ~5GB. Unsloth patches the model architecture at load time to enable its fused attention kernels.
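The footprint numbers can be sanity-checked with simple arithmetic. A rough sketch, assuming the published ~8.03B parameter count and approximating NF4's per-block scale overhead as an extra ~0.5 bits per weight (that overhead figure is an assumption, not an exact value):

```python
# Back-of-envelope VRAM estimate for the LLaMA 3.1 8B weights alone.
# N_PARAMS is the published parameter count; runtime overhead (KV cache,
# activations, optimizer state) comes on top of these figures.
N_PARAMS = 8.03e9

def weight_gb(bits_per_param: float) -> float:
    """Size of the raw weights in GB for a given storage width."""
    return N_PARAMS * bits_per_param / 8 / 1e9

fp16 = weight_gb(16)   # standard half-precision load
nf4 = weight_gb(4.5)   # 4-bit NF4 weights plus ~0.5 bits of block scales (assumed)

print(f"fp16 weights: ~{fp16:.1f} GB")
print(f"4-bit weights: ~{nf4:.1f} GB")
```

The ~16 GB and ~5 GB figures quoted above fall out directly, with the remaining gap on the 4-bit side accounted for by non-quantized layers and runtime buffers.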

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length = 2048,
    load_in_4bit = True,
)
```

2. LoRA Adapter Configuration

LoRA adapters are applied to all linear projection layers (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj), the QLoRA recipe of low-rank adapters on top of the 4-bit quantized base. At r=16 this adds roughly 42M trainable parameters, about 0.5% of the 8B total, while still adapting every transformer block.
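The trainable-parameter fraction can be checked by counting the LoRA matrices directly. A sketch using the published Llama 3.1 8B dimensions (hidden size 4096, MLP intermediate size 14336, 32 layers, 8 KV heads giving a KV projection width of 1024):

```python
# Count LoRA trainable parameters for Llama 3.1 8B at rank r=16,
# targeting the seven linear layers listed above.
HIDDEN, INTER, LAYERS, KV_DIM, R = 4096, 14336, 32, 1024, 16

# (in_features, out_features) of each targeted linear layer
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTER),
    "up_proj": (HIDDEN, INTER),
    "down_proj": (INTER, HIDDEN),
}

# Each LoRA pair contributes A (r x in) plus B (out x r) parameters.
per_layer = sum(R * (fan_in + fan_out) for fan_in, fan_out in shapes.values())
total = per_layer * LAYERS
print(f"{total:,} trainable parameters")  # 41,943,040 ≈ 42M, ~0.5% of 8B
```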

```python
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    lora_alpha = 16,
    lora_dropout = 0,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)
```

3. Training with SFTTrainer

Training uses SFTTrainer from Hugging Face's TRL library with Unsloth's optimized backend. On a free Colab T4, a complete LoRA finetuning run on the Alpaca dataset takes roughly 30–45 minutes.
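SFTTrainer consumes plain-text training examples, so each Alpaca record (instruction / input / output fields) has to be rendered into a single prompt string first. A minimal formatter, sketched under the assumption of the standard Alpaca template and Llama 3.1's `<|end_of_text|>` EOS token (the exact template and EOS vary with the notebook's setup):

```python
def format_alpaca(example: dict, eos_token: str = "<|end_of_text|>") -> str:
    """Render one Alpaca record as a single training string.

    The EOS token is appended so the model learns where responses end.
    """
    if example.get("input"):
        body = (f"### Instruction:\n{example['instruction']}\n\n"
                f"### Input:\n{example['input']}\n\n"
                f"### Response:\n{example['output']}")
    else:
        body = (f"### Instruction:\n{example['instruction']}\n\n"
                f"### Response:\n{example['output']}")
    return body + eos_token

sample = {"instruction": "Name the capital of France.",
          "input": "",
          "output": "Paris."}
print(format_alpaca(sample))
```

A function like this can be mapped over the dataset (or passed as a formatting function) before handing it to the trainer.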

4. Export & Inference

The finetuned model can be exported to GGUF format for local inference with llama.cpp / Ollama, pushed to Hugging Face Hub, or used directly in the notebook for generation.
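For the Ollama route, registering the exported GGUF takes only a small Modelfile. A sketch, with a hypothetical filename and illustrative parameter values:

```
# Modelfile — register the finetuned GGUF with Ollama
# (filename and settings are placeholders; adjust to your export)
FROM ./llama-3.1-8b-finetuned.Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 2048
```

The model is then created with `ollama create my-llama -f Modelfile` and queried with `ollama run my-llama`.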

Performance

| Metric | Standard HF + FA2 | Unsloth |
| --- | --- | --- |
| Training speed | 1x baseline | 2.1x faster |
| Peak VRAM usage | ~14 GB | ~5.5 GB |
| Min GPU (LoRA r=32) | 16 GB | 8 GB |
| Context length (on 8 GB GPU) | 512 tokens | 2048 tokens |
| Colab T4 compatible | Barely | Comfortably |
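The headline "60% less VRAM" figure from the overview is consistent with the table; a quick check:

```python
# Cross-check the VRAM claim against the table's peak-usage row.
baseline_vram, unsloth_vram = 14.0, 5.5   # GB, from the table above
vram_saving = 1 - unsloth_vram / baseline_vram
print(f"VRAM saving: {vram_saving:.0%}")  # ~61%, matching the ~60% claim
```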