Finetuning LLaMA 3.1 8B
2x faster finetuning with Unsloth + QLoRA on a free Colab GPU
Overview
This project demonstrates finetuning Meta’s LLaMA 3.1 8B model using Unsloth — an open-source library that makes LLM finetuning 2x faster with 60% less VRAM compared to standard Hugging Face + Flash Attention 2 workflows. The entire pipeline runs on a free Google Colab T4 GPU, making state-of-the-art model customization accessible without any local hardware.
The primary goal here is to establish a working pipeline for hosting and showcasing interactive Google Colab notebooks directly within the site — proving that computational notebooks can live alongside traditional project write-ups as first-class portfolio content.
Notebook
The full finetuning notebook is embedded below via GitHub Gist. You can scroll through all the cells, inspect the code, and click “Open in Colab” to run it yourself with a free GPU runtime.
Resources
LLaMA 3.1 8B + Unsloth Finetuning Notebook
The primary Colab notebook — full finetuning pipeline with QLoRA, 4-bit quantization, and Hugging Face integration.
Unsloth — GitHub Repository
Open-source LLM finetuning library. 2x faster, 60% less memory. Supports LLaMA, Mistral, Gemma, Qwen, and more.
Unsloth Notebooks Collection
100+ finetuning tutorial notebooks on Google Colab and Kaggle covering LLaMA, Mistral, Phi, Gemma, and more.
Unsloth Blog — Finetune LLaMA 3.1
Official walkthrough and benchmarks for LLaMA 3.1 finetuning with Unsloth. Performance comparisons and configuration guide.
Pipeline Overview
The finetuning pipeline follows a standard supervised finetuning (SFT) workflow, accelerated by Unsloth’s custom CUDA kernels and memory optimizations. The key stages:
1. Model Loading with 4-bit Quantization
LLaMA 3.1 8B is loaded in 4-bit quantized format using bitsandbytes, reducing the memory footprint from ~16 GB to ~5 GB. Unsloth patches the model architecture at load time to enable its fused attention kernels.
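A minimal sketch of the loading step, following Unsloth's `FastLanguageModel` API. The pre-quantized repo name and sequence length here are illustrative choices, not necessarily the exact values used in the notebook:

```python
from unsloth import FastLanguageModel

# Load LLaMA 3.1 8B with 4-bit (NF4) quantization via bitsandbytes.
# Unsloth patches the architecture at load time with its fused kernels.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",  # pre-quantized weights (assumed repo)
    max_seq_length=2048,   # illustrative; pick to fit your VRAM budget
    dtype=None,            # auto-detect: float16 on T4, bfloat16 on Ampere+
    load_in_4bit=True,     # ~16 GB fp16 footprint drops to roughly ~5 GB
)
```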
2. LoRA Adapter Configuration
QLoRA adapters are applied to all linear projection layers (q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj). This adds only ~2% trainable parameters while still adapting every major projection in the network, so the base weights stay frozen in 4-bit.
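The adapter setup can be sketched with Unsloth's `get_peft_model` wrapper. The hyperparameters below (rank, alpha, seed) are common illustrative defaults, not necessarily the notebook's exact configuration:

```python
from unsloth import FastLanguageModel

# Attach LoRA adapters to every linear projection listed above;
# the quantized base weights remain frozen.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,                      # LoRA rank (illustrative)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,             # scaling factor, typically set equal to r
    lora_dropout=0,            # 0 enables Unsloth's fast path
    bias="none",
    use_gradient_checkpointing="unsloth",  # extra VRAM savings for long contexts
    random_state=3407,
)
```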
3. Training with SFTTrainer
Training uses the SFTTrainer from Hugging Face's TRL library with Unsloth's optimized backend. On a free Colab T4, a full finetuning run on the Alpaca dataset completes in roughly 30–45 minutes.
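The training step looks roughly like the sketch below. The batch size, learning rate, and step count are typical Colab T4 settings rather than the notebook's exact values, and `dataset` is assumed to be an Alpaca-style dataset already formatted into a `"text"` field:

```python
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,        # assumed: Alpaca split with a "text" column
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,  # effective batch size of 8
        learning_rate=2e-4,
        max_steps=60,                   # illustrative short run
        fp16=True,                      # the T4 has no bfloat16 support
        optim="adamw_8bit",             # 8-bit optimizer states save VRAM
        output_dir="outputs",
    ),
)
trainer.train()
```

Gradient accumulation is what keeps the per-step VRAM low enough for a T4 while preserving a reasonable effective batch size.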
4. Export & Inference
The finetuned model can be exported to GGUF format for local inference with llama.cpp / Ollama, pushed to Hugging Face Hub, or used directly in the notebook for generation.
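The export options above map onto Unsloth's save helpers, sketched here. The output directory, Hub repo name, and token are placeholders:

```python
# GGUF export for llama.cpp / Ollama (q4_k_m is a common quantization choice).
model.save_pretrained_gguf("model_gguf", tokenizer, quantization_method="q4_k_m")

# Or merge the LoRA weights back into the base model and push to the Hub.
model.push_to_hub_merged(
    "your-username/llama-3.1-8b-finetuned",  # placeholder repo name
    tokenizer,
    save_method="merged_16bit",
    token="...",                              # your Hugging Face write token
)
```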
Performance
| Metric | Standard HF + FA2 | Unsloth |
|---|---|---|
| Training speed | 1x baseline | 2.1x faster |
| Peak VRAM usage | ~14 GB | ~5.5 GB |
| Min GPU (LoRA r=32) | 16 GB | 8 GB |
| Context length (8 GB) | 512 tokens | 2048 tokens |
| Colab T4 compatible | Barely | Comfortably |