This tutorial shows how to instruction-finetune a large language model (LLM) on Snellius using Unsloth, and serves as supplementary material to the LLM finetuning code for Snellius.
Here, we focus on using the resources efficiently and effectively. To this end, we adopt Unsloth's QLoRA implementation. Using QLoRA significantly reduces the memory required for finetuning without a notable loss of precision (Dettmers et al., 2023). The reduced memory consumption also allows us to finetune larger models or increase the batch size!
Introduction to QLoRA
Quantized Low-Rank Adaptation, or QLoRA, keeps the initial weights of the model frozen and instead learns a low-rank decomposition of the weight updates that is applied on top of the static foundation model. This way, we do not have to store the memory-intensive gradient and optimizer states of the original LLM. The size of the learned matrices is controlled by the hyperparameter r, which specifies their rank: a higher r means larger matrices. The original LoRA and QLoRA papers experimented with different values of r and found that its value does not significantly impact performance. Typically, r is set to 16, which results in trainable matrices amounting to only about 0.5% of the original parameters! QLoRA extends this idea by quantizing the original LLM into 4-bit format using NormalFloat4 (NF4), further reducing the memory footprint of the stored LLM. Keep in mind that the LLM still needs to be loaded onto the GPU to apply the adapter matrices and compute the loss!
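To make the ~0.5% figure concrete, here is a back-of-the-envelope sketch of the LoRA parameter count. The dimensions are illustrative round numbers for a single square attention projection, not exact Llama 3.1 values.

```python
# Back-of-the-envelope LoRA parameter count for one frozen d_out x d_in
# weight matrix W: LoRA trains two low-rank factors B (d_out x r) and
# A (r x d_in) instead of updating W itself.
d_in, d_out, r = 4096, 4096, 16  # illustrative: a square attention projection

full_params = d_in * d_out        # parameters of the frozen matrix
lora_params = r * (d_in + d_out)  # parameters of the trainable factors

print(f"frozen: {full_params:,} | LoRA: {lora_params:,} | "
      f"ratio: {lora_params / full_params:.2%}")
# -> ratio ~0.78% for this single projection; summed over all targeted
#    modules of an 8B model, the trainable share lands around 0.5%.
```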
Unsloth further optimizes the QLoRA instruction-finetuning pipeline with some smart tricks, making it a competitive framework.
Did you know: although LoRA was originally proposed for large language models, it gained much of its popularity through diffusion models for image generation like Stable Diffusion.
Code
The code, including a job script to finetune on Snellius, can be found here: link
Estimating the GPU memory for your finetuning task: link
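As a rough rule of thumb (a sketch, not a substitute for the estimator linked above): 4-bit weights take about half a byte per parameter, so the weights of an 8B model occupy roughly 4 GiB, before activations, adapter gradients and optimizer states are added.

```python
# Rule-of-thumb weight memory for a 4-bit quantized model (illustrative,
# not a substitute for the estimator linked above).
def weights_gib_4bit(n_params: float) -> float:
    return n_params * 0.5 / 1024**3  # 4 bits = 0.5 bytes per parameter

print(f"Llama 3.1 8B weights in 4-bit: ~{weights_gib_4bit(8e9):.1f} GiB")
```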
Experimental setup
| Model | Dataset | Context length | Hardware | Library | QLoRA | Batch size | Optimizer | # Samples | Training time |
|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B 4-bit | Open-Orca/SlimOrca | 8192 | 1 H100 GPU (94 GiB) | Unsloth | Yes (r=16) | 8 train / 8 eval | AdamW 8-bit | 518K | 8.5 hours |
Walkthrough
This walkthrough is supplementary material to the SURF-made codebase found here
Import necessities
As always, start off by defining the imports. Make sure you have followed the installation instructions from here.
All libraries are dependencies of Unsloth and will be installed for you when you install Unsloth!
The environment variable HF_HOME is also set, which determines where the model, tokenizer and dataset are downloaded to. Note that it must be set before the HuggingFace libraries are imported.
```python
import os

# Set HF_HOME before importing the HuggingFace libraries so the cache
# location is picked up
os.environ["HF_HOME"] = f"/scratch-shared/{os.environ['USER']}/hf-cache-dir"

# Import unsloth before trl so Unsloth can patch the trainer with its
# optimizations
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

from trl import SFTConfig, SFTTrainer
from datasets import load_dataset
```
Setting up the finetune config
Next up is defining the supervised finetuning config. These are the necessary parameters we need to pass to the trainer.
The output directory will be used to store the intermediate and final checkpoints of the adapter model, as well as some log information.
The maximum sequence length (max_seq_length) determines how much information the LLM can take in within a single pass. A sequence (or context) length of 8192 means the model can use its window span of 8192 tokens (roughly 6,000 words) to retrieve relevant information and return an answer.
```python
max_seq_length = 8192

training_args = SFTConfig(
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    output_dir=f"/scratch-shared/{os.environ['USER']}/qlora_finetune",
)
```
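The snippet above only sets the required fields. To reproduce the experimental setup table, the batch sizes and the 8-bit optimizer can be passed through the same config; these extra arguments are standard TrainingArguments fields, but they are not part of the original snippet:

```python
# Optional: match the experimental setup table (assumed values, not part
# of the original snippet)
training_args = SFTConfig(
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    output_dir=f"/scratch-shared/{os.environ['USER']}/qlora_finetune",
    per_device_train_batch_size=8,  # "8 train" from the setup table
    per_device_eval_batch_size=8,   # "8 eval" from the setup table
    optim="adamw_8bit",             # AdamW 8-bit from the setup table
)
```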
Loading the model and tokenizer
Let's start by loading the model and tokenizer. As we use QLoRA, we want to load the model in 4-bit precision. Here, we load a pre-quantized 4-bit model by Unsloth to reduce the quantization time. Alternatively, you can load a different model (see here for all supported Unsloth models), even one that is not quantized, like the official Meta-Llama-3.1 from HuggingFace. Keep in mind that some of these models are 'gated' models and can only be downloaded after accepting the terms and conditions. Please follow this tutorial if you encounter a 'gated' model or dataset.
```python
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=max_seq_length,
    device_map="auto",
    dtype=None,  # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
    load_in_4bit=True,
)
```
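After loading, you can check how much GPU memory the 4-bit model actually occupies. A minimal check with plain PyTorch (numbers will vary with the model and CUDA overhead):

```python
import torch

# How much GPU memory do the 4-bit weights occupy after loading?
print(f"Allocated: {torch.cuda.memory_allocated() / 1024**3:.1f} GiB")
```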
Loading the dataset
As a dataset, we opted for Open-Orca/SlimOrca due to its standard formatting, which makes it easy to integrate into these frameworks, and its applicability for targeted instruction learning.
Because foundation models come with no explicit chat template, we choose the "chatml" template, which is one of the de facto standards nowadays. If you are finetuning from an existing instruction-finetuned LLM, set the from_foundation_model flag to False and keep the tokenizer's native chat template.
Lastly, a mapping is performed to convert the nested dictionary samples into a list of formatted strings with their corresponding chat template. An example to illustrate:
{ "from": "system", "value": "You are an AI assistant. Provide a detailed answer so user don’t need to search outside to understand the answer.", "weight": null }, { "from": "human", "value": "Please answer the following question: - They are buried under layers of soil - Pressure builds over time - The remains liquefy - The carbon atoms rearrange to become a new substance. What might be the first step of the process?\nA:", "weight": 0 }, { "from": "gpt", "value": "A: The first step of the process is \"They are buried under layers of soil.\" This occurs when the remains of plants, animals, or other organic material become covered by soil and other sediments. Over time, as more and more layers accumulate, the pressure and heat increase, eventually leading to the transformation of the remains into substances like coal, oil, or natural gas.", "weight": 1 } ]
is converted to a single string of:
system: You are an AI assistant. Provide a detailed answer so user don’t need to search outside to understand the answer.
user: Please answer the following question: - They are buried under layers of soil - Pressure builds over time - The remains liquefy - The carbon atoms rearrange to become a new substance. What might be the first step of the process?
assistant", "value": "A: The first step of the process is \"They are buried under layers of soil.\" This occurs when the remains of plants, animals, or other organic material become covered by soil and other sediments. Over time, as more and more layers accumulate, the pressure and heat increase, eventually leading to the transformation of the remains into substances like coal, oil, or natural gas.
Keep in mind that this step in particular will likely need to change for your dataset!
```python
dataset = load_dataset("Open-Orca/SlimOrca", split="train")

# Define our own template if finetuning from a foundation model. If continuing
# from an instruct finetune, use the native tokenizer and chat template instead.
from_foundation_model = True

if from_foundation_model:
    tokenizer = get_chat_template(
        tokenizer,
        mapping={
            "role": "from",
            "content": "value",
            "user": "human",
            "assistant": "gpt",
        },
        chat_template="chatml",
        map_eos_token=True,
    )

# Dataset-specific function to convert the samples (dictionaries) to strings
# with the corresponding chat template
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(
            convo, tokenize=False, add_generation_prompt=False
        )
        for convo in convos
    ]
    return {"text": texts}

# Convert the sample dictionaries to formatted strings
dataset = dataset.map(
    formatting_prompts_func, batched=True, num_proc=os.cpu_count() // 2
)
```
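Before training, it is worth printing one formatted sample to verify that the chat template was applied and that the tokenized length fits within max_seq_length; a quick sanity check:

```python
# Sanity check: inspect one formatted sample and its token count
sample = dataset[0]["text"]
print(sample[:500])  # should show chatml markers such as <|im_start|>

n_tokens = len(tokenizer(sample)["input_ids"])
print(f"Tokens in this sample: {n_tokens} (limit: {max_seq_length})")
```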
QLoRA magic
Below, QLoRA is applied to the original model, and the QLoRA hyperparameters are set. Generally, these settings already work well. If you want to know more, we recommend reading this blog post.
```python
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # rank of the adapter matrices. Higher r means more trainable parameters
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,  # scaling of the adapter weights
    lora_dropout=0,  # dropout = 0 is currently optimized
    bias="none",  # bias = "none" is currently optimized
    use_gradient_checkpointing="unsloth",
    max_seq_length=max_seq_length,
    random_state=47,
)
```
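Since get_peft_model returns a standard PEFT model, you can verify the small trainable fraction discussed in the introduction. The printed numbers below are approximate expectations for Llama 3.1 8B with r=16, not guaranteed output:

```python
# Report how many parameters are actually trainable; with r=16 on the seven
# target modules above, this should be well below 1% of the full model.
model.print_trainable_parameters()
# e.g. trainable params: ~42M || all params: ~8B || trainable%: ~0.52
```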
Start finetuning!
Lastly, initialize the trainer and start training. It's that easy!
After training, the learned weight matrices, i.e. the adapter, are saved to the output directory in your /scratch-shared/ folder, together with some log information. The adapter can then be merged with the original foundation model to obtain the final model.
```python
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)

trainer_stats = trainer.train()
print(trainer_stats)
```
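After training, you can run a quick qualitative check before writing a separate inference script. This is a minimal sketch: the prompt is illustrative, and we assume the remapped tokenizer accepts the dataset-style "from"/"value" keys set up earlier.

```python
# Quick qualitative check of the finetuned adapter (illustrative prompt)
FastLanguageModel.for_inference(model)  # switch Unsloth to inference mode

messages = [{"from": "human", "value": "What might be the first step of coal formation?"}]
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

output_ids = model.generate(input_ids=input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```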
Running on SLURM
See here for the complete runnable codebase, including the job script.
Next up/todo
- Inference script
- Merge to 16-bit precision
Source and further reading
- LoRA paper: https://arxiv.org/abs/2106.09685
- QLoRA paper: https://arxiv.org/abs/2305.14314
- SFTTrainer documentation: https://huggingface.co/docs/trl/main/en/sft_trainer
- Excellent LLM notebooks: https://github.com/mlabonne/llm-course