This tutorial shows how to instruction-finetune a large language model (LLM) on Snellius using Unsloth, and serves as supplementary material to the LLM finetuning code for Snellius.
Here, we focus on using the resources efficiently and effectively. To this end, we adopt Unsloth's QLoRA implementation. Using QLoRA significantly reduces the memory required for finetuning without a notable loss of precision (Dettmers et al., 2023). The reduced memory consumption also allows us to finetune larger models or increase the batch size!
Introduction to QLoRA
Quantized Low-Rank Adaptation, or QLoRA, keeps the initial weights of the model frozen and instead learns a low-rank decomposition of the weight updates that is applied on top of the static foundation model. This way, we do not have to store the memory-intensive gradient and optimizer states of the original LLM. The size of the learned matrices is controlled by the hyperparameter r, which specifies their rank: a higher r means larger matrices. The original LoRA and QLoRA papers experimented with different values of r and found that its value does not significantly impact performance. Typically, r is set to 16, which results in trainable matrices amounting to only about 0.5% of the original parameters! QLoRA extends this idea by quantizing the original LLM into 4-bit format using NormalFloat4 (NF4), further reducing the memory footprint of the stored LLM. Keep in mind that the LLM still needs to be loaded onto the GPU to apply the adapter matrices and compute the loss!
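To make the ~0.5% figure concrete, here is a back-of-the-envelope sketch of the LoRA parameter count. The dimensions are illustrative round numbers for a single square attention projection, not exact Llama 3.1 values.

```python
# Back-of-the-envelope LoRA parameter count for one frozen d_out x d_in
# weight matrix W: LoRA trains two low-rank factors B (d_out x r) and
# A (r x d_in) instead of updating W itself.
d_in, d_out, r = 4096, 4096, 16  # illustrative: a square attention projection

full_params = d_in * d_out        # parameters of the frozen matrix
lora_params = r * (d_in + d_out)  # parameters of the trainable factors

print(f"frozen: {full_params:,} | LoRA: {lora_params:,} | "
      f"ratio: {lora_params / full_params:.2%}")
# -> ratio ~0.78% for this single projection; summed over all targeted
#    modules of an 8B model, the trainable share lands around 0.5%.
```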
Unsloth further optimizes the QLoRA instruction-finetuning pipeline with some smart tricks, making it a competitive framework.
Did you know: although LoRA was originally proposed for large language models, it gained much of its popularity through diffusion models for image generation like Stable Diffusion.
Code
The code, including a job script to finetune on Snellius, can be found here: link
Estimating the GPU memory for your finetuning task: link
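As a rough rule of thumb (a sketch, not a substitute for the estimator linked above): 4-bit weights take about half a byte per parameter, so the weights of an 8B model occupy roughly 4 GiB, before activations, adapter gradients and optimizer states are added.

```python
# Rule-of-thumb weight memory for a 4-bit quantized model (illustrative,
# not a substitute for the estimator linked above).
def weights_gib_4bit(n_params: float) -> float:
    return n_params * 0.5 / 1024**3  # 4 bits = 0.5 bytes per parameter

print(f"Llama 3.1 8B weights in 4-bit: ~{weights_gib_4bit(8e9):.1f} GiB")
```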
Experimental setup
| Model | Dataset | Context length | Hardware | Library | QLoRA | Batch size | Optimizer | # Samples | Training time |
|---|---|---|---|---|---|---|---|---|---|
| Llama 3.1 8B 4-bit | Open-Orca/SlimOrca | 8192 | 1 H100 GPU (94 GiB) | Unsloth | Yes (r=16) | 8 train / 8 eval | AdamW 8-bit | 518K | 8.5 hours |
Walkthrough
This walkthrough is supplementary material to the SURF-made codebase found here
Import necessities
As always, start off by defining the imports. Make sure you have followed the installation instructions from here.
All libraries are dependencies of Unsloth and will be installed for you when you install Unsloth!
The environment variable HF_HOME is also set, which determines where the model, tokenizer and dataset are downloaded to. Note that it must be set before the HuggingFace libraries are imported.
```python
import os

# Set HF_HOME before importing the HuggingFace libraries so the cache
# location is picked up
os.environ["HF_HOME"] = f"/scratch-shared/{os.environ['USER']}/hf-cache-dir"

# Import unsloth before trl so Unsloth can patch the trainer with its
# optimizations
from unsloth import FastLanguageModel
from unsloth.chat_templates import get_chat_template

from trl import SFTConfig, SFTTrainer
from datasets import load_dataset
```
Setting up the finetune config
Next up is defining the supervised finetuning config. These are the necessary parameters we need to pass to the trainer.
The output directory will be used to store the intermediate and final checkpoints of the adapter model, as well as some log information.
The maximum sequence length (max_seq_length) determines how much information the LLM can take in within a single pass. A sequence (or context) length of 8192 means the model can use its window span of 8192 tokens (roughly 6,000 words) to retrieve relevant information and return an answer.
```python
max_seq_length = 8192

training_args = SFTConfig(
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    output_dir=f"/scratch-shared/{os.environ['USER']}/qlora_finetune",
)
```
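The snippet above only sets the required fields. To reproduce the experimental setup table, the batch sizes and the 8-bit optimizer can be passed through the same config; these extra arguments are standard TrainingArguments fields, but they are not part of the original snippet:

```python
# Optional: match the experimental setup table (assumed values, not part
# of the original snippet)
training_args = SFTConfig(
    dataset_text_field="text",
    max_seq_length=max_seq_length,
    output_dir=f"/scratch-shared/{os.environ['USER']}/qlora_finetune",
    per_device_train_batch_size=8,  # "8 train" from the setup table
    per_device_eval_batch_size=8,   # "8 eval" from the setup table
    optim="adamw_8bit",             # AdamW 8-bit from the setup table
)
```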
Loading the model and tokenizer
Let's start by loading the model and tokenizer. As we use QLoRA, we want to load the model in 4-bit precision. Here, we load a pre-quantized 4-bit model by Unsloth to reduce the quantization time. Alternatively, you can load a different model (see here for all supported Unsloth models), even one that is not quantized, like the official Meta-Llama-3.1 from HuggingFace. Keep in mind that some of these models are 'gated' models and can only be downloaded after accepting the terms and conditions. Please follow this tutorial if you encounter a 'gated' model or dataset.
```python
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    max_seq_length=max_seq_length,
    device_map="auto",
    dtype=None,  # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
    load_in_4bit=True,
)
```
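After loading, you can check how much GPU memory the 4-bit model actually occupies. A minimal check with plain PyTorch (numbers will vary with the model and CUDA overhead):

```python
import torch

# How much GPU memory do the 4-bit weights occupy after loading?
print(f"Allocated: {torch.cuda.memory_allocated() / 1024**3:.1f} GiB")
```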
Loading the dataset
As a dataset, we opted for Open-Orca/SlimOrca due to its standard formatting, which makes it easy to integrate into these frameworks, and its applicability for targeted instruction learning.
Because foundation models come with no explicit chat template, we choose the "chatml" template, which is one of the de facto standards nowadays. If you are finetuning from an existing instruction-finetuned LLM, set the from_foundation_model flag to False and keep the tokenizer's native chat template.
Lastly, a mapping is performed to convert the nested dictionary samples into a list of formatted strings with their corresponding chat template. An example to illustrate:
{ "from": "system", "value": "You are an AI assistant. Provide a detailed answer so user don’t need to search outside to understand the answer.", "weight": null }, { "from": "human", "value": "Please answer the following question: - They are buried under layers of soil - Pressure builds over time - The remains liquefy - The carbon atoms rearrange to become a new substance. What might be the first step of the process?\nA:", "weight": 0 }, { "from": "gpt", "value": "A: The first step of the process is \"They are buried under layers of soil.\" This occurs when the remains of plants, animals, or other organic material become covered by soil and other sediments. Over time, as more and more layers accumulate, the pressure and heat increase, eventually leading to the transformation of the remains into substances like coal, oil, or natural gas.", "weight": 1 } ]
is converted to a single string of:
system: You are an AI assistant. Provide a detailed answer so user don’t need to search outside to understand the answer.
user: Please answer the following question: - They are buried under layers of soil - Pressure builds over time - The remains liquefy - The carbon atoms rearrange to become a new substance. What might be the first step of the process?
assistant", "value": "A: The first step of the process is \"They are buried under layers of soil.\" This occurs when the remains of plants, animals, or other organic material become covered by soil and other sediments. Over time, as more and more layers accumulate, the pressure and heat increase, eventually leading to the transformation of the remains into substances like coal, oil, or natural gas.
Keep in mind that this step in particular will likely need to change for your dataset!
```python
dataset = load_dataset("Open-Orca/SlimOrca", split="train")

# Define our own template if finetuning from a foundation model. If continuing
# from an instruct finetune, use the native tokenizer and chat template instead.
from_foundation_model = True

if from_foundation_model:
    tokenizer = get_chat_template(
        tokenizer,
        mapping={
            "role": "from",
            "content": "value",
            "user": "human",
            "assistant": "gpt",
        },
        chat_template="chatml",
        map_eos_token=True,
    )

# Dataset-specific function to convert the samples (dictionaries) to strings
# with the corresponding chat template
def formatting_prompts_func(examples):
    convos = examples["conversations"]
    texts = [
        tokenizer.apply_chat_template(
            convo, tokenize=False, add_generation_prompt=False
        )
        for convo in convos
    ]
    return {"text": texts}

# Convert the sample dictionaries to formatted strings
dataset = dataset.map(
    formatting_prompts_func, batched=True, num_proc=os.cpu_count() // 2
)
```
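Before training, it is worth printing one formatted sample to verify that the chat template was applied and that the tokenized length fits within max_seq_length; a quick sanity check:

```python
# Sanity check: inspect one formatted sample and its token count
sample = dataset[0]["text"]
print(sample[:500])  # should show chatml markers such as <|im_start|>

n_tokens = len(tokenizer(sample)["input_ids"])
print(f"Tokens in this sample: {n_tokens} (limit: {max_seq_length})")
```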
QLoRA magic
Below, QLoRA is applied to the original model, and the QLoRA hyperparameters are set. Generally, these settings already work well. If you want to know more, we recommend reading this blog post.
```python
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # rank of the adapter matrices. Higher r means more trainable parameters
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
    lora_alpha=16,  # scaling of the adapter weights
    lora_dropout=0,  # dropout = 0 is currently optimized
    bias="none",  # bias = "none" is currently optimized
    use_gradient_checkpointing="unsloth",
    max_seq_length=max_seq_length,
    random_state=47,
)
```
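Since get_peft_model returns a standard PEFT model, you can verify the small trainable fraction discussed in the introduction. The printed numbers below are approximate expectations for Llama 3.1 8B with r=16, not guaranteed output:

```python
# Report how many parameters are actually trainable; with r=16 on the seven
# target modules above, this should be well below 1% of the full model.
model.print_trainable_parameters()
# e.g. trainable params: ~42M || all params: ~8B || trainable%: ~0.52
```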
Start finetuning!
Lastly, initialize the trainer and start training. It's that easy!
After training, the learned weight matrices, i.e. the adapter, are saved to the output directory in your /scratch-shared/ folder, together with some log information. The adapter can then be merged with the original foundation model to obtain the final model.
```python
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)

trainer_stats = trainer.train()
print(trainer_stats)
```
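After training, you can run a quick qualitative check before writing a separate inference script. This is a minimal sketch: the prompt is illustrative, and we assume the remapped tokenizer accepts the dataset-style "from"/"value" keys set up earlier.

```python
# Quick qualitative check of the finetuned adapter (illustrative prompt)
FastLanguageModel.for_inference(model)  # switch Unsloth to inference mode

messages = [{"from": "human", "value": "What might be the first step of coal formation?"}]
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to("cuda")

output_ids = model.generate(input_ids=input_ids, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```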
Running on SLURM
See here for the complete runnable codebase, including the job script.
Next up/todo
- Inference script
- Merge to 16-bit precision
Source and further reading
- LoRA paper: https://arxiv.org/abs/2106.09685
- QLoRA paper: https://arxiv.org/abs/2305.14314
- SFTTrainer documentation: https://huggingface.co/docs/trl/main/en/sft_trainer
- Excellent LLM notebooks: https://github.com/mlabonne/llm-course