Fine-Tuning an LLM Using LoRA
Around a month ago, Apple released a report called “Introducing Apple’s On-Device and Server Foundation Models”, which talks about how you can use “adaptors” to fine-tune a foundation large language model, e.g. GPT-3 or Llama 2. In this post, drawing on that Apple research article, we will look at how we can use an external dataset to fine-tune one of these models using a technique called LoRA (Low-Rank Adaptation).
Why would we want to do this?
For a start, LLMs have billions of parameters, and as such can be quite large, so having a separate “specialised” LLM for each task that we carry out is not a viable option moving forward. LoRA is extremely efficient and makes fine-tuning a lot faster. It generates small “adaptors” that you can “plug and play” with an LLM to get that model to solve highly specific tasks.
In the article there is a section called “Model Adaptation”, in which Apple gives us a breakdown of this technique. To paraphrase: we create adaptors (small neural networks) that can be plugged into various layers of the foundation model and fine-tuned for specific tasks. These adaptors can then be swapped in dynamically, on the fly, to suit the task the user is currently engaged in.
Think of it like this: it’s a bit like having an iPod. If you don’t want other people to hear your music, you plug in headphones; if you’re at a party and want everyone to enjoy the music, you plug in some speakers. You don’t need a separate iPod for each of these tasks.
This means we can have one foundation model and a number of these adaptors, each specialising in a specific task like email summarisation or query handling. In this post, we’re going to attempt to understand what these adaptors are and how they work.
What does this look like?
Once we have generated our specialised adaptors, we can then load them in together with the foundation model to dynamically transform the model’s capabilities.
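As a concrete (hedged) illustration of what “loading an adaptor” looks like in practice, here is a minimal sketch using Hugging Face’s PEFT library. The base model identifier is real, but the adaptor path is a hypothetical placeholder for an adaptor you have trained yourself.

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the foundation model once (weights stay as they are)
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Plug in a task-specific LoRA adaptor (path is hypothetical)
summariser = PeftModel.from_pretrained(base, "adaptors/email-summarisation")
```

The key point is that the second step is cheap: the adaptor is a small set of extra weights, so switching tasks means swapping adaptors rather than reloading a multi-gigabyte model.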
What do the nuts and bolts look like (i.e. the maths)?
I’ve written a few posts on matrices in the past: Introduction to Matrices, Matrix Operations and Application of Matrices if you need a bit of catching up at this point.
Below we have three matrices: purple represents the original weights of the foundation model, green represents our “adaptor” weights (the update we use to adjust the foundation model’s weights), and blue is the specialised (fine-tuned) model’s weights.
Numbers used above are hypothetical and are for demonstration purposes.
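In code, that picture is just an element-wise matrix addition. A minimal sketch with made-up numbers, matching the 5×5 example above:

```python
import numpy as np

# Purple: original 5x5 foundation-model weights (hypothetical values)
W = np.arange(25, dtype=float).reshape(5, 5)

# Green: the adaptor's weight update (hypothetical values)
delta_W = np.full((5, 5), 0.1)

# Blue: the fine-tuned weights are simply the sum of the two
W_fine_tuned = W + delta_W
```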
So that’s a rough idea of how we are going to get this to work. The only problem is, the new “fine-tuned” model is the same size as the original. This is no good to us as we’re looking at building small “adaptors” that influence the behaviour of the original foundation model.
That’s why LoRA is so freaking awesome: we don’t need to fine-tune the entire matrix. Rather, we can get away with fine-tuning two matrices of lower rank. When we multiply these together we get the weight update, and applying that update to the foundation model modifies its capabilities.
This is where the maths kicks in. It turns out we can use matrix decomposition to represent the “adaptor” weights (shown in the same colour below) as two matrices of lesser rank whose product gives us the same update.
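Here is the same idea as a sketch, assuming a rank-1 decomposition of the 5×5 update from earlier, so only 10 values need to be trained instead of 25. The values are hypothetical.

```python
import numpy as np

rank = 1  # the LoRA rank r

# Adaptor stored as two low-rank matrices: 5 + 5 = 10 trainable values
A = np.full((5, rank), 0.5)   # 5 x 1, hypothetical values
B = np.full((rank, 5), 0.2)   # 1 x 5, hypothetical values

# Their product reconstructs a full 5x5 weight update (25 values)
delta_W = A @ B
assert delta_W.shape == (5, 5)
```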
We’ve just turned 25 values (representing weights in our original model) into 10 values. This may not seem like much of a saving, but when we look at how this scales we can see the benefits.
| Number of Parameters | Original Matrix Dimensions | LoRA Parameters (rank 1) | Savings |
|---|---|---|---|
| 25 | 5×5 | 10 | 60% |
| 1M | 1,000×1,000 | 2,000 | 99.8% |
| 2B | ~44k×44k | ~89k | 99.995% |
| 7B | ~83k×83k | ~167k | 99.997% |
| 13B | ~114k×114k | ~228k | 99.998% |
In the first row of the table above, the “saving” for a simple 25-parameter matrix is 60%. But when we start talking about 13,000,000,000 parameters, we can see that this scales really well: 13,000,000,000 parameters can be reduced to ~228,000, a “saving” of ~99.998% in size. This means that when we’re fine-tuning the model, we can do so really quickly (and easily), because we’re only adjusting a small fraction of the parameters.
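If you want to check the table’s arithmetic yourself, here is a small sketch, assuming rank-1 adaptors applied to a single square weight matrix (the same simplification the table makes):

```python
def lora_savings(n: int, rank: int = 1) -> float:
    """Percentage saved by replacing an n x n weight matrix
    with two low-rank matrices of shapes (n, rank) and (rank, n)."""
    full_params = n * n
    lora_params = 2 * n * rank
    return 100 * (1 - lora_params / full_params)

print(lora_savings(5))        # 60.0
print(lora_savings(1_000))    # 99.8
print(lora_savings(114_018))  # ~99.998 (the 13B row)
```

In a real model the update is spread across many weight matrices and the rank is usually a bit higher (8 or 16 are common choices), but the same back-of-the-envelope maths applies.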