Low-rank adaptation (LoRA)
Goal: Reduce the cost of fine-tuning large pre-trained language models by freezing the pre-trained weights and injecting trainable rank decomposition matrices into the Transformer layers.
Contribution: Compared to fine-tuning GPT-3 175B with Adam, reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times, while introducing no additional inference latency.
Concept
- Hypothesize that the change in weights during model adaptation has a low intrinsic rank.
- For many deep learning tasks with a heavily over-parametrized neural network, the learned network enjoys low-rank properties after training (Oymak et al., 2019).
- For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, constrain its update by defining the merged weight matrix as $W = W_0 + \Delta W = W_0 + BA$, where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, and rank $r \ll \min(d, k)$. $A$ is initialized using a Gaussian distribution, while $B$ is initialized with zero, such that $\Delta W = BA = 0$ at the beginning of fine-tuning. (A minimal code sketch is given below the figure.)
- This study is limited to only adapting the attention weights ($W_q$, $W_k$, $W_v$, $W_o$). Adapting other modules such as the MLP, LayerNorm, and biases is left for future work.
Architecture of LoRA, from Figure 1 of Hu et al., 2022.
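Below is a minimal PyTorch sketch of this update rule (not the official implementation; the class name `LoRALinear` and the initialization constants are illustrative, and the paper's $\alpha / r$ scaling is exposed as `lora_alpha / r`):

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained weight W0 plus a trainable low-rank update BA."""

    def __init__(self, d_in, d_out, r=4, lora_alpha=1.0):
        super().__init__()
        # Pre-trained weight W0, frozen during fine-tuning (random stand-in here).
        self.weight = nn.Parameter(torch.empty(d_out, d_in), requires_grad=False)
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        # Low-rank factors: A ~ Gaussian, B = 0, so BA = 0 at initialization.
        self.lora_A = nn.Parameter(torch.randn(r, d_in) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(d_out, r))
        self.scaling = lora_alpha / r

    def forward(self, x):
        # h = W0 x + (alpha / r) * B A x
        return x @ self.weight.T + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(768, 768, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # only A and B are trained: 2 * 8 * 768 parameters
```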
Advantages
- A Generalization of Full Fine-tuning. Recovers the expressiveness of full fine-tuning by setting the rank $r$ to the rank of the pre-trained weight matrices.
- No Additional Inference Latency. Pre-compute $W = W_0 + BA$ before inference. When switching to another downstream task, recover $W_0$ by subtracting $BA$, then add a different $B'A'$ (see the sketch below).
- Lower Hardware Requirements. Reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times.
Limitations remain, such as the difficulty of batching inputs from different downstream tasks in a single forward pass once $BA$ is merged into $W$.
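A short sketch of the merge-and-switch trick behind the no-latency and task-switching claims (the adapter tensors below are random stand-ins for trained factors):

```python
import torch

d, k, r = 768, 768, 8
W0 = torch.randn(d, k)                         # frozen pre-trained weight
B1, A1 = torch.randn(d, r), torch.randn(r, k)  # adapters trained for task 1
B2, A2 = torch.randn(d, r), torch.randn(r, k)  # adapters trained for task 2

# Deploy task 1: merge once, so inference is a single matmul with no extra latency.
W = W0 + B1 @ A1

# Switch to task 2: subtract the old low-rank update and add the new one.
W = W - B1 @ A1 + B2 @ A2
assert torch.allclose(W, W0 + B2 @ A2, atol=1e-4)
```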
Related Works
Some drawbacks of other fine-tuning methods:
- Adapter Layers. Keep the parameters of the pre-trained model frozen. Add two adapter layers per Transformer block, and only update these adapter layers during fine-tuning (a rough sketch follows the figure below).
  - No direct way to bypass the extra compute in the adapter layers.
  - The small bottleneck dimension in the adapter layers limits hardware parallelism, increasing latency.
Architecture of the adapter module, from Figure 2 of Houlsby et al., 2019.
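A rough sketch of the adapter design described above, assuming a bottleneck architecture with a residual connection (the module name and bottleneck size are illustrative, not Houlsby et al.'s exact code):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter inserted after a (frozen) Transformer sub-layer."""

    def __init__(self, d_model=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # project down to a small bottleneck
        self.act = nn.ReLU()
        self.up = nn.Linear(bottleneck, d_model)    # project back up

    def forward(self, h):
        # Residual connection keeps the adapter close to identity early in training.
        return h + self.up(self.act(self.down(h)))

h = torch.randn(2, 16, 768)   # (batch, seq_len, d_model)
out = Adapter()(h)            # extra sequential compute on every forward pass
print(out.shape)
```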
- Prefix Tuning. Keep the parameters of the pre-trained model frozen. Prepend a sequence of continuous task-specific vectors (i.e., a prefix, or soft prompt) to the input sequence, and only update these vectors during fine-tuning (a simplified sketch follows the figure below).
  - Difficult to optimize; its performance changes non-monotonically with the number of trainable parameters.
  - Reduces the sequence length available for downstream tasks.
Prefix-Tuning, from Figure 1 of Li et al., 2021.
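A simplified sketch of the soft-prompt idea at the embedding layer only (full prefix tuning prepends trainable key/value prefixes at every attention layer; names and sizes here are illustrative):

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Trainable prefix vectors prepended to the (frozen) token embeddings."""

    def __init__(self, prefix_len=20, d_model=768):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)

    def forward(self, token_embeds):
        # token_embeds: (batch, seq_len, d_model). The output grows by prefix_len,
        # which is exactly what shrinks the sequence budget for the downstream task.
        batch = token_embeds.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, token_embeds], dim=1)

embeds = torch.randn(2, 100, 768)
print(SoftPrompt()(embeds).shape)  # torch.Size([2, 120, 768])
```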
Experiments
Benchmark
Inference Latency, from Table 1 of Hu et al., 2022.
Performance, from Table 4 of Hu et al., 2022.
Analysis
- Which weight matrices in the Transformer should we apply LoRA to?
  Applying LoRA to different types of attention weights, from Table 5 of Hu et al., 2022.
  Apply LoRA to both $W_q$ and $W_v$, which gives the best performance under a fixed parameter budget.
- What is the optimal rank $r$ for LoRA?
  Applying LoRA with different ranks, from Table 6 of Hu et al., 2022.
  Apply LoRA with a small rank (e.g., $r = 4$ or $r = 8$); increasing $r$ further does not improve performance, suggesting the update matrix has a very small intrinsic rank.
  Applying LoRA with different ranks, from Table 18 of Hu et al., 2022.
- Does $\Delta W$ highly correlate with $W$?
  Settings:
  - Learn adaptation matrices $A_{r=8}$ and $A_{r=64}$ with $r = 8$ and $r = 64$, respectively.
  - Perform singular value decomposition (SVD) and obtain the right-singular unitary matrices $U_{A_{r=8}}$ and $U_{A_{r=64}}$.
  - Use a normalized subspace similarity based on the Grassmann distance, $\phi(A_{r=8}, A_{r=64}, i, j) = \frac{\lVert U_{A_{r=8}}^{i\top} U_{A_{r=64}}^{j} \rVert_F^2}{\min(i, j)} \in [0, 1]$, to measure how much of the subspace spanned by the top-$i$ singular vectors of $A_{r=8}$ is contained in the subspace spanned by the top-$j$ singular vectors of $A_{r=64}$ (a code sketch follows this list).
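A small sketch of this normalized similarity measure, with random matrices standing in for the learned $A_{r=8}$ and $A_{r=64}$ (function and variable names are illustrative):

```python
import torch

def subspace_similarity(A_small, A_large, i, j):
    """phi(A_small, A_large, i, j) = ||U_small[:, :i]^T U_large[:, :j]||_F^2 / min(i, j)."""
    # Right-singular vectors of each adaptation matrix (columns of V = Vh^T).
    U_small = torch.linalg.svd(A_small, full_matrices=False).Vh.T  # (k, r_small)
    U_large = torch.linalg.svd(A_large, full_matrices=False).Vh.T  # (k, r_large)
    overlap = U_small[:, :i].T @ U_large[:, :j]
    return overlap.norm(p="fro").pow(2).item() / min(i, j)

A_r8, A_r64 = torch.randn(8, 1024), torch.randn(64, 1024)  # stand-ins for learned A's
print(subspace_similarity(A_r8, A_r64, i=1, j=1))  # value in [0, 1]
```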
Subspace similarity between $A_{r=8}$ and $A_{r=64}$, from Figure 3 of Hu et al., 2022.
Observations:
- The directions corresponding to the top singular vector overlap significantly between $A_{r=8}$ and $A_{r=64}$, while the others do not.
- These top singular-vector directions are the most useful, while the others contain mostly random noise.
- $A_{r=8}$ and $A_{r=64}$ share a subspace of dimension 1 with normalized similarity $> 0.5$, which explains the experimental finding that $r = 1$ can still perform quite well.
Subspace similarity between two $A_{r=64}$ fine-tuned with different random seeds, from Figure 4 of Hu et al., 2022.
Observations:
- Similarly, the top singular-vector directions are the most useful, while the others contain mostly random noise.
Settings:
- Project $W$ onto the $r$-dimensional subspace of $\Delta W$ by computing $U^\top W V^\top$, where $U$ and $V$ are the left- and right-singular-vector matrices of $\Delta W$, then compare $\lVert U^\top W V^\top \rVert_F$ with $\lVert W \rVert_F$ (see the sketch below).
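A sketch of this projection and the resulting amplification factor, with random stand-ins for $W$ and $\Delta W$ (the helper name is illustrative):

```python
import torch

def amplification_factor(W, delta_W, r):
    """||delta_W||_F / ||U^T W V^T||_F, with U, V from the top-r SVD of delta_W."""
    U, S, Vh = torch.linalg.svd(delta_W, full_matrices=False)
    U_r, Vh_r = U[:, :r], Vh[:r, :]
    # Project W onto the r-dimensional subspace spanned by delta_W's singular directions.
    proj_norm = (U_r.T @ W @ Vh_r.T).norm(p="fro")
    return (delta_W.norm(p="fro") / proj_norm).item()

W = torch.randn(768, 768)                            # stand-in pre-trained weight
delta_W = torch.randn(768, 4) @ torch.randn(4, 768)  # stand-in rank-4 update
print(amplification_factor(W, delta_W, r=4))
```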
Correlation between $\Delta W$ and $W$, and the amplification factor of $\Delta W$, from Table 7 of Hu et al., 2022.
Observations:
- $\Delta W$ has a stronger correlation with $W$ ($\lVert U^\top W_q V^\top \rVert_F = 0.32$ for $r = 4$) compared to a random matrix ($0.02$), indicating that $\Delta W$ amplifies some features that are already in $W$.
- Instead of repeating the top singular directions of $W$, $\Delta W$ only amplifies directions that are not emphasized in $W$ ($0.32 \ll 21.67$, the norm obtained when projecting $W_q$ onto its own top singular directions).
- The amplification factor is rather huge: $21.5 \approx 6.91 / 0.32$ for $r = 4$.
- Conclusion. The low-rank adaptation matrix potentially amplifies the important features for specific downstream tasks that were learned but not emphasized in the general pre-training model.
Subspace similarity between $W_q$ and $\Delta W_q$ with different $r$, from Figure 8 of Hu et al., 2022.
Official Resources
- [ICLR 2022] LoRA: Low-Rank Adaptation of Large Language Models [arxiv][code][paper][video] (citations: 203, 253, as of 2023-03-22)
Community Resources
- Low-Rank Adaptation of Large Language Models (LoRA), by Hugging Face