Atlas Wang
[intermediate] Low Rank Strikes Back in the Era of Large Language Models
Summary
This tutorial explores the growing importance of low-rank approximation techniques for large language models (LLMs). The sessions cover theoretical foundations, empirical observations, and practical applications of low-rank structure for improving the efficiency, interpretability, and robustness of LLMs. Topics include attention approximation, weight compression, gradient projection, and low-rank fine-tuning. Participants will gain insight into how low-rank methods reduce computational and memory costs and sharpen the mechanistic understanding of LLMs.
Syllabus
Session I: Low-Rank Attention Approximation
- Overview of attention mechanisms in LLMs.
- Computational challenges and low-rank approximation solutions (a code sketch follows this outline).
- Recent advances connecting low-rank attention with state-space models and efficient inference.
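To make the Session I material concrete, here is a minimal PyTorch sketch of Linformer-style low-rank attention (Wang et al., 2020), in which learned projections compress the key/value sequence dimension. The tensor shapes, random projection matrices, and function name are illustrative assumptions, not code from any of the referenced papers.

```python
import torch

def low_rank_attention(q, k, v, proj_k, proj_v):
    """Linformer-style sketch: project keys/values along the sequence
    dimension so the attention map is (seq_len x rank) rather than
    (seq_len x seq_len)."""
    d = q.size(-1)
    k_low = proj_k @ k                                  # (batch, rank, d)
    v_low = proj_v @ v                                  # (batch, rank, d)
    scores = q @ k_low.transpose(-2, -1) / d ** 0.5     # (batch, seq, rank)
    return torch.softmax(scores, dim=-1) @ v_low        # (batch, seq, d)

# Toy usage with made-up sizes: sequence length 128 compressed to rank 16.
batch, n, d, r = 2, 128, 64, 16
q, k, v = (torch.randn(batch, n, d) for _ in range(3))
E = torch.randn(r, n) / n ** 0.5    # learned in practice; random here
F = torch.randn(r, n) / n ** 0.5
out = low_rank_attention(q, k, v, E, F)
print(out.shape)                    # torch.Size([2, 128, 64])
```

For a fixed rank, both the time and memory of the attention map then scale linearly rather than quadratically in sequence length.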
Session II: Low-Rank Gradient Structures
- Emergent low-rank structures in gradients during training.
- Gradient low-rank projection (GaLore) for memory-efficient training (sketched after this outline).
- Convergence analysis and empirical evaluations.
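As a rough companion to Session II, the following is a minimal sketch of gradient low-rank projection in the spirit of GaLore (Zhao et al., 2024). It is not the reference implementation: plain SGD stands in for Adam-style state in the projected space, the projection is refreshed every step rather than periodically, and all sizes are placeholders.

```python
import torch

def galore_style_step(weight, grad, rank, lr=1e-2):
    """Sketch of gradient low-rank projection: project the gradient onto its
    top-`rank` left singular subspace, take an optimizer step in that small
    space, then project the update back to the full weight."""
    U, _, _ = torch.linalg.svd(grad, full_matrices=False)
    P = U[:, :rank]                  # (m, rank) projection, refreshed every step here
    g_low = P.T @ grad               # (rank, n) compressed gradient
    update_low = lr * g_low          # plain SGD stands in for Adam-style state
    weight -= P @ update_low         # project back and apply
    return weight

# Toy usage on a made-up 256 x 128 weight matrix.
W, G = torch.randn(256, 128), torch.randn(256, 128)
W = galore_style_step(W, G, rank=8)
print(W.shape)                       # torch.Size([256, 128])
```

The memory saving comes from keeping optimizer state only for the (rank x n) projected gradient instead of the full (m x n) matrix.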
Session III: Low-Rank Structures in Weights and Features
- Matrix and tensor decomposition for compression and fine-tuning (a truncated-SVD sketch follows this outline).
- Phenomena of low-rank collapse in token spaces.
- Generalization and safety implications of low-rank modifications.
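The weight-decomposition idea in Session III can be illustrated with a plain truncated SVD; the sketch below uses arbitrary layer sizes, and methods covered in the session (e.g., ASVD) refine this by weighting the decomposition with activation statistics, which is not shown here.

```python
import torch

def truncated_svd_compress(W, rank):
    """Sketch of low-rank weight compression: replace an (m x n) matrix by
    two factors of shapes (m x rank) and (rank x n), cutting the parameter
    count from m*n to rank*(m + n)."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]       # fold singular values into the left factor
    B = Vh[:rank, :]
    return A, B

# Toy usage: compress a made-up 512 x 512 layer to rank 32.
W = torch.randn(512, 512)
A, B = truncated_svd_compress(W, rank=32)
rel_err = torch.linalg.norm(W - A @ B) / torch.linalg.norm(W)
print(A.shape, B.shape, f"relative error {rel_err:.2f}")
```

LoRA-style fine-tuning is the complementary idea: the pretrained weight stays frozen and only a similar pair of low-rank factors, added on top of it, is trained.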
Open Research Questions
- Interplay of low-rankness, sparsity, and quantization.
- Mechanistic interpretability and theoretical understanding.
References
John Wright and Yi Ma. High-dimensional data analysis with low-dimensional models: Principles, computation, and applications. Cambridge University Press, 2022.
Emmanuel J. Candès, Xiaodong Li, Yi Ma, and John Wright. Robust principal component analysis? Journal of the ACM, 58(3):1–37, 2011.
Ehsan Elhamifar and René Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(11):2765–2781, 2013.
Mingxue Xu, Yao Lei Xu, and Danilo P. Mandic. TensorGPT: Efficient compression of the embedding layer in LLMs based on the tensor-train decomposition. https://arxiv.org/pdf/2307.00526, 2023.
Zhihang Yuan, Yuzhang Shang, Yue Song, Qiang Wu, Yan Yan, and Guangyu Sun. ASVD: Activation-aware singular value decomposition for compressing large language models. https://arxiv.org/pdf/2312.05821, 2023.
Ayush Kaushal, Tejas Vaidhya, and Irina Rish. LORD: Low rank decomposition of monolingual code LLMs for one-shot compression. https://arxiv.org/pdf/2309.14021, 2023.
Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. https://arxiv.org/pdf/2006.04768, 2020.
Beidi Chen, Tri Dao, Eric Winsor, Zhao Song, Atri Rudra, and Christopher Ré. Scatterbrain: Unifying sparse and low-rank attention. NeurIPS, 34:17413–17426, 2021.
Harry Dong, Xinyu Yang, Zhenyu Zhang, Zhangyang Wang, Yuejie Chi, and Beidi Chen. Get more with less: Synthesizing recurrence with KV cache compression for efficient LLM inference. ICML, 2024.
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. ICLR, 2022.
Soufiane Hayou, Nikhil Ghosh, and Bin Yu. LoRA+: Efficient low rank adaptation of large models. ICML, 2024.
Rui Pan, Xiang Liu, Shizhe Diao, Renjie Pi, Jipeng Zhang, Chi Han, and Tong Zhang. LISA: Layerwise importance sampling for memory-efficient large language model fine-tuning. https://arxiv.org/pdf/2403.17919, 2024.
Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng, and Min-Hung Chen. DoRA: Weight-decomposed low-rank adaptation. ICML, 2024.
Vladislav Lialin, Sherin Muckatira, Namrata Shivagunde, and Anna Rumshisky. ReLoRA: High-rank training through low-rank updates. ICLR, 2024.
Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian. GaLore: Memory-efficient LLM training by gradient low-rank projection. ICML, 2024.
Zi Yang, Samridhi Choudhary, Xinfeng Xie, Cao Gao, Siegfried Kunzmann, and Zheng Zhang. CoMERA: Computing- and memory-efficient training via rank-adaptive tensor optimization. https://arxiv.org/pdf/2405.14377, 2024.
Boyi Wei, Kaixuan Huang, Yangsibo Huang, Tinghao Xie, Xiangyu Qi, Mengzhou Xia, Prateek Mittal, Mengdi Wang, and Peter Henderson. Assessing the brittleness of safety alignment via pruning and low-rank modifications. ICML, 2024.
Srinadh Bhojanapalli, Chulhee Yun, Ankit Singh Rawat, Sashank Reddi, and Sanjiv Kumar. Low-rank bottleneck in multi-head attention models. ICML, PMLR 119:864–873, 2020.
Jialin Mao, Itay Griniasty, Han Kheng Teoh, Rahul Ramesh, Rubing Yang, Mark K. Transtrum, James P. Sethna, and Pratik Chaudhari. The training process of many deep networks explores the same low-dimensional manifold. PNAS, 121(12):e2310002121, 2024.
Vardan Papyan, X.Y. Han, and David L. Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. PNAS, 117(40):24652–24663, 2020.
Yihe Dong, Jean-Baptiste Cordonnier, and Andreas Loukas. Attention is not all you need: Pure attention loses rank doubly exponentially with depth. ICML, PMLR 139:2793–2803, 2021.
Pratyusha Sharma, Jordan T. Ash, and Dipendra Misra. The truth is in there: Improving reasoning in language models with layer-selective rank reduction. ICLR, 2024.
Dan Biderman, Jose Gonzalez Ortiz, Jacob Portes, Mansheej Paul, Philip Greengard, Connor Jennings, Daniel King, Sam Havens, Vitaliy Chiley, Jonathan Frankle, et al. LoRA learns less and forgets less. https://arxiv.org/pdf/2405.09673, 2024.
Pre-requisites
Basic understanding of machine learning principles, including neural networks and language models. Familiarity with attention mechanisms and optimization techniques. Foundational knowledge in linear algebra and matrix decompositions is helpful but not mandatory.
Short bio
Professor Zhangyang “Atlas” Wang is a tenured Associate Professor at The University of Texas at Austin, holding the Temple Foundation Endowed Faculty Fellowship. He is currently on leave to serve as Research Director for XTX Markets, leading AI innovations in algorithmic trading. His research spans machine learning, optimization, generative AI, and neurosymbolic AI, with a focus on low-dimensional representations for efficient and reliable learning. Prof. Wang has received numerous awards, including the NSF CAREER Award and IEEE AI’s 10 To Watch, and has mentored students who have won many prestigious fellowships. He is an ACM Distinguished Speaker and IEEE Senior Member. See his full bio at: https://vita-group.github.io/research.html.