Yingbin Liang
[intermediate/advanced] Theory on Training Dynamics of Transformers
Summary
Transformers, as foundation models, have recently revolutionized many machine learning (ML) applications. Alongside their tremendous empirical successes, theoretical studies have emerged to explain why transformers can be trained to achieve such remarkable performance. This tutorial aims to provide an overview of recent theoretical investigations that characterize the training dynamics of transformer-based ML models. It will also explain the primary techniques and tools employed in these analyses, which draw on information-theoretic concepts as well as tools from learning theory, stochastic optimization, dynamical systems, and probability.
Syllabus
The tutorial will begin with an introduction to basic transformer models, and then delve into several ML problems in which transformers have found extensive application, such as in-context learning, next-token prediction, and self-supervised learning. For each learning problem, the tutorial will cover the problem formulation, the main theoretical techniques for characterizing the training process, the convergence guarantees and the optimality of the attention models at convergence, the implications for the learning problem, and the insights and guidelines they offer for practical solutions. Finally, the tutorial will discuss future directions and open problems in this actively evolving field.
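To give a flavor of the settings analyzed in this line of work, the sketch below trains a one-layer softmax-attention model by gradient-based optimization on an in-context linear regression task, one of the canonical problems in the references listed below. It is a minimal, hypothetical illustration rather than the construction used in the tutorial or in the cited papers; the token format, the single-layer architecture, and the optimizer are illustrative assumptions.

```python
# Hypothetical minimal sketch (not the tutorial's construction): gradient-based
# training of a one-layer softmax-attention model on in-context linear regression.
import torch

torch.manual_seed(0)
d, n_ctx, batch = 5, 20, 64          # feature dim, context length, batch size


def sample_prompts(batch, n_ctx, d):
    """Each prompt holds n_ctx labeled examples plus one query; y = <w, x>."""
    w = torch.randn(batch, d, 1)              # fresh task vector per prompt
    x = torch.randn(batch, n_ctx + 1, d)      # last row is the query input
    y = (x @ w).squeeze(-1)                   # labels, including the query label
    return x, y


class OneLayerAttention(torch.nn.Module):
    """Single softmax-attention layer reading (x_i, y_i) tokens and a query
    token (x_query, 0), outputting a scalar prediction for y_query."""

    def __init__(self, d):
        super().__init__()
        self.WQ = torch.nn.Linear(d + 1, d + 1, bias=False)
        self.WK = torch.nn.Linear(d + 1, d + 1, bias=False)
        self.WV = torch.nn.Linear(d + 1, 1, bias=False)

    def forward(self, x, y):
        y_in = y.clone()
        y_in[:, -1] = 0.0                      # the query's label is hidden
        tokens = torch.cat([x, y_in.unsqueeze(-1)], dim=-1)  # (B, n_ctx+1, d+1)
        q = self.WQ(tokens[:, -1:, :])                        # query token only
        k, v = self.WK(tokens), self.WV(tokens)
        scores = q @ k.transpose(1, 2) / tokens.size(-1) ** 0.5
        attn = torch.softmax(scores, dim=-1)
        return (attn @ v).squeeze(-1).squeeze(-1)             # predicted y_query


model = OneLayerAttention(d)
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for step in range(2000):
    x, y = sample_prompts(batch, n_ctx, d)
    loss = ((model(x, y) - y[:, -1]) ** 2).mean()   # squared loss on the query
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 500 == 0:
        print(f"step {step:4d}  loss {loss.item():.4f}")
```

How the training loss of such toy attention models evolves, and what the attention pattern looks like at convergence, are examples of the questions the tutorial addresses.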
References
Yu Huang, Yuan Cheng, Yingbin Liang. “In-context convergence of transformers”, Proc. International Conference on Machine Learning (ICML), 2024.
Tong Yang, Yu Huang, Yingbin Liang, Yuejie Chi. “In-context learning with representations: Contextual generalization of trained transformers”, Proc. Advances in Neural Information Processing Systems (NeurIPS), 2024.
Ruiquan Huang, Yingbin Liang, Jing Yang. “Non-asymptotic convergence of training transformers for next-token prediction”, Proc. Advances in Neural Information Processing Systems (NeurIPS), 2024.
Yu Huang, Zixin Wen, Yuejie Chi, Yingbin Liang. “How transformers learn diverse attention correlations in masked vision pretraining”, arXiv:2403.02233, 2024.
Pre-requisites
Basics of deep learning, familiarity with language models (preferred), basics of optimization, and probability theory.
Short bio
Dr. Yingbin Liang is currently a Professor in the Department of Electrical and Computer Engineering at The Ohio State University (OSU) and a core faculty member of the Ohio State Translational Data Analytics Institute (TDAI). She also serves as the Deputy Director of the AI-EDGE Institute at OSU. Dr. Liang received her Ph.D. degree in Electrical Engineering from the University of Illinois at Urbana-Champaign in 2005, and served on the faculty of the University of Hawaii and Syracuse University before joining OSU. Her research interests include machine learning, optimization, information theory, and statistical signal processing. Dr. Liang received the National Science Foundation CAREER Award and the State of Hawaii Governor Innovation Award in 2009, and the EURASIP Best Paper Award in 2014. She is an IEEE Fellow.