Lu Jiang
[introductory/intermediate] Transformers for Image and Video Generation: Fundamentals, Design, and Innovations
Summary
The course explores recent topics in visual generation with deep learning, including transformers and Variational AutoEncoders (VAEs). Participants will gain insight into the evolution of transformers for image and video generation, along with practical lessons and advanced training techniques. The sessions are tailored to introductory and intermediate levels, covering non-autoregressive, autoregressive, and diffusion-based transformer methods, as well as representation learning with VAEs.
Syllabus
1. Transformers for Visual Generation – A Personal Journey [introductory]
- History and resurgence of transformers in image and video generation
- Fundamentals and challenges in transformer-based video generation
- Autoregressive vs. non-autoregressive transformers (see the decoding sketch after this list)
- Autoregressive/LLM-based approaches (e.g., VideoPoet)
- Diffusion-based transformers (e.g., WALT)
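To make the autoregressive vs. non-autoregressive contrast concrete, here is a minimal, hypothetical PyTorch sketch. The codebook size, canvas length, and the random-logit `toy_transformer` are placeholders standing in for a trained model such as VideoPoet (autoregressive) or a MaskGIT-style masked transformer; none of these names or numbers come from the course materials.

```python
import torch

# Toy sketch (not from the course materials): contrast left-to-right
# autoregressive decoding with MaskGIT-style parallel masked decoding over a
# 1D "canvas" of visual tokens. toy_transformer emits random logits and
# stands in for a trained network.

VOCAB, N_TOKENS, MASK_ID = 1024, 16, 1024  # codebook size, canvas length, [MASK] id

def toy_transformer(tokens: torch.Tensor) -> torch.Tensor:
    """Return per-position logits over the codebook (random in this sketch)."""
    return torch.randn(tokens.shape[0], VOCAB)

def decode_autoregressive() -> torch.Tensor:
    """One token per forward pass, left to right (VideoPoet-style)."""
    tokens = torch.full((N_TOKENS,), MASK_ID)
    for i in range(N_TOKENS):
        logits = toy_transformer(tokens)
        tokens[i] = torch.distributions.Categorical(logits=logits[i]).sample()
    return tokens

def decode_masked_parallel(steps: int = 4) -> torch.Tensor:
    """Predict all masked positions in parallel and keep the most confident
    predictions at each step (non-autoregressive, MaskGIT-style)."""
    tokens = torch.full((N_TOKENS,), MASK_ID)
    for _ in range(steps):
        logits = toy_transformer(tokens)
        conf, pred = logits.softmax(-1).max(-1)
        conf[tokens != MASK_ID] = -1.0        # never overwrite committed tokens
        keep = conf.topk(max(1, N_TOKENS // steps)).indices
        tokens[keep] = pred[keep]             # commit the most confident tokens
    tokens[tokens == MASK_ID] = pred[tokens == MASK_ID]  # fill any leftovers
    return tokens

print("autoregressive:", decode_autoregressive())
print("masked parallel:", decode_masked_parallel())
```

The autoregressive decoder needs one forward pass per token, while the masked decoder needs only a handful of passes in total, which is the main efficiency argument for non-autoregressive generation.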
2. Introduction to the Representation of Visual Generation using Variational AutoEncoders (VAEs) [intermediate]
- Overview of VAEs in image and video generation; VAE objectives (the standard ELBO form is sketched after this list)
- Architectures and models (discrete/continuous representations)
- Challenges in designing effective VAEs
- Notable work on VAEs and advanced training techniques
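For reference, the "VAE objectives" item above typically refers to maximizing the evidence lower bound (ELBO); the standard form, written here as background rather than as the exact objective the lectures use, is:

```latex
\mathcal{L}(\theta, \phi; x)
  = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]}_{\text{reconstruction}}
  \;-\; \underbrace{D_{\mathrm{KL}}\!\big(q_\phi(z \mid x) \,\|\, p(z)\big)}_{\text{regularization toward the prior}}
```

Discrete-representation variants such as VQ-VAE (see "Neural Discrete Representation Learning" in the references) replace the KL term with codebook and commitment losses over a learned set of discrete tokens.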
References
Generative Pretraining From Pixels
Neural Discrete Representation Learning
Zero-Shot Text-to-Image Generation
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
VideoPoet: A Large Language Model for Zero-Shot Video Generation
MaskGIT: Masked Generative Image Transformer
Muse: Text-To-Image Generation via Masked Generative Transformers
Phenaki: Variable Length Video Generation From Open Domain Textual Description
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Photorealistic Video Generation with Diffusion Models
Video generation models as world simulators
Pre-requisites
A preliminary understanding of visual generation concepts, including basic knowledge of transformers and VAEs, is recommended.
Short bio
Lu Jiang is currently a research lead at ByteDance USA. Prior to this, he served as a staff research scientist and manager at Google. His research has been integral to multiple Google products, such as YouTube, Cloud, AutoML, Ads, Waymo, and Translate, impacting the daily lives of billions of users worldwide. His research interests lie in the interdisciplinary field of multimedia and machine learning, with a focus on video creation and multimodal foundation models. His work has received Best Paper awards at top venues such as ICML and IJCAI-JAIR, and has been nominated for Best Paper awards at ACL and CVPR. Lu is an active member of the research community, serving as an AI panelist for America's Seed Fund (NSF SBIR). He regularly serves as an area chair for conferences such as CVPR, ICCV, ICML, ICLR, NeurIPS, and ACM Multimedia, and as an associate editor for CVIU, IEEE TPAMI, and TMLR.