Lu Jiang
[introductory/intermediate] Transformers for Image and Video Generation: Fundamentals, Design, and Innovations
Summary
The course explores recent topics in visual generation with deep learning, including transformers and Variational AutoEncoders (VAEs). Participants will gain insight into the evolution of transformers for image and video generation, along with practical lessons and advanced training techniques. The sessions are tailored to introductory and intermediate levels, covering non-autoregressive, autoregressive, and diffusion-based transformer methods, as well as representation learning with VAEs.
Syllabus
1. Transformers for Visual Generation – A Personal Journey [introductory]
- History and resurgence of transformers in image and video generation
- Fundamentals and challenges in transformer-based video generation
- Autoregressive vs. non-autoregressive transformers (see the decoding sketch after this list)
- Autoregressive/LLM-based approaches (e.g., VideoPoet)
- Diffusion-based transformers (e.g., WALT)
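To make the autoregressive vs. non-autoregressive contrast concrete, here is a minimal, hypothetical PyTorch sketch. The codebook size, canvas length, and the random-logit `toy_transformer` are placeholders standing in for a trained model such as VideoPoet (autoregressive) or a MaskGIT-style masked transformer; none of these names or numbers come from the course materials.

```python
import torch

# Toy sketch (not from the course materials): contrast left-to-right
# autoregressive decoding with MaskGIT-style parallel masked decoding over a
# 1D "canvas" of visual tokens. toy_transformer emits random logits and
# stands in for a trained network.

VOCAB, N_TOKENS, MASK_ID = 1024, 16, 1024  # codebook size, canvas length, [MASK] id

def toy_transformer(tokens: torch.Tensor) -> torch.Tensor:
    """Return per-position logits over the codebook (random in this sketch)."""
    return torch.randn(tokens.shape[0], VOCAB)

def decode_autoregressive() -> torch.Tensor:
    """One token per forward pass, left to right (VideoPoet-style)."""
    tokens = torch.full((N_TOKENS,), MASK_ID)
    for i in range(N_TOKENS):
        logits = toy_transformer(tokens)
        tokens[i] = torch.distributions.Categorical(logits=logits[i]).sample()
    return tokens

def decode_masked_parallel(steps: int = 4) -> torch.Tensor:
    """Predict all masked positions in parallel and keep the most confident
    predictions at each step (non-autoregressive, MaskGIT-style)."""
    tokens = torch.full((N_TOKENS,), MASK_ID)
    for _ in range(steps):
        logits = toy_transformer(tokens)
        conf, pred = logits.softmax(-1).max(-1)
        conf[tokens != MASK_ID] = -1.0        # never overwrite committed tokens
        keep = conf.topk(max(1, N_TOKENS // steps)).indices
        tokens[keep] = pred[keep]             # commit the most confident tokens
    tokens[tokens == MASK_ID] = pred[tokens == MASK_ID]  # fill any leftovers
    return tokens

print("autoregressive:", decode_autoregressive())
print("masked parallel:", decode_masked_parallel())
```

The autoregressive decoder needs one forward pass per token, while the masked decoder needs only a handful of passes in total, which is the main efficiency argument for non-autoregressive generation.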
2. Introduction to the Representation of Visual Generation using Variational AutoEncoders (VAEs) [intermediate]
- Overview of VAEs in image and video generation; VAE objectives (the standard ELBO form is sketched after this list)
- Architectures and models (discrete/continuous representations)
- Challenges in designing effective VAEs
- Notable work on VAEs and advanced training techniques
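For reference, the "VAE objectives" item above typically refers to maximizing the evidence lower bound (ELBO); the standard form, written here as background rather than as the exact objective the lectures use, is:

```latex
\mathcal{L}(\theta, \phi; x)
  = \underbrace{\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]}_{\text{reconstruction}}
  \;-\; \underbrace{D_{\mathrm{KL}}\!\big(q_\phi(z \mid x) \,\|\, p(z)\big)}_{\text{regularization toward the prior}}
```

Discrete-representation variants such as VQ-VAE (see "Neural Discrete Representation Learning" in the references) replace the KL term with codebook and commitment losses over a learned set of discrete tokens.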
References
Generative Pretraining From Pixels
Neural Discrete Representation Learning
Zero-Shot Text-to-Image Generation
Scaling Autoregressive Models for Content-Rich Text-to-Image Generation
VideoPoet: A Large Language Model for Zero-Shot Video Generation
MaskGIT: Masked Generative Image Transformer
Muse: Text-To-Image Generation via Masked Generative Transformers
Phenaki: Variable Length Video Generation From Open Domain Textual Description
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
Photorealistic Video Generation with Diffusion Models
Video generation models as world simulators
Pre-requisites
A preliminary understanding of visual generation concepts, including basic knowledge of transformers and VAEs, is recommended.
Short bio
Lu Jiang is currently a research lead at ByteDance USA. Prior to this, he served as a staff research scientist and manager at Google. His research has been integral to multiple Google products, such as YouTube, Cloud, AutoML, Ads, Waymo, and Translate, impacting the daily lives of billions of users worldwide. His research interests lie in the interdisciplinary field of multimedia and machine learning, with a focus on video creation and multimodal foundation models. His work has received Best Paper awards at top venues such as ICML and IJCAI-JAIR, and has been nominated for Best Paper awards at ACL and CVPR. Lu is an active member of the research community, serving as an AI panelist for America's Seed Fund (NSF SBIR). He regularly serves as an area chair for conferences such as CVPR, ICCV, ICML, ICLR, NeurIPS, and ACM Multimedia, and as an associate editor for CVIU, IEEE TPAMI, and TMLR.