
Ming-Hsuan Yang
[advanced] Recent Advances in Multimodal Understanding and Generation
Summary
This course provides an overview of recent advances in multimodal understanding and generation, with a primary emphasis on computer vision. It introduces core modeling paradigms, including vision-language representation learning, multimodal fusion, and generative frameworks, while highlighting practical system design and emerging research trends. The course also examines reasoning mechanisms in multimodal models, such as grounding, planning, and compositional inference, and discusses open challenges and future directions toward more capable and general multimodal intelligence.
Syllabus
Part 1: Fundamentals and Multimodal Understanding
- Core foundations
  - Transformers
  - Tokenization across modalities
- Multimodal understanding foundations
  - Alignment, fusion, and grounding
  - Dense perception and semantic alignment
- Representation geometry
  - Contrastive learning
  - Embedding structure
  - CLIP paradigm (see the sketch after this list)
- Architecture evolution
  - Dual encoders, cross-attention, and multimodal large language models
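To make the contrastive learning and CLIP paradigm items above concrete, here is a minimal sketch of dual-encoder image-text alignment with a symmetric contrastive (InfoNCE) loss. It is not taken from the course materials: the PyTorch module names, toy dimensions, and random features are illustrative assumptions, with single linear layers standing in for real vision and text towers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyDualEncoder(nn.Module):
    """Hypothetical dual encoder: separate image/text towers projected into a shared space."""
    def __init__(self, img_dim=512, txt_dim=384, embed_dim=256):
        super().__init__()
        self.image_proj = nn.Linear(img_dim, embed_dim)        # stand-in for a vision tower
        self.text_proj = nn.Linear(txt_dim, embed_dim)         # stand-in for a text tower
        self.logit_scale = nn.Parameter(torch.tensor(2.659))   # learnable log-temperature

    def forward(self, image_feats, text_feats):
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        txt = F.normalize(self.text_proj(text_feats), dim=-1)
        # Pairwise cosine similarities, scaled by the temperature.
        return self.logit_scale.exp() * img @ txt.t()

def contrastive_loss(logits):
    """Symmetric InfoNCE: matched image-text pairs sit on the diagonal."""
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage with random features in place of real encoder outputs.
model = ToyDualEncoder()
images, texts = torch.randn(8, 512), torch.randn(8, 384)
loss = contrastive_loss(model(images, texts))
loss.backward()
```

In the full CLIP recipe the two towers are a vision transformer and a text transformer trained on large-scale web image-text pairs; the sketch only shows the alignment objective and the shared embedding space.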
Part 2: Multimodal Generation and Reasoning
- Generation as conditional modeling
  - Diffusion fundamentals
  - Structured visual distributions
- Conditioning mechanisms
  - Cross-attention conditioning
  - Token fusion
  - ControlNet
- Multimodal LLM architecture (see the sketch after this list)
  - Vision encoder and LLM
  - Projection layers
  - Instruction tuning
- Understanding vs. generation gap
  - Hallucination
  - Reasoning limitations
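As a companion to the multimodal LLM architecture items above, the following is a minimal sketch of the common vision-encoder-plus-projection recipe: patch features from a (typically frozen) vision encoder are mapped into the language model's embedding space and prepended to the text token embeddings. The dimensions, the two-layer MLP projector, and the random stand-in features are assumptions for illustration, not the course's reference implementation.

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Hypothetical MLP projector from vision-encoder features to the LLM embedding width."""
    def __init__(self, vision_dim=1024, llm_dim=768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats):        # (batch, num_patches, vision_dim)
        return self.proj(patch_feats)      # (batch, num_patches, llm_dim)

# Stand-ins for frozen vision-encoder patch features and LLM text-token embeddings;
# a real system would use e.g. a ViT image encoder and a decoder-only LLM.
patch_feats = torch.randn(2, 16, 1024)
text_embeds = torch.randn(2, 10, 768)

projector = VisionToLLMProjector()
visual_tokens = projector(patch_feats)

# The fused input sequence fed to the LLM: visual tokens followed by text tokens.
llm_inputs = torch.cat([visual_tokens, text_embeds], dim=1)
print(llm_inputs.shape)  # torch.Size([2, 26, 768])
```

Instruction tuning then trains the projector (and often parts of the LLM) on image-instruction-response data.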
Part 3: Advanced and Recent Topics
- Unified multimodal models
  - Unified token space (see the sketch after this list)
  - Scaling hypothesis
- Temporal and 3D multimodal understanding
  - Video-language models
  - Long temporal context
- Multimodal world models and agents
  - Perception, prediction, and action
  - Robotics and autonomous driving
- Efficiency and training paradigms
  - Distillation
  - Modular architectures
- Open problems
  - Geometry-aware multimodal models
  - Reliable grounding
  - Dense reasoning
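To illustrate the unified token space item above, here is a minimal sketch of how text tokens and discrete image codes (for example, from a VQ-style image tokenizer) can share one vocabulary, so a single autoregressive model reads and writes both modalities as one token stream. The vocabulary sizes and random ids are purely illustrative assumptions.

```python
import torch

# Text tokens occupy ids [0, text_vocab); discrete image codes are shifted to
# occupy ids [text_vocab, text_vocab + codebook_size), forming one shared vocabulary.
text_vocab, codebook_size = 32000, 8192

text_ids = torch.randint(0, text_vocab, (1, 12))         # stand-in for tokenized text
image_codes = torch.randint(0, codebook_size, (1, 64))   # stand-in for VQ image codes
image_ids = image_codes + text_vocab                     # shift into the shared id range

# One interleaved sequence that a single autoregressive transformer could model.
unified_sequence = torch.cat([text_ids, image_ids], dim=1)
print(unified_sequence.shape)  # torch.Size([1, 76])
```

Several recent unified multimodal models follow this basic recipe, pairing it with an image tokenizer/detokenizer so the same token stream supports both understanding and generation.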
References
Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey.
Lecture notes from Stanford CME 295, CSE 231, and UW CSE 392G1; the CVPR 2022 Tutorial on Multimodal Machine Learning; the NeurIPS 2022 Tutorial on Flamingo; and the CVPR 2023 Multimodal Large Language Model Tutorial.
Pre-requisites
Participants are expected to have first-year graduate-level knowledge in linear algebra, machine learning, deep learning, and computer vision. Familiarity with neural network architectures (e.g., CNNs and transformers), optimization methods, and basic probabilistic modeling is recommended. Prior exposure to vision-language models or generative models is helpful but not required.
Short bio
Ming-Hsuan Yang is a Professor at the University of California, Merced, and a Research Scientist at Google DeepMind. His research has received numerous honors, including the Google Faculty Award (2009), NSF CAREER Award (2012), NVIDIA Pioneer Research Awards (2017, 2018), and the Sony Faculty Award (2025). He has received Best Paper Honorable Mentions at UIST 2017 and CVPR 2018, the Best Student Paper Honorable Mention at ACCV 2018, the Longuet-Higgins Prize (Test-of-Time Award) at CVPR 2023, Best Paper Award at ICML 2024, and the Test-of-Time Award from WACV 2025. He has been recognized as a Highly Cited Researcher from 2018 to 2025. Prof. Yang currently serves as Associate Editor-in-Chief of IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) and as an Associate Editor of the International Journal of Computer Vision (IJCV). He is a Fellow of IEEE, ACM, AAAI, and AAAS.