Louis-Philippe Morency
[intermediate/advanced] Multimodal Machine Learning
Summary
Multimodal machine learning is a vibrant multi-disciplinary research field that addresses some of the original goals of AI by integrating and modeling multiple communicative modalities, including linguistic, acoustic, and visual messages. From the initial research on audio-visual speech recognition to the more recent language and vision projects such as image and video captioning, visual question answering, and language-guided reinforcement learning, this research field poses unique challenges for multimodal researchers given the heterogeneity of the data and the contingency often found between modalities. This course will teach fundamental mathematical concepts related to multimodal machine learning, including multimodal representation, alignment, fusion, reasoning, and quantification. We will also review recent papers describing state-of-the-art multimodal models and computational algorithms that address these technical challenges.
Syllabus
Introduction
- What is Multimodal? Historical view; multimodal vs. multimedia.
- Multimodal applications and datasets: image captioning, video description, AVSR, affect recognition, multimodal RL.
- Core technical challenges: representation, alignment, reasoning, generation, co-learning, and quantification.
Unimodal representations
- Language representations: Distributional hypothesis and text embeddings.
- Visual representations: Convolutional networks, self-attention models.
- Acoustic representations: Spectrograms, auto-encoders.
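To make the acoustic bullet concrete, here is a minimal sketch (not taken from the course materials) that computes a log-mel spectrogram from a raw waveform with PyTorch and torchaudio; the sample rate, window, and mel-bin settings are illustrative defaults.

    import torch
    import torchaudio

    # One second of placeholder "audio" at 16 kHz (channel dimension first).
    waveform = torch.randn(1, 16000)

    # Log-mel spectrogram: a standard frame-level acoustic representation.
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000, n_fft=400, hop_length=160, n_mels=80
    )
    log_mel = torch.log(mel(waveform) + 1e-6)
    print(log_mel.shape)  # torch.Size([1, 80, 101]): (channel, mel bins, frames)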
Multimodal representations
- Representation fusion: visuo-linguistic spaces, multimodal auto-encoders, fusion strategies (see the fusion sketch after this list).
- Representation coordination: similarity metrics, canonical correlation analysis, multimodal transformers.
- Representation fission: factorization, component analysis, disentanglement.
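As a concrete illustration of representation fusion, the sketch below fuses pre-extracted visual and text feature vectors by concatenation followed by a small MLP; all names and dimensions are illustrative, not the course's reference implementation.

    import torch
    import torch.nn as nn

    class LateFusionClassifier(nn.Module):
        # Fusion by concatenation of pre-extracted visual and text features.
        def __init__(self, visual_dim=2048, text_dim=768, hidden_dim=512, num_classes=10):
            super().__init__()
            self.fusion = nn.Sequential(
                nn.Linear(visual_dim + text_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, num_classes),
            )

        def forward(self, visual_feats, text_feats):
            # Concatenate along the feature dimension, then classify.
            fused = torch.cat([visual_feats, text_feats], dim=-1)
            return self.fusion(fused)

    # Usage with random placeholder features (batch of 4).
    model = LateFusionClassifier()
    logits = model(torch.randn(4, 2048), torch.randn(4, 768))
    print(logits.shape)  # torch.Size([4, 10])

Coordination approaches, by contrast, keep the two representations in separate but related spaces and connect them through similarity objectives such as canonical correlation analysis or contrastive losses.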
Modality alignment
- Latent alignment approaches: Attention models, multimodal transformers, multi-instance learning (see the cross-modal attention sketch after this list).
- Explicit alignment: Dynamic time warping.
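The sketch below illustrates latent alignment with cross-modal attention: word features act as queries over a set of visual region features, and the attention weights form a soft word-to-region alignment. All dimensions and sequence lengths are made up for the example.

    import torch
    import torch.nn as nn

    d_model = 256
    cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)

    text_feats = torch.randn(2, 12, d_model)    # batch of 2 sentences, 12 tokens each
    visual_feats = torch.randn(2, 36, d_model)  # 36 visual regions per image

    # Queries come from language; keys and values come from vision.
    grounded, attn_weights = cross_attn(query=text_feats, key=visual_feats, value=visual_feats)
    print(grounded.shape)      # torch.Size([2, 12, 256]): visually grounded word features
    print(attn_weights.shape)  # torch.Size([2, 12, 36]): soft word-to-region alignment

Explicit alignment methods such as dynamic time warping instead search for a (typically monotonic) alignment path between two given sequences rather than learning the correspondence latently.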
Multimodal reasoning
- Hierarchical and graphical representations.
- Leveraging external data: external knowledge bases, commonsense reasoning.
Multimodal co-learning & generation
- Modality transfer: Cross-modal domain adaptation, few-shot learning.
- Compression (multimodal summarization), transduction (multimodal style transfer), and creation (multimodal conditional generation).
Multimodal quantification
- Dataset biases: social biases, spurious correlations.
- Model biases: modality collapse, robustness, optimization challenges, interpretability.
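As one hypothetical quantification probe (not prescribed by the course), the sketch below measures how much a fused model relies on the text modality by zeroing it out and counting how many predictions change.

    import torch
    import torch.nn as nn

    # A toy fused classifier over concatenated visual (2048-d) and text (768-d) features.
    model = nn.Sequential(nn.Linear(2048 + 768, 512), nn.ReLU(), nn.Linear(512, 10))

    visual = torch.randn(32, 2048)
    text = torch.randn(32, 768)

    with torch.no_grad():
        full = model(torch.cat([visual, text], dim=-1)).argmax(dim=-1)
        no_text = model(torch.cat([visual, torch.zeros_like(text)], dim=-1)).argmax(dim=-1)

    # Fraction of predictions that flip when text is ablated: a crude indicator
    # of the model's reliance on the text modality.
    print((full != no_text).float().mean().item())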
Future directions and conclusion
References
15-week course on Multimodal Machine Learning, including all video lectures:
https://cmu-multicomp-lab.github.io/mmml-course/fall2020/
Baltrušaitis, T., Ahuja, C., & Morency, L. P. (2018). Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2), 423-443.
https://arxiv.org/abs/1705.09406
Reading list and lecture slides for the Fall 2021 edition of the CMU Multimodal Machine Learning course:
https://piazza.com/cmu/fall2021/11777/resources
Pre-requisites
We expect the audience to have an introductory background in machine learning and deep learning, including basic familiarity with commonly used unimodal building blocks such as convolutional, recurrent, and self-attention models. We also expect an understanding of math, CS, and programming at an introductory graduate level.
Short bio
Louis-Philippe Morency is an Associate Professor in the Language Technologies Institute at Carnegie Mellon University, where he leads the Multimodal Communication and Machine Learning Laboratory (MultiComp Lab). He was formerly research faculty in the Computer Science Department at the University of Southern California and received his Ph.D. from the MIT Computer Science and Artificial Intelligence Laboratory. His research focuses on building the computational foundations that enable computers to analyze, recognize, and predict subtle human communicative behaviors during social interactions. He has received numerous awards, including AI's 10 to Watch from IEEE Intelligent Systems, the NetExplo Award in partnership with UNESCO, and 10 best-paper awards at IEEE and ACM conferences. His research has been covered by media outlets such as The Wall Street Journal, The Economist, and NPR. He currently chairs the advisory committee of the ACM International Conference on Multimodal Interaction and is an associate editor of IEEE Transactions on Affective Computing.