Louis-Philippe Morency
[intermediate/advanced] Multimodal Machine Learning
Summary
Multimodal machine learning is a vibrant multi-disciplinary research field that addresses some of the original goals of AI by integrating and modeling multiple communicative modalities, including linguistic, acoustic, and visual messages. From the initial research on audio-visual speech recognition to the more recent language and vision projects such as image and video captioning, visual question answering, and language-guided reinforcement learning, this research field poses unique challenges for multimodal researchers given the heterogeneity of the data and the contingency often found between modalities. This course will teach fundamental mathematical concepts related to multimodal machine learning, including multimodal representation, alignment, fusion, reasoning, and quantification. We will also review recent papers describing state-of-the-art multimodal models and computational algorithms that address these technical challenges.
Syllabus
Introduction
- What is Multimodal? Historical view; multimodal vs. multimedia.
- Multimodal applications and datasets: image captioning, video description, AVSR, affect recognition, multimodal RL.
- Core technical challenges: representation, alignment, reasoning, generation, co-learning, and quantification.
Unimodal representations
- Language representations: Distributional hypothesis and text embeddings.
- Visual representations: Convolutional networks, self-attention models.
- Acoustic representations: Spectrograms, auto-encoders.
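To make the acoustic bullet concrete, here is a minimal sketch (not taken from the course materials) that computes a log-mel spectrogram from a raw waveform with PyTorch and torchaudio; the sample rate, window, and mel-bin settings are illustrative defaults.

    import torch
    import torchaudio

    # One second of placeholder "audio" at 16 kHz (channel dimension first).
    waveform = torch.randn(1, 16000)

    # Log-mel spectrogram: a standard frame-level acoustic representation.
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000, n_fft=400, hop_length=160, n_mels=80
    )
    log_mel = torch.log(mel(waveform) + 1e-6)
    print(log_mel.shape)  # torch.Size([1, 80, 101]): (channel, mel bins, frames)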
Multimodal representations
- Representation fusion: visuo-linguistic spaces, multimodal auto-encoders, fusion strategies (see the fusion sketch after this list).
- Representation coordination: similarity metrics, canonical correlation analysis, multimodal transformers.
- Representation fission: factorization, component analysis, disentanglement.
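As a concrete illustration of representation fusion, the sketch below fuses pre-extracted visual and text feature vectors by concatenation followed by a small MLP; all names and dimensions are illustrative, not the course's reference implementation.

    import torch
    import torch.nn as nn

    class LateFusionClassifier(nn.Module):
        # Fusion by concatenation of pre-extracted visual and text features.
        def __init__(self, visual_dim=2048, text_dim=768, hidden_dim=512, num_classes=10):
            super().__init__()
            self.fusion = nn.Sequential(
                nn.Linear(visual_dim + text_dim, hidden_dim),
                nn.ReLU(),
                nn.Linear(hidden_dim, num_classes),
            )

        def forward(self, visual_feats, text_feats):
            # Concatenate along the feature dimension, then classify.
            fused = torch.cat([visual_feats, text_feats], dim=-1)
            return self.fusion(fused)

    # Usage with random placeholder features (batch of 4).
    model = LateFusionClassifier()
    logits = model(torch.randn(4, 2048), torch.randn(4, 768))
    print(logits.shape)  # torch.Size([4, 10])

Coordination approaches, by contrast, keep the two representations in separate but related spaces and connect them through similarity objectives such as canonical correlation analysis or contrastive losses.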
Modality alignment
- Latent alignment approaches: Attention models, multimodal transformers, multi-instance learning (see the cross-modal attention sketch after this list).
- Explicit alignment: Dynamic time warping.
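The sketch below illustrates latent alignment with cross-modal attention: word features act as queries over a set of visual region features, and the attention weights form a soft word-to-region alignment. All dimensions and sequence lengths are made up for the example.

    import torch
    import torch.nn as nn

    d_model = 256
    cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)

    text_feats = torch.randn(2, 12, d_model)    # batch of 2 sentences, 12 tokens each
    visual_feats = torch.randn(2, 36, d_model)  # 36 visual regions per image

    # Queries come from language; keys and values come from vision.
    grounded, attn_weights = cross_attn(query=text_feats, key=visual_feats, value=visual_feats)
    print(grounded.shape)      # torch.Size([2, 12, 256]): visually grounded word features
    print(attn_weights.shape)  # torch.Size([2, 12, 36]): soft word-to-region alignment

Explicit alignment methods such as dynamic time warping instead search for a (typically monotonic) alignment path between two given sequences rather than learning the correspondence latently.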
Multimodal reasoning
- Hierarchical and graphical representations.
- Leveraging external data: external knowledge bases, commonsense reasoning.
Multimodal co-learning & generation
- Modality transfer: Cross-modal domain adaptation, few-shot learning.
- Compression (multimodal summarization), transduction (multimodal style transfer), and creation (multimodal conditional generation).
Multimodal quantification
- Dataset biases: social biases, spurious correlations.
- Model biases: modality collapse, robustness, optimization challenges, interpretability.
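As one hypothetical quantification probe (not prescribed by the course), the sketch below measures how much a fused model relies on the text modality by zeroing it out and counting how many predictions change.

    import torch
    import torch.nn as nn

    # A toy fused classifier over concatenated visual (2048-d) and text (768-d) features.
    model = nn.Sequential(nn.Linear(2048 + 768, 512), nn.ReLU(), nn.Linear(512, 10))

    visual = torch.randn(32, 2048)
    text = torch.randn(32, 768)

    with torch.no_grad():
        full = model(torch.cat([visual, text], dim=-1)).argmax(dim=-1)
        no_text = model(torch.cat([visual, torch.zeros_like(text)], dim=-1)).argmax(dim=-1)

    # Fraction of predictions that flip when text is ablated: a crude indicator
    # of the model's reliance on the text modality.
    print((full != no_text).float().mean().item())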
Future directions and conclusion
References
15-week course on Multimodal Machine Learning, including all video lectures:
https://cmu-multicomp-lab.github.io/mmml-course/fall2020/
Baltrušaitis, T., Ahuja, C., & Morency, L. P. (2018). Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2), 423-443.
https://arxiv.org/abs/1705.09406
Reading list and lecture slides for the Fall 2021 edition of the CMU Multimodal Machine Learning course:
https://piazza.com/cmu/fall2021/11777/resources
Pre-requisites
We expect the audience to have an introductory background in machine learning and deep learning, including basic familiarity with commonly used unimodal building blocks such as convolutional, recurrent, and self-attention models. We also expect an understanding of math, CS, and programming at an introductory graduate level.
Short bio
Louis-Philippe Morency is an Associate Professor in the Language Technologies Institute at Carnegie Mellon University, where he leads the Multimodal Communication and Machine Learning Laboratory (MultiComp Lab). He was formerly research faculty in the Computer Science Department at the University of Southern California and received his Ph.D. from the MIT Computer Science and Artificial Intelligence Laboratory. His research focuses on building the computational foundations that enable computers to analyze, recognize, and predict subtle human communicative behaviors during social interactions. He has received numerous awards, including AI's 10 to Watch from IEEE Intelligent Systems, the NetExplo Award in partnership with UNESCO, and 10 best-paper awards at IEEE and ACM conferences. His research has been covered by media outlets such as The Wall Street Journal, The Economist, and NPR. He currently chairs the advisory committee of the ACM International Conference on Multimodal Interaction and is an associate editor of IEEE Transactions on Affective Computing.