Yao Wang
[introductory/intermediate] Deep Learning for Computer Vision
Summary
This course is aimed at relative beginners in using deep learning to solve computer vision problems. We will start with the basics of supervised learning, then focus on convolutional networks for a variety of computer vision applications, and end with self-supervised learning as a way to overcome the challenge of limited annotated data. We will introduce fundamental concepts in convolutional networks along the way.
Syllabus
- Supervised learning basics: neural network classifiers, training a neural network by minimizing a loss function, gradient descent through back-propagation, stochastic gradient descent, data preprocessing and regularization, training/validation/testing pipelines.
- Convolutional networks for image recognition: why use 2D convolutions and many layers, multichannel 2D convolution, spatial dimension reduction through pooling, evolution of network structures (VGG, ResNet, DenseNet, attention, non-local networks, Vision Transformer). Data augmentation and transfer learning to handle limited data.
- Convolutional networks for video and medical volumetric data: using 3D convolution layers.
- Interpretation of trained networks: gradient-based methods, class activation maps (CAM).
- Fully convolutional networks for image-to-image mapping: auto-encoder, multi-resolution auto-encoder (U-Net, V-Net). Applications in image denoising, segmentation, and super-resolution.
- Convolutional networks for object detection (Faster R-CNN, YOLO), instance segmentation (Mask R-CNN), and object tracking.
- Other computer vision tasks: body pose estimation (generating body skeleton), depth estimation from binocular and monocular images, motion estimation, video prediction and interpolation.
- Video processing through recurrent convolutional networks: convolutional LSTM, applications to action recognition, object tracking, and video prediction.
- Overcoming limited data through self-supervision: contrastive and non-contrastive energy-based methods, masked auto-encoders, multi-modality supervision (image, text, and audio).
Pre-requisites
Enrolled students should have basic knowledge of linear algebra, statistics, and probability. Prior exposure to classical image processing and computer vision is a plus but is not required.
Short bio
Yao Wang is a Professor at New York University Tandon School of Engineering (formerly Polytechnic University, Brooklyn, NY), with a joint appointment in the Departments of Electrical and Computer Engineering and Biomedical Engineering. She has also served as Associate Dean for Faculty Affairs at NYU Tandon since June 2019. Her research areas include video coding and streaming, multimedia signal processing, computer vision, and medical imaging. She is the lead author of the textbook Video Processing and Communications, and has published over 250 papers in journals and conference proceedings. She received the New York City Mayor’s Award for Excellence in Science and Technology in the Young Investigator Category in 2000. She was elected Fellow of the IEEE in 2004 for contributions to video processing and communications. She received the IEEE Communications Society Leonard G. Abraham Prize Paper Award in the Field of Communications Systems in 2004, and the IEEE Communications Society Multimedia Communication Technical Committee Best Paper Award in 2011. She was a keynote speaker at the 2010 International Packet Video Workshop, the 2014 INFOCOM Workshop on Contemporary Video, the 2018 Picture Coding Symposium, and the 2020 ACM Multimedia Systems Conference (MMSys’20). She received the NYU Tandon Distinguished Teacher Award in 2016.