Human perception relies to a large extent on vision, and we experience our vision-based understanding of the world as something intuitive, natural, and simple. The key to our visual perception is a feature abstraction pipeline that already starts in the retina. Thanks to this learned feature representation, we are able to recognize objects and scenes from just a few examples, navigate and manipulate objects in 3D using vision, and understand poses and gestures in real time. Modern machine learning has brought similar capabilities to computer vision, and the key to this progress lies in the internal feature representations that are learned from generic or problem-specific datasets to solve a wide range of classification and regression problems.
Course type:
- AS track: elective
- AI track: elective
- Joint Curriculum: advanced
Time: Given even years, Autumn
Teachers: Fredrik Lindsten (LiU), Michael Felsberg (LiU), Per-Erik Forssén (LiU)
Examiner: Per-Erik Forssén (LiU)
The participants are assumed to have a background in mathematics corresponding to the contents of the WASP-course “Mathematics and Machine Learning”.
Modules 1 and 3: Knowledge of calculus, linear algebra, and especially probability theory is very helpful. A basic understanding of machine learning is preferred. Programming skills in any language.
Module 2: Knowledge of advanced linear algebra and of the basics of machine learning, signal processing, and image analysis is required. Programming skills in Python and NumPy.
We recommend that you refresh your knowledge of signal processing (convolution, correlation, Fourier transform, complex functions), optimization (ridge regression), and differential and integral calculus before the course.
Module 1. Students will gain a principled understanding of how unsupervised and self-supervised methods learn compressed, informative, and transferable representations from data. They will be able to compare modern approaches and assess the trade-offs in their theoretical foundations and practical implementations.
Module 2. Students will be able to use concepts from learning-based computer vision, such as generative and discriminative models, invariance and equivariance, and open-world problems, in the design of algorithms. They will also be able to implement state-of-the-art algorithms for visual object tracking.
Module 3. Students will be able to recognize and explain many useful relations in 3D geometry and projective geometry and understand how these can be incorporated into deep neural networks.
Module 1. Self-Supervised Representation Learning, Fredrik Lindsten
A central goal of machine learning is to learn how to extract meaningful, generic features from data that capture underlying structure and remain useful across a range of downstream tasks. In this module, we explore a variety of principled approaches to learning such representations in an unsupervised or self-supervised setting. We will examine how different model families and training objectives shape the representations that emerge, and what theoretical and practical trade-offs they involve. The emphasis will be on modern self-supervised approaches as well as the estimation and optimization techniques that make these methods tractable.
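The module does not commit to one specific training objective here. As a hedged illustration of what a self-supervised objective can look like, the sketch below implements a simplified, one-directional InfoNCE (contrastive) loss of the kind used by methods such as SimCLR. It assumes that row `i` of `z1` and row `i` of `z2` are embeddings of two augmented views of the same sample and that all other rows act as negatives; the name `info_nce_loss` and the default `temperature` are illustrative choices, not taken from the course material.

```python
import numpy as np


def info_nce_loss(z1: np.ndarray, z2: np.ndarray, temperature: float = 0.1) -> float:
    """Simplified one-directional InfoNCE loss (illustrative sketch).

    Assumption: row i of z1 and row i of z2 embed two views of the same sample;
    every other row of z2 serves as a negative for row i of z1.
    """
    # L2-normalize so that dot products become cosine similarities
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)

    # (N, N) similarity matrix scaled by the temperature
    logits = z1 @ z2.T / temperature

    # Numerically stable log-softmax over each row
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Positive pairs lie on the diagonal; average their negative log-probability
    return float(-np.mean(np.diag(log_probs)))
```

Minimizing this loss pulls the two views of each sample together (the diagonal of the similarity matrix) while pushing the other pairs apart. Nearly identical views give a loss close to zero, whereas unrelated embeddings give a clearly larger value.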
Module 2. Learning of discriminative models, Michael Felsberg
Visual representations can be categorized into generative and discriminative models, depending on whether they represent visual appearance explicitly or implicitly. An explicit representation is typically an image patch or a part of a feature map from a deep network. Implicit representations are dual to image patches or feature maps, in the sense that they are optimal for a discriminative task, such as localization, detection, or classification. In particular, we will look into problems in video analysis: object tracking and segmentation. We will consider various techniques, such as correlation filters and vision transformers.
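As a hedged sketch of how a correlation filter acts as a discriminative model, the code below follows a MOSSE-style formulation (Bolme et al., 2010): the filter is trained by ridge regression in the Fourier domain so that correlating it with the target patch reproduces a Gaussian peaked at the target centre. This is an assumed, simplified version with a single grayscale channel, a single training sample, and no cosine windowing or online updates; the helper names `gaussian_response`, `train_filter`, and `detect` and the parameters `sigma` and `lam` are hypothetical.

```python
import numpy as np


def gaussian_response(shape, sigma=2.0):
    """Desired filter output: a 2D Gaussian peaked at the patch centre."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = h // 2, w // 2
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))


def train_filter(patch, sigma=2.0, lam=1e-2):
    """Closed-form ridge-regression filter, solved independently per frequency.

    Returns conj(H) = G * conj(F) / (F * conj(F) + lam), where F and G are the
    spectra of the training patch and of the desired Gaussian response.
    """
    F = np.fft.fft2(patch)
    G = np.fft.fft2(gaussian_response(patch.shape, sigma))
    return (G * np.conj(F)) / (F * np.conj(F) + lam)


def detect(H_conj, patch):
    """Correlate a (new) patch with the filter and return the peak as (row, col)."""
    response = np.real(np.fft.ifft2(np.fft.fft2(patch) * H_conj))
    row, col = np.unravel_index(np.argmax(response), response.shape)
    return int(row), int(col)
```

Training on a patch and then applying `detect` to a circularly shifted copy of it (for example with `np.roll`) moves the response peak by the same shift, which is how such a tracker localizes the target in the next frame. The element-wise division is what makes the approach fast: ridge regression over all circular shifts of the patch decouples into independent scalar problems, one per frequency.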
Module 3. Representations and 3D Geometry, Per-Erik Forssén
3D geometry and projective geometry are essential aspects of real-world perception for autonomous systems. In this module, we will review results from projective geometry, such as plane-to-plane correspondences, epipolar and oriented epipolar geometry, absolute pose estimation, and more. We will put particular emphasis on how distances and errors are best defined, given geometry and probability theory. This is an important consideration when integrating geometric estimation into deep neural networks, and we will also look at how geometric optimization layers can be defined. Finally, we will look at practical implications of the introduced theory in situations such as learning to estimate absolute pose and learning to perceive depth and 3D structure from video.
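The emphasis on how errors are defined can be made concrete with the epipolar constraint. The hedged sketch below assumes calibrated cameras (intrinsics K = I) and a second camera whose coordinates are given by `X2 = R @ X1 + t`. It builds the essential matrix E = [t]_x R and compares two residuals for a point correspondence: the algebraic residual x2^T E x1, whose value depends on the arbitrary scale of E, and the Sampson error, a standard first-order approximation of the geometric error that is invariant to that scale. The function names are illustrative and not taken from the course material.

```python
import numpy as np


def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])


def essential_matrix(R, t):
    """E = [t]_x R, assuming calibrated cameras and X2 = R @ X1 + t."""
    return skew(t) @ R


def algebraic_error(F, x1, x2):
    """Epipolar residual x2^T F x1 for homogeneous points; scales with F."""
    return float(x2 @ F @ x1)


def sampson_error(F, x1, x2):
    """First-order approximation of the squared geometric (reprojection) error.

    F may be a fundamental matrix or, in normalized coordinates, an essential
    matrix. The result is unchanged if F is multiplied by a nonzero scalar.
    """
    Fx1 = F @ x1      # epipolar line of x1 in image 2
    Ftx2 = F.T @ x2   # epipolar line of x2 in image 1
    r = x2 @ F @ x1
    denom = Fx1[0] ** 2 + Fx1[1] ** 2 + Ftx2[0] ** 2 + Ftx2[1] ** 2
    return float(r ** 2 / denom)
```

For a correct correspondence both residuals vanish. After perturbing `x2` off its epipolar line, multiplying `E` by 10 scales the algebraic residual by 10 but leaves the Sampson error unchanged, which illustrates why the choice of error matters when such residuals are used inside estimation or learning pipelines.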
Module 1: Lecture slides; original papers by Hyvärinen, Gutmann, Hinton, Ng, Zeiler.
Module 2: Papers by Ng & Jordan, Ulusoy & Bishop, Worrall & Welling, van Gool, Moons, Pauwels & Oosterlinck; several papers on tracking, book chapter by Felsberg
Module 3: Lecture slides plus 4 selected papers: (1) Zhou et al., CVPR 2017, (2) Wang et al., ECCV 2020, (3) Järemo Lawin et al., 3DV 2020, (4) Campbell et al., ECCV 2020.
Module 1. Active participation in the seminars. Preparatory questions on the seminar papers. Lecture attendance.
Module 2. Active participation in the seminars, hand-in of preparatory tasks on the seminar papers, and a project with a report.
Module 3. Active participation in the seminars. Preparatory questions on the seminar papers. Lecture attendance.
Two extra hp can also be obtained by handing in an extended project report in January after the end of the course.
Syllabus (Kursplan)
Course page
If you are not a student at KTH, you must log in via https://canvas.kth.se/login/canvas