Human perception relies to a large extent on vision, and we experience our vision-based understanding of the world as intuitive, natural, and simple. The key to our visual perception is a feature abstraction pipeline that starts already in the retina. Thanks to this learned feature representation we are able to recognize objects and scenes from just a few examples, perform visual 3D navigation and manipulation, and understand poses and gestures in real time. Modern machine learning has brought similar capabilities to computer vision, and the key to this progress is the internal feature representations that are learned from generic or problem-specific datasets to solve a wide range of classification and regression problems.

Course type:

  • AS track: elective
  • AI track: elective
  • Joint Curriculum: advanced

Time: Given in even years, autumn

Teachers: Christopher Zach (CTH), Michael Felsberg, Per-Erik Forssen (LiU)

Examiner: Per-Erik Forssen (LiU)

The participants are assumed to have a background in mathematics corresponding to the contents of the WASP-course “Mathematics and Machine Learning”.

Modules 1 and 3: Knowledge of calculus, linear algebra, and especially probability theory is very helpful. A basic understanding of machine learning is preferred. Programming skills in any language.

Module 2: Knowledge of advanced linear algebra, basics of machine learning, signal processing, and image analysis is required. Programming skills in Python+Numpy.

We recommend that you refresh your knowledge of signal processing (convolution, correlation, Fourier transform, complex functions), optimization (ridge regression), and differential and integral calculus before the course.

Module 1. Understand and implement several unsupervised approaches to feature learning, including score matching, noise-contrastive estimation, latent variable models such as (deep) Boltzmann machines, variational auto-encoders, and contrastive learning of feature representations.

Module 2. Be able to use concepts from learning in computer vision, such as generative and discriminative models, invariance and equivariance, and open-world problems, in the design of algorithms. Implement state-of-the-art algorithms for visual object tracking.

Module 3. Recognize and explain many useful relations in 3D geometry and projective geometry and understand how they can be incorporated in deep neural networks.

Module 1. One of the main goals of representation learning is to learn how to extract, from given data, generic features that are valid for a range of tasks. In this module we first focus on energy-based models for representation learning, which are closely linked to unsupervised learning. We will discuss the restricted Boltzmann machine and its variants. One particular emphasis is on how to estimate the parameters of energy-based models, since straightforward maximum likelihood estimation is not suitable for these models. We will show how proper scoring rules (specifically score matching and noise-contrastive estimation) can be used to estimate the parameters of energy-based models, and how they are connected with auto-encoders. We also discuss recent unsupervised and weakly supervised contrastive approaches for representation learning.
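
As a minimal illustration of noise-contrastive estimation, the sketch below fits an unnormalized 1-D Gaussian model by logistic regression between data samples and samples from a known noise distribution; the parameter c learns the negative log-normalizer. The toy model, the standard-normal noise distribution, and all numeric settings are illustrative assumptions, not course material.

  import numpy as np

  rng = np.random.default_rng(0)

  # Toy data: 1-D Gaussian with unknown mean (true mean = 2.0, assumed).
  x_data = rng.normal(loc=2.0, scale=1.0, size=5000)
  # Noise samples from a standard normal (the "contrastive" distribution).
  x_noise = rng.normal(loc=0.0, scale=1.0, size=5000)

  def log_model(x, mu, c):
      # Unnormalized model: log p_m(x) = -0.5 (x - mu)^2 + c, where c
      # plays the role of the learned negative log-partition function.
      return -0.5 * (x - mu) ** 2 + c

  def log_noise(x):
      # Log-density of the standard-normal noise distribution.
      return -0.5 * x ** 2 - 0.5 * np.log(2 * np.pi)

  mu, c = 0.0, 0.0
  lr = 0.05
  for step in range(2000):
      g_d = log_model(x_data, mu, c) - log_noise(x_data)
      g_n = log_model(x_noise, mu, c) - log_noise(x_noise)
      h_d = 1.0 / (1.0 + np.exp(-g_d))   # P(sample came from data | x)
      h_n = 1.0 / (1.0 + np.exp(-g_n))
      # Gradient ascent on the NCE (logistic classification) objective.
      grad_mu = np.mean((1 - h_d) * (x_data - mu)) - np.mean(h_n * (x_noise - mu))
      grad_c = np.mean(1 - h_d) - np.mean(h_n)
      mu += lr * grad_mu
      c += lr * grad_c

  print(f"estimated mean: {mu:.3f} (true 2.0)")
  print(f"estimated log-normalizer c: {c:.3f} "
        f"(true {-0.5 * np.log(2 * np.pi):.3f})")

Note that, unlike maximum likelihood, this recovers the normalizer c along with the model parameters, which is exactly why NCE is attractive for energy-based models.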

Module 2. Visual representations can be categorized into generative and discriminative models, depending on whether they are supposed to represent visual appearance explicitly or implicitly. An explicit representation is typically an image patch or a part of a feature map from a deep network. Implicit representations are dual to image patches or feature maps, in the sense that they are optimal for a discriminative task, such as localization, detection, or classification. In particular, we will look into the problem of visual object tracking, starting from classical least-squares approaches (Lucas-Kanade), continuing with online-learned discriminative correlation filters, and eventually ending with deep-feature-fusion-based state-of-the-art algorithms and the transition to video object segmentation.
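
As a minimal illustration of the discriminative correlation filters covered here, the sketch below trains a single-channel, MOSSE-style filter in closed form in the Fourier domain and uses it to localize a shifted target. The blob images, the Gaussian label, and the regularization weight are toy assumptions chosen for this example.

  import numpy as np

  def train_dcf(patch, target, lam=1e-2):
      # Closed-form ridge regression in the Fourier domain (MOSSE-style):
      # H = (G . conj(F)) / (F . conj(F) + lam), elementwise.
      F = np.fft.fft2(patch)
      G = np.fft.fft2(target)
      return G * np.conj(F) / (F * np.conj(F) + lam)

  def detect(H, patch):
      # Apply the filter to a new patch; the peak of the response map
      # gives the estimated target position.
      response = np.real(np.fft.ifft2(np.fft.fft2(patch) * H))
      return np.unravel_index(np.argmax(response), response.shape)

  def blob(cy, cx, size=64, sigma=2.0):
      # Toy image: an isotropic Gaussian blob centered at (cy, cx).
      y, x = np.mgrid[0:size, 0:size]
      return np.exp(-((y - cy) ** 2 + (x - cx) ** 2) / (2 * sigma ** 2))

  frame0 = blob(32, 32, sigma=4.0)   # training frame, target at (32, 32)
  label = blob(32, 32, sigma=2.0)    # desired response: sharp peak on target
  H = train_dcf(frame0, label)
  frame1 = blob(37, 29, sigma=4.0)   # target moved by (+5, -3)
  print(detect(H, frame1))           # expected near (37, 29)

The closed-form Fourier-domain solve is what makes this family of trackers fast enough to retrain the filter online, every frame.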

Module 3. 3D geometry and projective geometry are essential aspects of real-world perception for autonomous systems. In this module we will review results from projective geometry, such as plane-to-plane correspondence, epipolar and oriented epipolar geometry, absolute pose estimation, and more. We will put particular emphasis on how distances and errors are best defined, given geometry and probability theory. This is an important consideration when integrating geometric estimation in deep neural networks, and we will also look at how geometric optimization layers can be defined. Finally, we will look at practical implications of the introduced theory for situations such as learning to estimate absolute pose, and learning to perceive depth and 3D structure from video.
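
As a minimal illustration of the epipolar geometry reviewed in this module, the sketch below constructs the essential matrix E = [t]x R for an assumed two-camera setup and verifies that synthetic correspondences satisfy the epipolar constraint x2^T E x1 = 0. The relative pose and the point ranges are arbitrary toy choices.

  import numpy as np

  rng = np.random.default_rng(1)

  def skew(t):
      # Cross-product matrix [t]x such that skew(t) @ v == np.cross(t, v).
      return np.array([[0.0, -t[2], t[1]],
                       [t[2], 0.0, -t[0]],
                       [-t[1], t[0], 0.0]])

  # Two calibrated cameras: the first at the origin, the second rotated
  # slightly about the y-axis and translated along x (toy assumption).
  theta = 0.1
  R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
                [0.0, 1.0, 0.0],
                [-np.sin(theta), 0.0, np.cos(theta)]])
  t = np.array([1.0, 0.0, 0.0])

  # Essential matrix for this relative pose.
  E = skew(t) @ R

  # Random 3-D points in front of both cameras.
  X = rng.uniform([-1.0, -1.0, 4.0], [1.0, 1.0, 8.0], size=(10, 3))

  # Project to normalized image coordinates (calibrated cameras, K = I).
  x1 = X / X[:, 2:3]                 # camera 1: P1 = [I | 0]
  Xc2 = (R @ X.T).T + t              # points in camera-2 coordinates
  x2 = Xc2 / Xc2[:, 2:3]

  # Epipolar constraint: x2^T E x1 = 0 for every correspondence.
  residuals = np.einsum('ni,ij,nj->n', x2, E, x1)
  print(np.max(np.abs(residuals)))   # numerically zero, ~1e-16

With noisy correspondences these residuals no longer vanish, which is where the module's discussion of how distances and errors are best defined comes in.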

Module 1: Lecture slides; original papers by Hyvärinen, Gutmann, Hinton, Ng, Zeiler.

Module 2: Papers by Ng & Jordan, Ulusoy & Bishop, Worrall & Welling, and van Gool, Moons, Pauwels & Oosterlinck; several papers on tracking; book chapter by Felsberg.

Module 3: Lecture slides plus four selected papers: (1) Zhou et al., CVPR 2017; (2) Wang et al., ECCV 2020; (3) Järemo Lawin et al., 3DV 2020; (4) Campbell et al., ECCV 2020.

Module 1. Group project (and report) on representation learning for images.

Module 2. Active participation in the two seminars. Hand-in of preparation tasks on the seminar papers. Project on DCF tracking with a written report.

Module 3. Active participation in the seminars. Preparatory questions on the seminar papers. Lecture attendance.

If you are not a student at KTH you must log in via https://canvas.kth.se/login/canvas