This is a brief overview of the content of the course Learning Feature Representations.

Vision is the human sense that requires the largest cortical capacity for extracting information from the environment. The complexity of vision is such that engineering approaches have achieved only partial success during the past half century, and it is only recently, with the help of deep learning, that the performance of computer vision algorithms has come to exceed human performance on certain tasks. A fundamental concept within vision is the representation in terms of features – but what is a feature, and how did its role change in the transition from engineered systems to learning-based systems? To understand and explain what is going on in state-of-the-art vision systems, deep knowledge of feature representations is required.

The goal of the course is to give participants the theoretical knowledge and practical experience required to use state-of-the-art computer vision methods in their own research – for example, to integrate existing software components for machine perception into an autonomous system, or to understand and improve cutting-edge deep learning computer vision algorithms.

The course is divided into three two-day modules. The following is a non-exhaustive overview of their main topics.

Course Module 1: Energy-based representation learning

One of the main goals of representation learning is to learn how to extract, from given data, generic features that are valid for a range of tasks. In this module we focus on energy-based models for representation learning, which are closely linked to unsupervised learning. We will discuss the restricted Boltzmann machine and its higher-order variants. One particular emphasis is on how to estimate the parameters of energy-based models, since straightforward maximum likelihood estimation is not suitable for these models. We will show how proper scoring rules (specifically score matching and noise-contrastive estimation) can be used to estimate the parameters of energy-based models, and how they are connected with auto-encoders.
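
To make the estimation discussion concrete, here is a minimal sketch of noise-contrastive estimation for a toy one-dimensional Gaussian energy model: a logistic classifier is trained to distinguish data from noise samples, with the log-partition function treated as a free parameter. The parameterization, the noise distribution, and the hand-derived gradient descent are illustrative choices for this sketch, not course material.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data from an unknown Gaussian; the model is an unnormalized Gaussian
# energy E(x) = (x - mu)^2 / (2 sigma^2), with log-partition c learned freely.
x_data = rng.normal(2.0, 0.7, size=2000)

# Fixed, known noise distribution (a wide Gaussian covering the data).
noise_mu, noise_sigma = 0.0, 3.0
x_noise = rng.normal(noise_mu, noise_sigma, size=2000)

def log_noise(x):
    return (-0.5 * ((x - noise_mu) / noise_sigma) ** 2
            - np.log(noise_sigma) - 0.5 * np.log(2 * np.pi))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

mu, log_s, c = 0.0, 1.0, 0.0   # initial parameters
lr = 0.1

for step in range(5000):
    s2 = np.exp(2 * log_s)
    # Classifier logit G(x) = log p_model(x) - log p_noise(x); NCE trains a
    # logistic classifier to tell data samples apart from noise samples.
    def G(x):
        return -0.5 * (x - mu) ** 2 / s2 - c - log_noise(x)
    def dG(x):  # partial derivatives of G w.r.t. (mu, log_s, c)
        return np.stack([(x - mu) / s2,
                         (x - mu) ** 2 / s2,
                         -np.ones_like(x)])
    # Gradient of the logistic loss, derived by hand for this tiny model.
    w_data = sigmoid(-G(x_data))
    w_noise = sigmoid(G(x_noise))
    grad = (-(w_data * dG(x_data)).mean(axis=1)
            + (w_noise * dG(x_noise)).mean(axis=1))
    mu, log_s, c = np.array([mu, log_s, c]) - lr * grad

print(f"mean {mu:.2f}, std {np.exp(log_s):.2f}")  # approximately 2.0 and 0.7
```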

Course Module 2: Learning generative and discriminative appearance models

Visual representations can be categorized into generative and discriminative models, depending on whether they represent visual appearance explicitly or implicitly. An explicit representation is typically an image patch or a part of a feature map from a deep network. Implicit representations are dual to image patches or feature maps, in the sense that they are optimal for a discriminative task such as localization, detection, or classification. In particular, we will look into the problem of visual object tracking, starting from classical least-squares approaches (Lucas-Kanade), continuing with online-learned discriminative correlation filters, and eventually ending with state-of-the-art algorithms based on deep feature fusion.
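
To make the discriminative correlation filter idea concrete, here is a minimal single-frame sketch in the spirit of the MOSSE tracker: the filter is obtained by closed-form ridge regression in the Fourier domain, and tracking reduces to locating the peak of a correlation response. The function names and parameters are illustrative; practical trackers add windowing, feature preprocessing, and online updates over frames.

```python
import numpy as np

def gaussian_peak(shape, sigma=2.0):
    """Desired correlation output: a Gaussian centered in the patch."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((ys - h // 2) ** 2 + (xs - w // 2) ** 2) / (2 * sigma ** 2))

def train_filter(patch, target, lam=1e-2):
    """Closed-form ridge regression in the Fourier domain (MOSSE-style)."""
    F = np.fft.fft2(patch)
    G = np.fft.fft2(target)
    # Element-wise solution of the regularized least-squares problem,
    # returned as conj(H) so that correlation becomes multiplication.
    return G * np.conj(F) / (F * np.conj(F) + lam)

def track(Hc, patch):
    """Correlate the filter with a new patch; return the peak displacement."""
    resp = np.real(np.fft.ifft2(Hc * np.fft.fft2(patch)))
    dy, dx = np.unravel_index(np.argmax(resp), resp.shape)
    return dy - resp.shape[0] // 2, dx - resp.shape[1] // 2

# Demo on a synthetic patch: train on one patch, locate a shifted copy.
rng = np.random.default_rng(0)
patch = rng.normal(size=(64, 64))
Hc = train_filter(patch, gaussian_peak(patch.shape))
shifted = np.roll(patch, shift=(5, -3), axis=(0, 1))
print(track(Hc, shifted))  # close to (5, -3)
```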

Course Module 3: Representations of motion and geometry

3D geometry and projective geometry are essential aspects of real-world perception for autonomous systems. In this module we will look at well-known results from projective geometry, such as plane-to-plane correspondence, general two-view geometry, and absolute pose estimation. As motion is a natural component of an autonomous system, we then move on to generalizations of these results from the static case to the differential and continuous-time cases. We will also look at different ways to parameterize geometric relations, and at the implications these have in practical situations, such as when learning to estimate pose or to perceive depth and/or 3D geometry from video.
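
As a small concrete example of plane-to-plane correspondence, the sketch below estimates a homography from point correspondences with the direct linear transform (DLT). The function name and the four-point demo are illustrative; a practical implementation would additionally normalize the coordinates (Hartley normalization) and handle outliers, e.g. with RANSAC.

```python
import numpy as np

def estimate_homography(pts_src, pts_dst):
    """Direct linear transform: find H such that pts_dst ~ H @ pts_src.

    pts_src, pts_dst: (n, 2) arrays of corresponding points, n >= 4.
    """
    A = []
    for (x, y), (u, v) in zip(pts_src, pts_dst):
        # Each correspondence contributes two linear constraints on H.
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    # The solution is the right singular vector of the smallest singular value.
    _, _, Vt = np.linalg.svd(np.asarray(A))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]  # fix the projective scale ambiguity

# Demo: recover a known homography from four exact correspondences.
H_true = np.array([[1.2, 0.1, 5.0],
                   [-0.05, 0.9, -3.0],
                   [1e-3, 2e-3, 1.0]])
pts = np.array([[0.0, 0.0], [100.0, 0.0], [100.0, 100.0], [0.0, 100.0]])
proj = np.c_[pts, np.ones(4)] @ H_true.T          # map to homogeneous coords
pts_mapped = proj[:, :2] / proj[:, 2:3]
print(np.abs(estimate_homography(pts, pts_mapped) - H_true).max())  # ~0
```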