Pontus Giselsson

Learning from data is becoming increasingly important in many different engineering fields. Models for learning often rely heavily on optimization; training a machine is often done by solving a specific optimization problem, the training problem. These problems are typically large-scale, and the go-to algorithm is stochastic gradient descent (SGD). This module will cover the theoretical foundations of stochastic gradient descent and argue, from a theoretical perspective, why it is preferable to gradient descent on large-scale training problems with many training examples. We will also cover the foundations of convex and non-convex analysis needed to understand the presented theory for stochastic gradient descent, and relate this theory to that of the (proximal) gradient descent method.
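To make the comparison concrete, here is a minimal sketch (not part of the course material) of SGD on a least-squares training problem: each iteration uses the gradient of a single training example instead of the full gradient over all examples, which is what makes the per-iteration cost independent of the number of training examples. The problem data and step size below are illustrative assumptions.

```python
import numpy as np

def sgd(A, b, step, epochs, seed=0):
    """Minimize (1/n) * sum_i 0.5*(a_i^T x - b_i)^2 with one-example stochastic gradients."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(n):
            g = (A[i] @ x - b[i]) * A[i]  # stochastic gradient from training example i only
            x -= step * g
    return x

# small synthetic (noiseless) training problem
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 5))
x_true = rng.standard_normal(5)
b = A @ x_true
x_hat = sgd(A, b, step=0.01, epochs=50)
```

A full gradient step would cost one pass over all 200 examples; the stochastic step above costs one example, which is the scaling argument the module develops in theory.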

2.1 Objective

The objective of this module is to analyze the stochastic gradient descent method and relate the theory to that of the gradient descent method. We will analyze SGD under different problem assumptions, for instance: non-convex but smooth, and convex but not necessarily smooth. The real difficulty arises when analyzing non-smooth and generally non-convex problems, which appear when training deep supervised learning machines with, e.g., ReLU activation functions. We will analyze SGD applied to a limited class of non-smooth and non-convex problems, namely non-smooth problems that are weakly convex (but not necessarily convex).

2.2 Content

We will cover the concepts of convex sets and convex functions, as well as further function properties such as smoothness and strong and weak convexity. We will discuss subgradients, and the proximal operator as a means to evaluate a subgradient. We will also present and analyze the proximal gradient method, a generalization of the gradient method that allows for an additional non-smooth term.
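As a hedged illustration of the proximal gradient method, the sketch below applies it to an l1-regularized least-squares problem, a standard example where the proximal operator of the non-smooth term has a closed form (soft-thresholding). The problem choice, function names, and parameters are assumptions for illustration, not the course's examples.

```python
import numpy as np

def soft_threshold(v, t):
    # prox of t*||.||_1: shrink each component toward zero by t
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_gradient(A, b, lam, step, iters):
    """Minimize 0.5*||A x - b||^2 + lam*||x||_1 by the proximal gradient method:
    a gradient step on the smooth term, then a prox step on the non-smooth term."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)                         # gradient of the smooth term
        x = soft_threshold(x - step * grad, step * lam)  # prox of the non-smooth term
    return x

# illustrative sparse recovery problem
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20))
x_true = np.zeros(20)
x_true[:3] = [1.0, -2.0, 3.0]
b = A @ x_true
step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1/L, L = Lipschitz constant of the gradient
x_hat = prox_gradient(A, b, lam=0.1, step=step, iters=500)
```

With lam = 0, the prox step is the identity and the method reduces to plain gradient descent, which is the sense in which the proximal gradient method generalizes it.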

We will see that the definition of the proximal operator gives rise to the so-called Moreau envelope, which we will use in the analysis of the stochastic gradient descent method under weak convexity. We will also describe backpropagation, which is used in deep learning to compute the gradient with respect to one training example, and highlight how different network design choices affect the gradient computation.
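A small sketch of the Moreau envelope, for the standard textbook case f = |.| (an assumption for illustration): the envelope env_f(x) = min_z { f(z) + (1/(2*gamma))*(z - x)^2 } is evaluated by plugging in the minimizer, which is exactly the proximal point of x. For the absolute value the result is the Huber function, a smooth approximation of f.

```python
import numpy as np

def prox_abs(x, gamma):
    # prox of gamma*|.|: soft-thresholding of a scalar
    return np.sign(x) * np.maximum(np.abs(x) - gamma, 0.0)

def moreau_envelope_abs(x, gamma):
    """Moreau envelope of f = |.| with parameter gamma:
    env(x) = min_z |z| + (z - x)^2 / (2*gamma),
    evaluated at the minimizer z = prox_abs(x, gamma)."""
    z = prox_abs(x, gamma)
    return np.abs(z) + (z - x) ** 2 / (2 * gamma)

# e.g. with gamma = 1: quadratic x^2/2 near zero, |x| - 1/2 away from zero
print(moreau_envelope_abs(0.5, 1.0))  # prints 0.125  (= 0.5**2 / 2)
print(moreau_envelope_abs(3.0, 1.0))  # prints 2.5    (= 3 - 0.5)
```

The gradient of the envelope is (x - prox(x)) / gamma, which is one reason the envelope is a useful smooth surrogate in the weakly convex analysis of SGD.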

Presumed prerequisites: basic knowledge of (convex) optimization and machine learning.

The course material will consist of short flipped-classroom-style videos as well as hand-in exercises.

The course will probably be taught online. Students will watch the video lectures at home. We will have online Q&A sessions for the video material and exercise sessions where students can ask for help with hand-ins.

There will be a mandatory hand-in.

This year’s course page is found here:

https://kth.instructure.com/courses/29062