This is a brief overview of the content of the course Scalable Data Science and Distributed Machine Learning.
The course is given in three modules.
Module 1 – Introduction to data science & analysis of parallel algorithms.
An introduction to the theoretical fundamentals of parallel algorithms and their runtime analysis on a parallel random access machine model is complemented by an introduction to Apache Spark.
Theoretical topics: sequential & parallel random access machine models, work-depth models, Brent’s theorem, scheduling, sum, all-prefix sum, sorting, matrix multiplication, minimum spanning trees, iterative solution of linear systems, and unconstrained and constrained optimization, including gradient descent, stochastic gradient descent & Hogwild.
Practical topics: introduction to the data science process, Apache Spark and Scala.
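As a small illustration of the work-depth view and the all-prefix sum listed among the theoretical topics, the following sketch simulates a recursive parallel prefix sum sequentially in Python and counts its work (total additions) and depth (parallel rounds). It is not taken from the course materials; the function names `prefix_sum` and `brent_bound` are hypothetical, and counting each level as two rounds is a simplifying assumption.

```python
from typing import List, Tuple


def prefix_sum(xs: List[int]) -> Tuple[List[int], int, int]:
    """Return (inclusive prefix sums, work, depth).

    Sequential simulation of a recursive parallel scan. Work counts
    additions; depth counts parallel rounds (two per recursion level,
    an assumption made for this sketch).
    """
    n = len(xs)
    if n <= 1:
        return list(xs), 0, 0

    # Round 1: add adjacent pairs; all additions are independent.
    pairs = [xs[i] + xs[i + 1] for i in range(0, n - 1, 2)]

    # Recurse on the half-sized problem.
    sub, sub_work, sub_depth = prefix_sum(pairs)

    # Round 2: fill in the result; again every position is independent.
    out = [0] * n
    out[0] = xs[0]
    for i in range(1, n):
        if i % 2 == 1:
            # A prefix ending at an odd index is a prefix of the pair sums.
            out[i] = sub[i // 2]
        else:
            # A prefix ending at an even index needs one extra addition.
            out[i] = sub[i // 2 - 1] + xs[i]

    work = sub_work + len(pairs) + (n - 1) // 2
    depth = sub_depth + 2
    return out, work, depth


def brent_bound(work: int, depth: int, p: int) -> float:
    """Upper bound on running time with p processors: W / p + D."""
    return work / p + depth
```

For an input of length n the work stays below 2n while the depth grows only logarithmically, so the bound from Brent's theorem, W / p + D, is dominated by W / p until the number of processors p becomes large relative to n / log n.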
Module 2 – Introduction to distributed algorithms for data science & machine learning.
An introduction to the analysis of distributed algorithms running on a cluster of machines is complemented by implementations in Apache Spark’s ecosystem.
Theoretical topics: distributed work-depth models, communication complexity analysis, distributed summation, sorting, joining & optimization, the map-reduce model, PageRank, and distributed linear algebra.
Practical topics: using distributed algorithms in the core, SQL and ML libraries of Apache Spark.
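To make the map-reduce model named in the topics above concrete, here is a minimal single-machine sketch in plain Python that counts words across several partitions. It does not use Apache Spark and is not taken from the course materials; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and each inner list of strings stands in for the data held by one machine.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_phase(partition: Iterable[str]) -> List[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one partition."""
    return [(word.lower(), 1) for line in partition for word in line.split()]


def shuffle(mapped: Iterable[List[Tuple[str, int]]]) -> Dict[str, List[int]]:
    """Group values by key; on a cluster this is the communication step."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Combine the values of each key with an associative operation (sum)."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(partitions: Iterable[Iterable[str]]) -> Dict[str, int]:
    """Run map, shuffle and reduce over all partitions."""
    mapped = [map_phase(p) for p in partitions]  # independent per partition
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    data = [["spark scala spark"], ["Scala data"]]
    print(word_count(data))
```

Only the shuffle moves data between partitions, which is why communication complexity analysis concentrates on that step; the map and reduce steps operate on local data.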
Module 3 – Diving deeper
To help inspire student group projects, practical pathways will be provided for students to become familiar with scalable data science processes and distributed machine learning pipelines for typical decision problems, including estimation, prediction and testing, in various domains. The last module is an opportunity to dive deeper, in a small group of 2 to 4 peers, into a problem, domain and/or method of interest to the group.
Teaching and working methods
The course includes three 2-day meetings with intense interactions on-site, typically a mixture of lectures, exercises and other activities. Students are expected to go through the provided lecture materials, code notebooks, references and videos ahead of the meetings for a flipped classroom experience. The course is self-contained with a chapter on preliminaries.