Machine learning aims to extract knowledge from data and enables a wide range of applications. With datasets rapidly growing in size and complexity, learning techniques are fast becoming a core component of large-scale data processing pipelines. This course introduces the underlying statistical and algorithmic principles required to develop scalable real-world machine learning pipelines. We present an integrated view of data processing by highlighting the various components of these pipelines, including feature extraction, supervised learning, model evaluation, and exploratory data analysis. Students will gain hands-on experience applying these principles by using Apache Spark to implement several scalable learning pipelines.



  • Introduction to Apache Spark
  • Linear Regression and Distributed Machine Learning Principles
  • Logistic Regression and Click-through Rate Prediction
  • Principal Component Analysis and Neuroimaging


Lab 1: NumPy, Linear Algebra, and Lambda Function Review. Gain hands-on experience using NumPy, Python’s scientific computing library, to manipulate matrices and vectors, and learn about lambda functions, which are used throughout the course.
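
As a rough illustration of the kind of operations this lab reviews, the sketch below builds a small matrix and vector, computes a matrix-vector product and an inner product, and applies a lambda function with map. The values and variable names are made up for illustration and are not taken from the lab itself.

```python
import numpy as np

# Arbitrary example values (assumption: not from the course materials).
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x = np.array([1.0, -1.0])

Ax = A @ x        # matrix-vector product
dot = x.dot(x)    # inner product of x with itself
scaled = 2 * x    # element-wise scaling

# A lambda is an anonymous, single-expression function.
square = lambda v: v * v
squares = list(map(square, [1, 2, 3]))
```

Lambdas like square are convenient when a short function is passed to another function, which is how they typically appear in Spark transformations.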

Lab 2: Learning Apache Spark. Complete your first course lab, in which you will learn about the Spark data model, transformations, and actions, and write a word-counting program to count the words in all of Shakespeare’s plays.
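
The word-counting pattern can be sketched without a cluster. The snippet below uses plain Python rather than Spark, so it is only an analogy: the three steps mirror what would typically be written in Spark with the RDD methods flatMap, map, and reduceByKey. The two input lines are a stand-in for the real text.

```python
from collections import defaultdict
from functools import reduce
from operator import add

# Stand-in input (assumption: the lab reads the full text of the plays).
lines = ["to be or not to be", "that is the question"]

# Step 1 (like flatMap): split every line into words.
words = [w for line in lines for w in line.split()]

# Step 2 (like map): pair each word with a count of 1.
pairs = [(w, 1) for w in words]

# Step 3 (like reduceByKey): sum the counts for each distinct word.
grouped = defaultdict(list)
for word, count in pairs:
    grouped[word].append(count)
counts = {word: reduce(add, cs) for word, cs in grouped.items()}
```

In Spark, the first two steps would be lazy transformations, and nothing would be computed until an action requested the results.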

Lab 3: Millionsong Regression Pipeline. Develop an end-to-end linear regression pipeline to predict the release year of a song given a set of audio features. You will implement a gradient descent solver for linear regression, use Spark’s machine learning library (mllib) to train additional models, tune models via grid search, improve accuracy using quadratic features, and visualize various intermediate results to build intuition.
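
A minimal single-machine sketch of gradient descent for least-squares linear regression is shown below. It assumes a fixed step size, a zero initial weight vector, and synthetic noiseless data; the function name gradient_descent and all parameter values are illustrative choices, and the solver built in the lab may differ (for example, in its step-size schedule or in how the gradient is summed across a distributed dataset).

```python
import numpy as np

def gradient_descent(X, y, num_iters=500, alpha=0.1):
    """Minimize the mean squared error of X @ w against y (illustrative)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(num_iters):
        residuals = X @ w - y          # prediction errors, shape (n,)
        grad = X.T @ residuals / n     # gradient of 0.5 * mean squared error
        w -= alpha * grad              # fixed step size (assumption)
    return w

# Synthetic data (assumption: not the song dataset used in the lab).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -3.0])
y = X @ true_w

w = gradient_descent(X, y)
```

The gradient is a sum of per-example terms, which is what makes this computation a natural fit for a distributed map-and-sum over the data.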

Lab 4: Click-through Rate Prediction Pipeline. Construct a logistic regression pipeline to predict click-through rate using data from a recent Kaggle competition. You will extract numerical features from the raw categorical data using one-hot encoding, reduce the dimensionality of these features via hashing, train logistic regression models using mllib, tune hyperparameters via grid search, and interpret probabilistic predictions via an ROC plot.
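
The two feature-extraction steps named above can be sketched in plain Python. The sketch assumes each record is a list of (feature_id, value) pairs; the helper names one_hot and hash_features, the toy records, and the use of MD5 as the hash function are all illustrative assumptions rather than the lab's actual code.

```python
import hashlib

def one_hot(raw_features, ohe_dict):
    """Return the sorted indices of the active one-hot features."""
    return sorted(ohe_dict[f] for f in raw_features if f in ohe_dict)

def hash_features(raw_features, num_buckets):
    """Map each (feature_id, value) pair to a bucket and count collisions."""
    counts = {}
    for feature_id, value in raw_features:
        key = f"{feature_id}:{value}".encode("utf-8")
        # MD5 gives a hash that is stable across runs (illustrative choice).
        bucket = int(hashlib.md5(key).hexdigest(), 16) % num_buckets
        counts[bucket] = counts.get(bucket, 0) + 1
    return counts

# Toy categorical records (assumption: not the competition data).
data = [
    [(0, "mouse"), (1, "black")],
    [(0, "cat"), (1, "tabby")],
    [(0, "mouse"), (1, "tabby")],
]

# One-hot encoding needs a dictionary over all distinct (id, value) pairs.
distinct = sorted({f for row in data for f in row})
ohe_dict = {f: i for i, f in enumerate(distinct)}
encoded = [one_hot(row, ohe_dict) for row in data]

# Hashing needs no dictionary; the dimension is fixed by num_buckets.
hashed = [hash_features(row, 4) for row in data]
```

The one-hot dictionary grows with the number of distinct categories, whereas hashing fixes the feature dimension up front at the cost of occasional collisions.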

Lab 5: Neuroimaging Analysis via PCA. Identify patterns of brain activity in larval zebrafish. You will work with time-varying images (generated using a technique called light-sheet microscopy) that capture a zebrafish’s neural activity as it is presented with a moving visual pattern. After implementing distributed PCA from scratch and gaining intuition by working with synthetic data, you will use PCA to identify distinct patterns across the zebrafish brain that are induced by different types of stimuli.
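
As a single-machine sketch of the idea, PCA can be computed from the eigendecomposition of the sample covariance matrix. The code below runs on synthetic two-dimensional data; the function name pca and the data are illustrative assumptions, and the distributed version built in the lab would compute the covariance matrix across a cluster rather than in one NumPy call.

```python
import numpy as np

def pca(data, k):
    """Return the top-k principal components, the scores, and all eigenvalues."""
    centered = data - data.mean(axis=0)
    cov = centered.T @ centered / data.shape[0]   # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]             # reorder to descending
    components = eigvecs[:, order[:k]]            # one component per column
    scores = centered @ components                # data projected onto components
    return components, scores, eigvals[order]

# Synthetic data lying almost exactly along the direction (1, 2).
rng = np.random.default_rng(0)
t = rng.normal(size=500)
data = np.column_stack([t, 2 * t + 0.01 * rng.normal(size=500)])

components, scores, eigvals = pca(data, k=1)
```

Here the first component should align, up to sign, with the direction (1, 2), since nearly all of the variance lies along that line.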

