Organizations use their data for decision support and to build data-intensive products and services, such as recommendation, prediction, and diagnostic systems. The collection of skills required by organizations to support these functions has been grouped under the term Data Science. This course will attempt to articulate the expected output of Data Scientists and then teach students how to use PySpark (part of Apache Spark) to deliver against these expectations. The course assignments include Log Mining, Textual Entity Resolution, and Collaborative Filtering exercises that teach students how to manipulate datasets using parallel processing with PySpark.
- Big Data and Data Science
- Introduction to Apache Spark
- Data Management
- Data Quality, Exploratory Data Analysis, and Machine Learning
- Lab 1: Learning Apache Spark – perform your first course lab where you will learn about the Spark data model, transformations, and actions, and write a word counting program to count the words in all of Shakespeare’s plays.
- Lab 2: Web Server Log Analysis with Apache Spark – use Spark to explore a NASA Apache web server log in the second course lab.
- Lab 3: Text Analysis and Entity Resolution – perform text analysis and entity resolution on Google and Amazon product listings using Spark in the third course lab.
- Lab 4: Introduction to Machine Learning with Apache Spark – use Spark’s MLlib machine learning library to perform collaborative filtering on a movie dataset in the fourth course lab.
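To give a feel for the word counting exercise in Lab 1, here is a minimal sketch in plain Python, not the lab's actual solution. It mirrors the shape of a typical Spark word count (split lines into words, then sum a count per word) without needing a Spark installation. The function name `count_words` and the sample lines are hypothetical, chosen only for illustration; in PySpark the same steps are usually expressed with the RDD methods `flatMap`, `map`, and `reduceByKey`.

```python
import re
from collections import Counter


def count_words(lines):
    """Count word occurrences across an iterable of text lines.

    Hypothetical helper for illustration; not part of the course material.
    """
    # Analogue of a "flatMap" transformation: each line yields many words.
    words = (
        word
        for line in lines
        for word in re.findall(r"[a-z']+", line.lower())
    )
    # Analogue of "map" to (word, 1) followed by "reduceByKey" with addition:
    # Counter adds 1 for every occurrence of each word.
    return Counter(words)


# Illustrative input only; the lab itself works on Shakespeare's plays.
sample = ["To be, or not to be", "that is the question"]
counts = count_words(sample)
```

In Spark, the transformations would be evaluated lazily and only run when an action such as `collect` or `take` is called; this sketch runs eagerly on a single machine, so it shows the logic but not the parallelism.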
The content in this course includes notes and content created by Dan Bruckner, John Canny, Sameer Farooqui, Michael Franklin, Paco Nathan, Kay Ousterhout, Evan Sparks, Shivaram Venkataraman, Patrick Wendell, and Matei Zaharia.