tweichle/Spark-for-Big-Data

Spark-for-Big-Data

Udacity Course

This repository demonstrates how to use Spark to work with big data and build machine learning models at scale.

Goals

  • Practice processing and cleaning datasets to get comfortable with Spark’s SQL and DataFrame APIs (Spark SQL, PySpark).
  • Debug and optimize for data skew when running jobs on a cluster.
  • Use Spark’s Machine Learning Library (MLlib) to train machine learning models at scale.
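
To make the data skew goal concrete, here is a minimal sketch in plain Python (not Spark code, and not taken from the course material) that simulates how rows are distributed across partitions by key hash, and how appending a random "salt" to each key spreads a hot key more evenly. The partition count, salt count, dataset, and the helper names `partition_for` and `partition_loads` are all assumptions made up for illustration; a real Spark job uses its own partitioner.

```python
import random
import zlib
from collections import Counter

NUM_PARTITIONS = 8  # hypothetical number of partitions
NUM_SALTS = 8       # hypothetical number of salt buckets


def partition_for(key: str) -> int:
    """Deterministic stand-in for a hash partitioner (assumption, not Spark's)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS


def partition_loads(keys):
    """Count how many rows land on each simulated partition."""
    return Counter(partition_for(k) for k in keys)


# Made-up skewed dataset: one hot key plus a long tail of rare keys.
keys = ["user_0"] * 9000 + [f"user_{i}" for i in range(1, 1001)]

# Salting: append a random suffix so one hot key becomes several distinct keys.
rng = random.Random(42)
salted_keys = [f"{k}#{rng.randrange(NUM_SALTS)}" for k in keys]

plain = partition_loads(keys)
salted = partition_loads(salted_keys)

print("max rows per partition, plain: ", max(plain.values()))
print("max rows per partition, salted:", max(salted.values()))
```

Without salting, every row for the hot key lands on a single partition, so one task does most of the work. With salting, the same rows are split across several partitions, at the cost of a second aggregation step to combine the per-salt partial results.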