
PySpark


Spark is a great tool for performing exploratory data analysis at scale, building machine learning pipelines, and creating ETLs for a data platform. Spark is a must for anyone dealing with Big Data. Using PySpark (the Python API for Spark) to process large amounts of data in a distributed fashion is a great way to manage large-scale, data-heavy tasks and gain business insights without sacrificing developer efficiency. In a few words, PySpark is a fast and powerful framework for performing massive distributed processing over resilient sets of data.


Motivation

I felt that any organization that deals with big data and data warehousing needs some kind of distributed system. Being one of the most widely used distributed systems, Spark is capable of handling several petabytes of data at a time, distributed across a cluster of thousands of cooperating physical or virtual servers. But most importantly, it's simple and fast.

I thought data professionals could benefit from learning its logistics and actual usage. Spark also offers a Python API for easy data managing with Python (Jupyter). So, I have created this repository to show several examples of PySpark functions and utilities that can be used to build a complete ETL process for your data modeling. The posts are geared towards people who are already familiar with Python and have a bit of data analytics knowledge (so I often skip the environment set-up). But you can always follow the Installation section first if you are not familiar; then you should be able to follow the notebooks with no big issues. PySpark lets us use Data Scientists' favorite Jupyter Notebook with many pre-built functions to help process your data. The contents in this repo are an attempt to help you get up and running on PySpark in no time!


Table of contents


Installation

Downloading PySpark on your local machine can be a little tricky at first, but the steps below walk you through it.

First things first, make sure you have Jupyter Notebook installed.

  1. Install Jupyter notebook
pip install jupyter notebook
  2. Install PySpark. Make sure you have Java 8 or higher installed on your computer; Java 10 will most likely throw an error. The recommended solution is to install Java 8 (Spark 2.2.1 had problems with Java 9 and beyond).

Of course, you will also need Python (I recommend Python 3.5 or higher from Anaconda). Now visit the Spark downloads page. Select the latest Spark release, a prebuilt package for Hadoop, and download it directly.

  3. Set the environment variables:
SPARK_HOME = D:\Spark\spark-2.3.0-bin-hadoop2.7
PATH += D:\Spark\spark-2.3.0-bin-hadoop2.7\bin
  4. For Windows users,
  • Download winutils.exe from here: https://github.com/steveloughran/winutils
  • Choose the same version as the package type you chose for the Spark .tgz file in step 2 (in my case: hadoop-2.7.1)
  • Navigate inside the hadoop-X.X.X folder; inside the bin folder you will find winutils.exe
  • If you chose the same version as me (hadoop-2.7.1), here is the direct link: https://github.com/steveloughran/winutils/blob/master/hadoop-2.7.1/bin/winutils.exe
  • Move the winutils.exe file to the bin folder inside SPARK_HOME (i.e. D:\Spark\spark-2.3.0-bin-hadoop2.7\bin)
  • Set the following environment variable to be the same as SPARK_HOME: HADOOP_HOME = D:\Spark\spark-2.3.0-bin-hadoop2.7
  5. Restart (or just re-source) your terminal and run the pyspark command to launch PySpark:
$ pyspark
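
To verify the installation from a Jupyter notebook, a minimal sanity check like the following works (the findspark package is optional and only needed if the notebook cannot locate your Spark installation):

# Minimal sanity check that PySpark is wired up correctly
import findspark
findspark.init()                      # optional: points Python at SPARK_HOME

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("install-check").getOrCreate()
print(spark.version)                  # should print the Spark version you downloaded
spark.range(5).show()                 # tiny DataFrame to confirm jobs actually run
spark.stop()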

For video instructions on installation for a Windows/Mac/Ubuntu machine, please refer to the YouTube links below.

Or see more blogs I found on installation steps.


PySpark

These examples display unique functionalities available in PySpark. They cover a broad range of topics with different methods that users can utilize inside PySpark.

Apache Spark is a cluster computing system that offers comprehensive libraries and APIs for developers, and Spark SQL is the module in Apache Spark for processing structured data with the help of the DataFrame API. In this notebook, we cover the basics of how to run Spark jobs with PySpark (the Python API) and execute useful functions inside it. If followed, you should be able to grasp a basic understanding of PySpark and its common functions.
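
As a taste of what the notebook covers, a minimal Spark SQL workflow looks roughly like the sketch below; the file name and column names here are made up for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark-basics").getOrCreate()

# Read a CSV into a DataFrame, inferring the schema from the data
df = spark.read.csv("data/sales.csv", header=True, inferSchema=True)

df.printSchema()          # inspect column names and types
df.show(5)                # peek at the first rows

# Register the DataFrame as a temporary view and query it with SQL
df.createOrReplaceTempView("sales")
spark.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").show()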

Spark is one of the most used tools when it comes to working with Big Data, and whereas Spark used to be heavily reliant on RDD manipulations, it now provides a DataFrame API for us Data Scientists to work with. So in this notebook, we learn the standard Spark functionalities needed to work with DataFrames, and finally some tips to handle the inevitable errors you will face.
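
For example, a typical chain of DataFrame operations reads much like the sketch below (the df and its column names are hypothetical); expressing the same logic with raw RDDs would require manual tuple manipulation.

from pyspark.sql import functions as F

(df.select("customer_id", "amount")
   .filter(F.col("amount") > 100)
   .groupBy("customer_id")
   .agg(F.sum("amount").alias("total_spent"),
        F.count("*").alias("n_orders"))
   .orderBy(F.desc("total_spent"))
   .show(10))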

This notebook covers a classification problem in Machine Learning and goes through a comprehensive guide to successfully develop an end-to-end ML class-prediction model using PySpark. We use a number of different supervised algorithms to precisely predict individuals' income using data collected from the 1994 U.S. Census. We then choose the best candidate algorithm from preliminary results and further optimize it to best model the data. Our goal with this implementation is to build a model that accurately predicts whether an individual makes more than $50,000.
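
A hedged sketch of what such an end-to-end pipeline can look like in PySpark is shown below; census_df, the column names, and the choice of LogisticRegression are illustrative assumptions, not the notebook's exact setup.

from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Encode the label and one categorical feature, assemble numeric columns into a vector
label_idx = StringIndexer(inputCol="income", outputCol="label")
occ_idx   = StringIndexer(inputCol="occupation", outputCol="occupation_idx")
occ_ohe   = OneHotEncoder(inputCol="occupation_idx", outputCol="occupation_vec")
assembler = VectorAssembler(inputCols=["age", "hours_per_week", "occupation_vec"],
                            outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")

pipeline = Pipeline(stages=[label_idx, occ_idx, occ_ohe, assembler, lr])
train, test = census_df.randomSplit([0.8, 0.2], seed=42)   # census_df: hypothetical input
model = pipeline.fit(train)

auc = BinaryClassificationEvaluator(labelCol="label").evaluate(model.transform(test))
print(f"Test AUC: {auc:.3f}")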

In-memory computation and parallel processing are some of the major reasons Apache Spark has become very popular in the big data industry for dealing with data products at large scale and performing faster analysis. We use a real-world dataset from the Home Credit Default Risk competition on Kaggle. The target variable is either 0 (applicants who were able to pay back their loans) or 1 (applicants who were NOT able to pay back their loans). It is a binary classification problem with a highly imbalanced target label.
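
One common way to handle such an imbalanced label in Spark ML is a per-row weight column, sketched below; loan_df, the TARGET column name, and the pre-assembled features column are placeholders for the Home Credit data.

from pyspark.sql import functions as F
from pyspark.ml.classification import LogisticRegression

n_total = loan_df.count()
n_pos   = loan_df.filter(F.col("TARGET") == 1).count()
pos_weight = (n_total - n_pos) / n_pos          # up-weight the rare "default" class

weighted = loan_df.withColumn(
    "class_weight",
    F.when(F.col("TARGET") == 1, F.lit(pos_weight)).otherwise(F.lit(1.0))
)

# weightCol tells the estimator to take the per-row weights into account
lr = LogisticRegression(featuresCol="features", labelCol="TARGET",
                        weightCol="class_weight")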

Machine learning in the real world is messy. Data sources contain missing values, include redundant rows, or may not fit in memory. Feature engineering often requires domain expertise and can be tedious. Modeling too often mixes data science and systems engineering, requiring not only knowledge of algorithms but also of machine architecture and distributed systems. In this notebook, we build a model to predict the quality of Portuguese "Vinho Verde" wine based on the wine's physicochemical properties. It covers data importing, visualization, parallel hyperparameter computation, exploring the best-performing model in MLflow, etc.
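
As a rough illustration of parallel hyperparameter computation with MLflow tracking, a sketch along these lines is possible; wine_df, the column names, and the grid values are assumptions, not the notebook's exact code.

import mlflow
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

rf = RandomForestRegressor(featuresCol="features", labelCol="quality")
grid = (ParamGridBuilder()
        .addGrid(rf.numTrees, [50, 100])
        .addGrid(rf.maxDepth, [5, 10])
        .build())
evaluator = RegressionEvaluator(labelCol="quality", metricName="rmse")

# parallelism fits several candidate models at the same time
cv = CrossValidator(estimator=rf, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3, parallelism=4)

with mlflow.start_run():
    cv_model = cv.fit(wine_df)                           # wine_df: hypothetical input
    mlflow.log_metric("best_rmse", min(cv_model.avgMetrics))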

Apache Spark is quickly gaining steam both in the headlines and in real-world adoption, mainly because of its ability to process streaming data. With so much data being processed on a daily basis, it has become essential to be able to stream and analyze it in real time. We use the Spark Machine Learning library (Spark MLlib) to solve a multi-class text classification problem.
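
A minimal Spark MLlib text-classification pipeline might look like the sketch below; docs_df, with a "text" column and a "category" label, is a hypothetical stand-in for the notebook's dataset.

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF, StringIndexer
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")
remover   = StopWordsRemover(inputCol="words", outputCol="filtered")
tf        = HashingTF(inputCol="filtered", outputCol="raw_features", numFeatures=1 << 16)
idf       = IDF(inputCol="raw_features", outputCol="features")
label     = StringIndexer(inputCol="category", outputCol="label")
lr        = LogisticRegression(maxIter=20)   # handles multiple classes automatically

pipeline = Pipeline(stages=[tokenizer, remover, tf, idf, label, lr])
model = pipeline.fit(docs_df)                # docs_df: hypothetical input DataFrame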

This notebook covers a full multi-class classification problem with the Decision Tree method, looking at the SFO airport data to predict the overall rating each customer gives. It covers a complete modeling cycle (data loading, creating a model, evaluating a model, feature importance).
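
A brief sketch of the decision-tree step is shown below; survey_df, the rating column, and the pre-assembled features vector are illustrative placeholders.

from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

dt = DecisionTreeClassifier(featuresCol="features", labelCol="rating", maxDepth=5)
train, test = survey_df.randomSplit([0.8, 0.2], seed=42)   # survey_df: hypothetical input
dt_model = dt.fit(train)

acc = MulticlassClassificationEvaluator(
    labelCol="rating", metricName="accuracy").evaluate(dt_model.transform(test))
print(f"Accuracy: {acc:.3f}")
print(dt_model.featureImportances)   # which features drive the predicted rating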

In this notebook, we set up a hyperparameter-tuning framework in PySpark using machine learning libraries like scikit-learn/xgboost/lightgbm. Manual tuning usually means changing a lot of parameters, and Hyperopt works on only one model at a time, so training each model was taking up a lot of time and I was pretty short on time. But what if we could parallelize the hyperparameter search? You can choose to load your data using Spark, but here I start by creating our own classification data to set up a minimal example we can work with.
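
A sketch of that idea using Hyperopt's SparkTrials, which runs each trial as a Spark task so candidates are evaluated in parallel, could look like this; the search space and synthetic data are a minimal made-up example, not the notebook's exact setup.

from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Create our own classification data for a minimal, self-contained example
X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

def objective(params):
    clf = RandomForestClassifier(n_estimators=int(params["n_estimators"]),
                                 max_depth=int(params["max_depth"]),
                                 random_state=42)
    score = cross_val_score(clf, X, y, cv=3, scoring="accuracy").mean()
    return {"loss": -score, "status": STATUS_OK}

space = {"n_estimators": hp.quniform("n_estimators", 50, 300, 25),
         "max_depth":    hp.quniform("max_depth", 3, 12, 1)}

# SparkTrials distributes the trials across the cluster
best = fmin(objective, space, algo=tpe.suggest, max_evals=32,
            trials=SparkTrials(parallelism=4))
print(best)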

This notebook introduces a number of advanced Spark tips that can be applied to speed up data processing. We go over two important methods to increase performance and reduce cost: repartitioning and coalesce. Parallelism lets us perform millions of tasks simultaneously and independently on a large number of machines in a cluster. Under the hood, each DataFrame (RDD) is stored in partitions on different cluster nodes.
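
The difference between the two in code (df and the output path below are hypothetical): repartition() performs a full shuffle and can increase or decrease the partition count, while coalesce() only merges existing partitions and avoids a shuffle, making it the cheaper choice when reducing partitions before writing out.

print(df.rdd.getNumPartitions())                 # current partition count

df_even   = df.repartition(200, "customer_id")   # full shuffle into 200 partitions by key
df_narrow = df.coalesce(8)                       # collapse to 8 partitions, no shuffle

df_narrow.write.mode("overwrite").parquet("output/orders")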

The five advanced tips that will make you a master of Spark are here. Make sure not to skip any of them!


Contact

Created by @hyunjoonbok - feel free to contact me!

Resource

  • Ultimate PySpark Cheat Sheet Blog
  • Use Apache Arrow to Assist PySpark in Data Processing Medium
  • Real Python Website
  • Luminousmen Blog

Reference

  • Victor Romain's Medium
  • Official PySpark Documentation Link
