
Spark

Introduction

Thank you for stopping by my Spark project. From the research I have done so far, Apache Spark is a well-suited computing engine and library suite for parallel data processing on computer clusters. In this repo, I coded some Spark basics using Python. The repo contains code for Spark DataFrames, working with operators in Spark, and handling missing values (a short sketch of these topics follows below). It is not an exhaustive list; this was my starting point for getting hands-on with the tool.
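
As a taste of what the notebooks cover, here is a minimal, self-contained sketch of those three topics. The column names and sample rows are made up for illustration and are not taken from the notebooks in this repo.

```python
# Minimal PySpark sketch: creating a DataFrame, filtering with operators,
# and handling missing values. Sample data is illustrative only.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("spark-basics").getOrCreate()

# Create a small DataFrame that includes missing (None) values
data = [("Alice", 34, 3000.0), ("Bob", None, 4500.0), ("Carol", 29, None)]
df = spark.createDataFrame(data, ["name", "age", "salary"])

# Working with operators: filter rows where salary is greater than 3500
df.filter(F.col("salary") > 3500).show()

# Working with missing values: drop rows containing nulls, or fill them with defaults
df.na.drop().show()
df.na.fill({"age": 0, "salary": 0.0}).show()

spark.stop()
```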

Description

To work with Spark on a local machine, you must install a few packages and set some environment variables that enable Spark to run locally. To run the notebooks in this repo, you need to download the items listed below and create the corresponding environment variables. A small verification sketch follows the list.

Requirements for Spark setup on a Windows machine:

  1. JDK
  2. Python
  3. Hadoop winutils
  4. Spark binaries
  5. Environment variables
  6. Python IDE (VS Code or Jupyter Notebook)
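
Once the items above are installed, a quick way to check the setup is to start a local SparkSession from Python. The paths in this sketch (Spark folder, Hadoop/winutils folder, JDK location) are assumptions; replace them with the locations you actually used.

```python
# A minimal sketch for verifying a local Windows Spark setup from Python.
# The paths below (Spark binaries, Hadoop winutils folder, JDK) are assumptions --
# point them at wherever you installed these items.
import os

os.environ.setdefault("SPARK_HOME", r"C:\spark\spark-3.5.0-bin-hadoop3")
os.environ.setdefault("HADOOP_HOME", r"C:\hadoop")  # folder containing bin\winutils.exe
os.environ.setdefault("JAVA_HOME", r"C:\Program Files\Java\jdk-17")

from pyspark.sql import SparkSession

# If a local SparkSession starts and reports its version, the setup is working.
spark = SparkSession.builder.master("local[*]").appName("setup-check").getOrCreate()
print("Spark version:", spark.version)
spark.stop()
```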

Contributors

This repo was created for learning purposes. If you are interested in becoming a contributor or have ideas on how to make things better, please let me know.

