cookiecutter-pyspark-cloud

Run PySpark code in the 'cloud' on the Amazon Web Services (AWS) Elastic MapReduce (EMR) service in a few simple steps with this cookiecutter project template!

Quickstart

pip install -U "cookiecutter>=1.7"
cookiecutter --no-input https://github.com/daniel-cortez-stevenson/cookiecutter-pyspark-cloud.git
cd pyspark-cloud
make install
pyspark_cloud

Your console will look something like:

pyspark_cloud command-line banner

Features

  • AWS ☁️ Cloudformation Template for EMR: Simple Spark cluster deployment with infrastructure as code

  • A Command-Line Interface for Running PySpark 'Jobs': For production 🚀 runs via the EMR Step API (see the boto3 sketch after this list)

  • Log Like a Pro: Save time debugging in style 💃

  • Wrap Scala with Python 🐍: Use libraries that haven't been included in the PySpark API!

    • An example of wrapping Scala Spark API code with PySpark API code is provided with SnowballStemmer (see the sketch after this list)
    • Could be extended to other Scala MLlib classes (and other Scala classes that implement the UDF interface)
  • Simplify Workflows with Make ✅: A Makefile with commands for installation, development, and deployment.

    • Use with make [COMMAND]
    • For example, ship an executable .egg 🥚 distribution of your PySpark code to AWS S3 with make s3dist
  • Organize Your Code: Package code shared between 'jobs' in a Python module of your package called common

  • Extend the PySpark API: An example of extending the PySpark SQL DataFrame class, which allows chaining custom transformations with dot . notation (see the sketch after this list)

  • Development Framework: All the tools you need

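As referenced in the features list, a production run submits a packaged job to a running cluster through the EMR Step API. Here is a minimal sketch with boto3; the cluster ID, bucket, file names, and job arguments below are placeholders, not the template's exact CLI:

import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Submit a Spark step to a running EMR cluster. command-runner.jar is EMR's
# built-in way to run spark-submit (and other commands) as a step.
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
    Steps=[
        {
            "Name": "run-example-job",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "--py-files", "s3://your-bucket/dist/your_package-0.1.0-py3.7.egg",
                    "s3://your-bucket/dist/run_job.py",  # hypothetical entrypoint script
                    "example-job",                       # hypothetical job name argument
                ],
            },
        }
    ],
)
print(response["StepIds"])
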
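The Scala-wrapping feature relies on reaching through PySpark's Py4J gateway to a JVM object. A minimal sketch, assuming a hypothetical Scala object com.example.spark.SnowballStemmer (shipped in a JAR on the cluster's classpath) with an apply(Column, String) method:

from pyspark.sql import SparkSession
from pyspark.sql.column import Column, _to_java_column

def snowball_stem(col, language="English"):
    """Wrap a Scala SnowballStemmer so it reads like a native PySpark function."""
    sc = SparkSession.builder.getOrCreate().sparkContext
    # _jvm exposes JVM classes via the Py4J gateway; the object path is hypothetical.
    stemmer = sc._jvm.com.example.spark.SnowballStemmer
    return Column(stemmer.apply(_to_java_column(col), language))

# Usage: df.withColumn("stemmed", snowball_stem(df["token"]))
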
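One common way to get chainable custom transformations on DataFrames is to attach functions to the DataFrame class itself; the template may implement this differently (for example, by subclassing). A minimal sketch with a made-up transformation:

from pyspark.sql import DataFrame

def with_snake_case_columns(df: DataFrame) -> DataFrame:
    """Illustrative custom transformation: normalize column names."""
    return df.toDF(*[c.lower().replace(" ", "_") for c in df.columns])

# Attaching the function to DataFrame lets it chain with dot notation:
DataFrame.with_snake_case_columns = with_snake_case_columns

# Usage: spark.read.csv(path, header=True).with_snake_case_columns().select("my_col")
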
Infrastructure Overview

As defined in the AWS Cloudformation template:

AWS Cloudformation template InfViz.io diagram

Usage

  1. Clone this repo:
git clone https://github.com/daniel-cortez-stevenson/cookiecutter-pyspark-cloud.git
cd cookiecutter-pyspark-cloud
  2. Create a Python environment with dependencies installed:
conda create -n cookiecutter -y "python=3.7"
conda activate cookiecutter
pip install -r requirements.txt
  3. Make any changes to the template, as you wish.

  4. Create your project from the template:

cd ..
cookiecutter ./cookiecutter-pyspark-cloud
  5. Initialize git:
cd *your-repo_name*
git init
git add .
git commit -m "Initial Commit"
  6. Create a new Conda environment for your new project & install project development dependencies:
conda deactivate
conda create -n *your-repo_name* -y "python=3.6"
conda activate *your-repo_name*
make install-dev

Contribute

Contributions are welcome! Thanks!

Submit a Bug or Feature Request

Submit a Pull Request

Acknowledgements

Most of the ideas in this repo are not new; they're just expressed in a new way. Thanks, folks! 🙌