In this lesson, you learn how to get set up with Spark and see the basics of how to program with Spark.
I start with a little bit of history of the project and provide motivation for the framework: why Spark, why now?
From there, I walk through the process of getting Spark set up locally on your laptop so you can start developing your own Spark applications!
Along the way, you learn the common paradigms and abstractions Spark leverages: mainly functional programming and resilient distributed datasets (RDDs).
- Understand the history and motivation behind Spark
- Set up a local Spark environment
- Program your first Spark job with the PySpark shell
- Understand the common paradigms for programming with Spark: RDDs and functional programming
- Work with key-value pairs to perform MapReduce operations
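To preview the last two objectives, here is a minimal sketch of the key-value MapReduce pattern, written with plain-Python functional building blocks as stand-ins for Spark's RDD API (the purchase records below are made up for illustration). In PySpark, the same pipeline would typically use `sc.parallelize(...)`, `.map(...)`, and `.reduceByKey(...)`.

```python
from functools import reduce
from itertools import groupby
from operator import itemgetter

# Hypothetical (customer, amount) purchase records for illustration.
purchases = [("alice", 30.0), ("bob", 10.0), ("alice", 12.5), ("bob", 5.0)]

# Map step: emit key-value pairs (the records are already in that shape here).
pairs = list(map(lambda record: (record[0], record[1]), purchases))

# Reduce step: group by key, then combine each group's values with a
# two-argument function -- the plain-Python analogue of
# RDD.reduceByKey(lambda a, b: a + b). Note that groupby needs sorted input.
totals = {
    key: reduce(lambda a, b: a + b, (amount for _, amount in group))
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))
}

print(totals)  # {'alice': 42.5, 'bob': 15.0}
```

The same shape carries over to Spark: because the reduce function is associative and commutative, each partition can be combined independently before results are merged across the cluster.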
- 1.9: coin.py
- 1.10: big_spenders.py
- Moore's Law
- At the limit of Moore's law: scientists develop molecule-sized transistors
- Quora: Why haven't CPU clock speeds increased in the last 5 years?
- Why CPUs aren't getting any faster
- An Architecture for Fast and General Data Processing on Large Clusters (Matei's Dissertation)
- The Google File System (original paper)
- Nutch, and Search Engine History
- MapReduce: Simplified Data Processing on Large Clusters (original paper)
- Doug Cutting: The History of Hadoop
- Hadoop: A brief History
- Spark: Cluster Computing with Working Sets (original paper)
- About Databricks
- The Apache Software Foundation Announces Apache™ Spark™ as a Top-Level Project
- Apache Spark’s journey from academia to industry
- The State of Spark: And where we are going next
- Installing R
- IRKernel Homepage
- SparkR Setup
- Jupyter/IPython tutorial
- Cloudera QuickStart VM
- Hortonworks Sandbox VM