
spark-jupyter

Using Apache Spark in the Jupyter notebook to analyze Twitter

Getting Started from Scratch

You need several things in place before this repository will be useful.

Java

Visit Oracle's Java SE Downloads page and download the latest Java Platform (JDK) version for your machine.

On my Linux Mint machine, I downloaded jdk-8u91-linux-x64.tar.gz (173 MB) and placed it in /opt. After downloading, I ran the following.

$ sudo gunzip jdk-8u91-linux-x64.tar.gz
$ sudo tar xvf jdk-8u91-linux-x64.tar
$ sudo chown -R jason:jason jdk1.8.0_91

I now have the directory /opt/jdk1.8.0_91, which will become my JAVA_HOME directory later inside our configuration.
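Before wiring this path into any configuration, you can sanity-check the JDK by pointing JAVA_HOME at the new directory and invoking the bundled java binary directly. A quick sketch, assuming the /opt path above (the export here only lasts for the current shell):

$ export JAVA_HOME=/opt/jdk1.8.0_91
$ $JAVA_HOME/bin/java -version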

Spark

On the Apache Spark download page, I downloaded Spark 1.6.0 in the "pre-built for Hadoop 2.6 and later" package (289 MB), so spark-1.6.0-bin-hadoop2.6.tgz is also sitting in my /opt directory. (I am using version 1.6.0 instead of the recently released 1.6.1 because I've had memory-management issues with the newer version.) I extracted and chowned it as follows.

$ sudo tar zxvf spark-1.6.0-bin-hadoop2.6.tgz
$ sudo chown -R jason:jason spark-1.6.0-bin-hadoop2.6

I now have the directory /opt/spark-1.6.0-bin-hadoop2.6, which will become my SPARK_HOME directory later inside our configuration.
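As with Java, you can confirm the Spark install works by pointing SPARK_HOME at this directory and asking spark-submit for its version. A minimal sketch, again assuming the /opt path above:

$ export SPARK_HOME=/opt/spark-1.6.0-bin-hadoop2.6
$ $SPARK_HOME/bin/spark-submit --version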

Anaconda

On the Continuum Analytics Anaconda download page, download and install the version of Anaconda appropriate for your machine. (This installation typically involves running a Bash script. See Continuum's instructions.)

Setup

At this point, you should have conda callable at the command prompt (from the Anaconda installation), and you should have both Java and Spark installed in directories that we'll call JAVA_HOME and SPARK_HOME, respectively.
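A quick way to check that all three pieces are in place before continuing (the paths are the ones used above; adjust them to your machine):

$ conda --version
$ ls /opt/jdk1.8.0_91/bin/java
$ ls /opt/spark-1.6.0-bin-hadoop2.6/bin/pyspark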

Install the Conda Environment

I use a conda environment containing the required dependencies. Setting it up looks something like the following.

$ conda create -n spark-jupyter python
...
$ source activate spark-jupyter
...
$ pip install matplotlib pandas seaborn simplejson requests requests_oauthlib
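One caveat: an environment created from a bare python like this may not include Jupyter itself. If jupyter is not on your PATH after activating the environment, install it there as well (a hedged sketch; conda install jupyter inside the activated environment works equally well):

$ pip install jupyter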

Run

See the script in the bin directory of this repo and set your variables and options as needed. Keep in mind that Jupyter has many settings of its own (which port to run on, whether to open a browser, directory locations, etc.), and you may want to set those in your ~/.jupyter config.
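The script in bin is the source of truth; as a rough illustration of how the pieces of this README fit together, a launcher along these lines is typical. This is a hypothetical sketch, not the repo's actual script; PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS are standard Spark 1.x environment variables for swapping in a different driver frontend, and the paths assume the /opt layout used above.

#!/usr/bin/env bash
# Hypothetical launcher sketch -- see the script in bin/ for the real thing.
export JAVA_HOME=/opt/jdk1.8.0_91
export SPARK_HOME=/opt/spark-1.6.0-bin-hadoop2.6
# Run the PySpark driver inside a Jupyter notebook server instead of a plain REPL.
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
source activate spark-jupyter
exec $SPARK_HOME/bin/pyspark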
