Skip to content

πŸ€” ✍️ Simple user-user recommendation system for gardening supplies to explain collaborative filtering and practice using Spark on Amazon product dataset 🌳 🌿

License

Notifications You must be signed in to change notification settings

zuzannna/GardeningRecommendations

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

26 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Spring has sprung in NYC and it's time for gardening! But how to buy the right hose when you're a total gardening newbie?

This is a short step-by-step tutorial on collaborative filtering based recommendation systems on Amazon product data. Detailed instructions are included in the Jupyter Notebook (GardeningToolsRecommender.ipynb), so feel free to check it out. Below, I included materials I found super useful to learn about recommendation systems & Apache Spark (which is used to paralelize alternative least squares at the end of the notebook).

For instructional purposes, the code is not optimized for speed.

recommendation_systems

source of the image: "Recommendation systems: Principles, methods and evaluation" by Isinkayea, Folajimib and Ojokohc, http://doi.org/10.1016/j.eij.2015.06.005

Resources:

Data:

Amazon's data set contains reviews and metadata spanning from 1996 to 2014 and is an excellent source if you want to practice recommendation algorithms. As you might suspect, it's huge, but Julian McAuley from UCSD also shared smaller subsets. I decided to use one of the 5-core datasets which contain entries from users who reviewed at least 5 products and products which were reviewed at least 5 times, which drastically limits the size of it allowing to run costly algorithms (such as ALS) on a personal laptop within a reasonable time (it took me few minutes).

Readings:

Blog posts about collaborative filtering and Alternative Least Squares:

Andrew Ng's awesome intro to recommender systems (part of his ML coursera series, so pretty basic)

Ethan Rosenthal's excellent blog post about collaborative filtering and matrix factorization (a bit more advanced).

Alex Abate on collaborative filtering - I heavily borrowed from her prediction rating code, which demonstrates step-by-step how it works.

bugra on Alternating Least Squares.

Apache Spark

Using Apache Spark makes a lot of sense when we're using iterative algorithms (such as gradient descent or alternative least squares) as it leverages its capacity for caching / persisting (taking a snapshot of the data and iterating only over steps which are unique across iterations).

Running Apache Spark with Jupyter Notebook:

It's not hard at all, actually! Make sure SSH is enabled on your machine, your Java is up to date, and download + install Spark. In my case, in order to run it I need to execute in the Terminal:

    $ export PATH=$PATH:/usr/local/spark/bin:/usr/local/spark/sbin

followed by:

    $ start-all.sh

and to lanch Jupyter Notebook with Spark:

    $ PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --ip="*"" pyspark --master local[*]

Then, Jupyter Notebook will run on localhost:8888, your Spark cluster UI on localhost:8080 and Spark Jobs on localhost:4040. Those tips come from Austin Ouyang who wrote a great step-by-step intro and gave a two day workshop at Insight Labs that I attended (and you can sign up for too:).

Machine learning and collaborative filtering with Spark:

http://spark.apache.org/docs/latest/mllib-guide.html

https://spark.apache.org/docs/latest/mllib-collaborative-filtering.html

A great tutorial on recommendations systems and Spark with MovieLens data: https://databricks-training.s3.amazonaws.com/movie-recommendation-with-mllib.html

Requirements:

  1. Python (2.7)
  2. Jupyter Notebook
  3. NumPy
  4. SciPy
  5. matplotlib
  6. Pandas
  7. scikit learn

To install all of them (except Python) using pip run:

 pip install -r requirements.txt

About

πŸ€” ✍️ Simple user-user recommendation system for gardening supplies to explain collaborative filtering and practice using Spark on Amazon product dataset 🌳 🌿

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published