In this project, we use data from the Yelp Dataset Challenge. Our goal is to analyze user behavior patterns and build a recommender system for users. We focus on users who have rated businesses in the Greater Toronto Area. We processed the data and built a web application that demonstrates the results using big data technologies.
- data/db
Contains the MongoDB data used in this application
- data_preprocess_code
Contains code that accomplishes the following tasks:
- Collect data in the Greater Toronto Area
- Convert string id to int id
- Process data for building user-business relationship
- Process data for finding user compliments and votes
- Process data for analyzing user reviews with TF-IDF
- Process data for gathering user-rated businesses, identifying new categories, and re-assigning businesses to the new categories
- Train the recommendation system
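The string-to-int ID conversion listed above can be sketched in plain Python. This is a minimal illustration, not the project's actual preprocessing code: the function names and sample rows are hypothetical, and the motivation assumed here is that Spark MLlib's ALS recommender works with integer user and item IDs, while Yelp IDs are strings.

```python
from typing import Dict, Iterable, List, Tuple


def build_id_map(string_ids: Iterable[str]) -> Dict[str, int]:
    """Assign consecutive integers to string IDs in first-seen order."""
    id_map: Dict[str, int] = {}
    for sid in string_ids:
        if sid not in id_map:
            id_map[sid] = len(id_map)
    return id_map


def encode_ratings(
    rows: Iterable[Tuple[str, str, float]],
    user_map: Dict[str, int],
    business_map: Dict[str, int],
) -> List[Tuple[int, int, float]]:
    """Replace string IDs with their integer codes, keeping the rating."""
    return [(user_map[u], business_map[b], float(stars)) for u, b, stars in rows]


# Hypothetical sample rows: (user_id, business_id, stars)
reviews = [
    ("u_abc", "b_xyz", 4),
    ("u_def", "b_xyz", 5),
    ("u_abc", "b_qrs", 2),
]

user_map = build_id_map(u for u, _, _ in reviews)
business_map = build_id_map(b for _, b, _ in reviews)
encoded = encode_ratings(reviews, user_map, business_map)
print(encoded)  # [(0, 0, 4.0), (1, 0, 5.0), (0, 1, 2.0)]
```

Keeping both maps around (for example, stored alongside the data) makes it possible to translate integer predictions back to the original Yelp IDs when showing results.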
- yelpserver
Contains the web application that visualizes all our results.
The web application performs the following tasks:
- Query data from MongoDB (yelpserver/app.py, db.py)
- Populate data to the web frontend (yelpserver/app.py)
- Visualize data in the web frontend (yelpserver/static/js/recommend.js, vs.js, yelpserver/templates/index.html, user.html, recommend.html)
- Construct user ID and new business ID pairs to feed to the pre-trained recommender model (yelpserver/app.py)
- Invoke Spark to load the pre-trained model and make predictions (yelpserver/recommenderSystem.py)
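The last two tasks above (pairing a user with businesses they have not rated yet, then scoring those pairs) can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in yelpserver/app.py: the function names are hypothetical, and `fake_predict` is a stand-in for the pre-trained Spark model's prediction.

```python
from typing import Callable, Iterable, List, Set, Tuple


def candidate_pairs(
    user_id: int, all_business_ids: Iterable[int], rated: Set[int]
) -> List[Tuple[int, int]]:
    """Pair the user with every business they have not rated yet."""
    return [(user_id, b) for b in all_business_ids if b not in rated]


def top_n(
    pairs: List[Tuple[int, int]],
    predict: Callable[[int, int], float],
    n: int = 3,
) -> List[Tuple[int, float]]:
    """Score each (user, business) pair and return the n best businesses."""
    scored = [(b, predict(u, b)) for u, b in pairs]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:n]


# Hypothetical scores standing in for the pre-trained model's output.
FAKE_SCORES = {0: 3.5, 2: 4.8, 3: 2.1, 5: 4.2}


def fake_predict(user_id: int, business_id: int) -> float:
    """Stand-in for the real model: look up a fixed score per business."""
    return FAKE_SCORES[business_id]


pairs = candidate_pairs(user_id=2, all_business_ids=range(6), rated={1, 4})
recommendations = top_n(pairs, fake_predict, n=2)
print(recommendations)  # [(2, 4.8), (5, 4.2)]
```

Passing the prediction function as a parameter keeps the pairing and ranking logic testable without starting Spark.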
- Install MongoDB server locally
- Clone the repository with git
- From the repository folder, start MongoDB with the bundled data:
mongod --dbpath data/db
- Change directory to the yelpserver folder
- Start the web server with the command:
spark-submit server.py
- In a browser, open http://0.0.0.0:5000
- Data Processing: Spark, Spark SQL, Spark MLlib
- Web backend: Flask, Spark, CherryPy
- Web frontend: D3.js, DC.js, crossfilter.js, Leaflet.js, Keen.js, Bootstrap v4
- Data Storage: MongoDB
- Other tools/technologies: Gephi for the user-business relationship graph, Yelp GraphQL API for additional data queries