This is the implementation of our ICT2107 Hadoop project, built by:
| Student | Student ID |
|---|---|
| Fabian Chua | 2101506 |
| Shaun Sartra Varghese | 2102172 |
| Pang Ka Ho | 2102047 |
| Norman Chia | 2100686 |
| Wang Qixian | 2101751 |
You can view the dashboard demo at https://ict2107-team-p3-6-project.vercel.app/
- Run `pip install -r requirements.txt` to install the crawler's dependencies.
- Change the Glassdoor URL on line 142 of `GlassDoorCrawler.py` to the company you want to crawl.
- Run `GlassDoorCrawler.py`.
- Crawled reviews are automatically saved to a CSV file named after the company.
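The save step can be sketched as follows (a minimal illustration, not the actual crawler code; `save_reviews` and the sample data are hypothetical, but the column order matches the CSV format the Hadoop jobs expect):

```python
import csv

# Column order expected by the downstream Hadoop jobs.
FIELDS = ["Summary", "Date", "JobTitle", "AuthorLocation",
          "OverallRating", "Pros", "Cons"]

def save_reviews(reviews, company):
    """Write a list of review dicts to '<company>.csv' (hypothetical helper)."""
    path = f"{company}.csv"
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(reviews)
    return path
```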
- Ensure that the input directory, the output directory, and the list of negative and positive words are present in HDFS. (You may use the word list we used, found at Backend > SentimentAnalysis > words.csv.)
- The input directory should contain one or more CSV files as input data, in the following format: `Summary,Date,JobTitle,AuthorLocation,OverallRating,Pros,Cons`
- Navigate to the folder that contains the `lexiconSa` jar file.
- Usage: `hadoop jar group_p3_6_lexiconSa.jar org.example.WordComparisonAnalysis <input_dir> <output_dir> <words.csv>`
- Example usage:
  `hadoop jar group_p3_6_lexiconSa.jar org.example.WordComparisonAnalysis hdfs://localhost:9000/user/shaunv/project/wordMapInput/ hdfs://localhost:9000/user/shaunv/project/wordMapOutput/ hdfs://localhost:9000/user/shaunv/project/words.csv`
- Output will be written to the `<output_dir>` specified, in HDFS.
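The core idea of the job, lexicon-based scoring, can be sketched in Python (an illustration of the approach only, not the Java implementation; the mini word lists below are hypothetical stand-ins for `words.csv`):

```python
def lexicon_score(text, positive, negative):
    """Count positive and negative lexicon hits in a review;
    return (pos_count, neg_count, label)."""
    words = text.lower().split()
    pos = sum(1 for w in words if w in positive)
    neg = sum(1 for w in words if w in negative)
    label = ("positive" if pos > neg
             else "negative" if neg > pos
             else "neutral")
    return pos, neg, label

# Hypothetical mini-lexicon; the real job reads its word list from HDFS.
POSITIVE = {"good", "great", "friendly"}
NEGATIVE = {"bad", "stressful", "toxic"}
```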
⚠️ Running the CoreNLP analysis on large datasets is VERY resource-intensive and takes a VERY long time; we recommend using a smaller test dataset!
- Ensure that the input and output directories are present in HDFS.
- The input directory should contain one or more CSV files as input data, in the following format: `Summary,Date,JobTitle,AuthorLocation,OverallRating,Pros,Cons`
- Navigate to the folder that contains the `nlpSa` jar file.
- Usage: `hadoop jar group_p3_6_nlpSa.jar org.example.GlassdoorSentimentAnalysis <input_dir> <output_dir>`
- Example usage:
  `hadoop jar group_p3_6_nlpSa.jar org.example.GlassdoorSentimentAnalysis hdfs://localhost:9000/user/shaunv/project/nlpSaInput/ hdfs://localhost:9000/user/shaunv/project/nlpSaOutput/`
- Output will be written to the `<output_dir>` specified, in HDFS.
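One way to produce the smaller test dataset recommended above is to keep only the first N data rows of a crawled CSV before uploading it to HDFS. A sketch, assuming the CSV carries a header row (file names here are hypothetical):

```python
import csv

def sample_csv(src, dst, n):
    """Copy the header plus the first n data rows of src into dst."""
    with open(src, newline="", encoding="utf-8") as fin, \
         open(dst, "w", newline="", encoding="utf-8") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        writer.writerow(next(reader))  # keep the header row
        for i, row in enumerate(reader):
            if i >= n:
                break
            writer.writerow(row)
```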
- Ensure that the input directory and the list of stopwords are present in HDFS. (You may use the stopwords we used, found at Backend > TopicModelling > mallet > stopwords.txt.)
- The input directory should contain one or more CSV files as input data, in the following format: `Summary,Date,JobTitle,AuthorLocation,OverallRating,Pros,Cons`
- Navigate to the folder that contains the topic modelling jar file.
- Usage: `hadoop jar group_p3_6_tm.jar <input_dir> <output_dir> <stopwords.txt>`
- Example usage:
  `hadoop jar group_p3_6_tm.jar project/input project/output project/stopwords.txt`
- Output will be written to the `<output_dir>` specified, in HDFS.
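The stopword filtering applied before topic modelling can be sketched as follows (an illustration of the preprocessing idea, not the Mallet-based implementation; the stopword set below is a hypothetical stand-in for `stopwords.txt`):

```python
import re

def tokenize(text, stopwords):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stopwords]

# Hypothetical mini stoplist; the real job reads stopwords.txt from HDFS.
STOPWORDS = {"the", "and", "a", "of", "is", "to"}
```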