stormsinbrewing/Real_Time_Social_Media_Mining

DevOps pipeline for Real Time Social/Web Mining

Workflow

  • Setting up Apache Maven for the Java project (user interface and MapReduce functions)

  • Setting up GitHub repository workflow

  • Setting up GitHub Actions for automation

  • Creating a web crawler in Python using the Tweepy library to fetch Twitter data based on a search parameter such as a hashtag

  • Creating an HDFS cluster for MapReduce functionality and programming Hadoop MapReduce in Java

  • Setting up Hadoop Core and creating the Job Tracker and Task Trackers for the project

  • Implementing MapReduce on HDFS in Java to count how often the significant words from the data dictionary occur in the tweet text

  • Configuring Apache Maven with the MapReduce code and installing the Apache Hadoop JAR dependency

  • Configuring the MapReduce code in GitHub Actions for automation

  • Automating the Big Data pipeline up to the MapReduce stage using GitHub Actions

  • Textual Analysis - Writing a Java program that runs MapReduce on the JSON file extracted by the crawler to find the frequency of significant words

  • Data Classification - Creating a multi-class data dictionary for sentiment analysis. It currently covers single words; in the future it might be extended to phrases and sentences for improved accuracy.

  • Data Prediction - Using the KNN algorithm in Python to find the relation between tweets and their sentiments

  • Data Visualization - Using the Python matplotlib library to plot the results
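The Data Classification and Data Prediction steps above can be illustrated with a minimal sketch. It is not the project's code: the dictionary entries, the training tweets, and the names `DATA_DICTIONARY`, `featurize`, and `knn_predict` are all hypothetical, and the feature choice (counting dictionary words per sentiment class) is just one plausible way to connect a word-level data dictionary to KNN. It uses only the Python standard library (Python 3.8+ for `math.dist`).

```python
import math
from collections import Counter

# Hypothetical multi-class data dictionary: word -> sentiment class.
DATA_DICTIONARY = {
    "great": "positive", "love": "positive", "happy": "positive",
    "bad": "negative", "hate": "negative", "awful": "negative",
    "okay": "neutral", "fine": "neutral",
}
CLASSES = ("positive", "negative", "neutral")

# Hypothetical labelled tweets used as the KNN training set.
TRAIN = [
    ("love this great phone", "positive"),
    ("so happy today", "positive"),
    ("great day love it", "positive"),
    ("hate this awful weather", "negative"),
    ("bad service", "negative"),
    ("awful bad day", "negative"),
    ("it was okay", "neutral"),
    ("fine i guess", "neutral"),
]


def featurize(tweet):
    """Count how many words of each sentiment class appear in the tweet."""
    counts = Counter(DATA_DICTIONARY.get(word) for word in tweet.lower().split())
    return [counts[cls] for cls in CLASSES]


def knn_predict(train, tweet, k=3):
    """Return the majority label among the k training tweets nearest to `tweet`."""
    query = featurize(tweet)
    nearest = sorted(train, key=lambda item: math.dist(featurize(item[0]), query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(knn_predict(TRAIN, "i love it"))
    print(knn_predict(TRAIN, "awful bad service"))
```

Word-level features like these ignore word order and negation, which is one reason extending the dictionary to phrases and sentences could improve accuracy.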

Important Source files and dependencies

  1. pom.xml - Apache Maven setup

  2. helloworld.java - Basic Java project setup

  3. maven.yml - GitHub Actions setup

  4. crawler.py - Web crawler in Python that extracts Twitter data based on specific hashtags

  5. info.csv - Data file produced by the crawler and sent to the HDFS core for processing

  6. MapReduce functionalities in Java

  7. Sentiment Analysis in Python
     • Convolutional Neural Networks
     • Decision Tree
     • SVM
     • Pre-Processing
     • Random Forests
     • Naive Bayes
     • XGBoost

  8. matplotlib.py - Data visualization using matplotlib in Python

  9. Hadoop Setup
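To show what the MapReduce functionality listed above computes on the crawler output, here is a plain-Python sketch of the map and reduce phases. The project itself implements this step in Java on Hadoop, so this is only an illustration of the logic, not the repository's code. The `text` column name, the sample rows, and the `SIGNIFICANT_WORDS` set are assumptions; the real layout of info.csv and the real data dictionary may differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical set of significant words taken from the data dictionary.
SIGNIFICANT_WORDS = {"storm", "flood", "rain"}

# Hypothetical crawler output; the real info.csv may use other columns.
SAMPLE_CSV = "text\nHeavy rain and flood warnings\n#storm brings more rain\n"


def map_phase(text):
    """Emit a (word, 1) pair for every significant word in one tweet."""
    for token in text.lower().split():
        word = token.strip(".,!?#@:;\"'")
        if word in SIGNIFICANT_WORDS:
            yield word, 1


def reduce_phase(pairs):
    """Sum the emitted counts for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def count_significant_words(csv_text, column="text"):
    """Run the map phase over every CSV row, then reduce the emitted pairs."""
    reader = csv.DictReader(io.StringIO(csv_text))
    pairs = (pair for row in reader for pair in map_phase(row[column]))
    return reduce_phase(pairs)


if __name__ == "__main__":
    print(count_significant_words(SAMPLE_CSV))
```

On a Hadoop cluster the map phase runs in parallel across input splits and the framework groups pairs by key before the reduce phase; the sketch collapses both into a single process.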

How to Contribute

This is an open source project, open to everyone.

Follow these contribution guidelines.

License

MIT License, copyrighted to Storms In Brewing (2019-2020)