
UNBIASED
Spatio-Temporal Event-Based Influence on Wikipedia Edits

Motivation🚀

Every day there are thousands of notable events across the globe: protests, market dips, terrorist attacks, and more.

The question is: do global events influence the editing of Wikipedia articles?

UNBIASED is a tool that lets moderators and researchers leverage open data to understand and further research patterns in Wikipedia edit contributions.

Data🪣

| Type | Source | Size | Update Frequency | Location |
| --- | --- | --- | --- | --- |
| GDELT | Global Database of Events, Language, and Tone | 6+ TB | 15 minutes | Public S3 |
| Wikipedia Metadata | English Wikipedia dumps | ~500 GB | Varies | Private S3 |

GDELT:

The GDELT Project monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day, creating a free open platform for computing on the entire world.
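GDELT publishes a manifest of its latest 15-minute drop at http://data.gdeltproject.org/gdeltv2/lastupdate.txt. As a minimal illustration of ingesting it, here is a hedged Python sketch; the use of requests and the destination directory are assumptions, and this is not the repository's own scraper (src/dataingestion/scraper.py):

    import io
    import zipfile
    import requests

    MANIFEST = "http://data.gdeltproject.org/gdeltv2/lastupdate.txt"

    def fetch_latest_gdelt(dest_dir="."):
        """Download and unzip the CSVs listed in GDELT's 15-minute manifest."""
        manifest = requests.get(MANIFEST, timeout=30)
        manifest.raise_for_status()
        for line in manifest.text.splitlines():
            if not line.strip():
                continue
            url = line.split()[-1]  # each manifest line is: <size> <md5> <zip url>
            archive = requests.get(url, timeout=120)
            archive.raise_for_status()
            # Each zip wraps a single tab-delimited CSV; extract it locally.
            with zipfile.ZipFile(io.BytesIO(archive.content)) as zf:
                zf.extractall(dest_dir)

    if __name__ == "__main__":
        fetch_latest_gdelt()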



Wikipedia Metadata:

Historical and current dumps of English Wikipedia, consisting of metadata that includes the edits, commit messages, user IDs, and timestamp of each edit to a Wikipedia article.
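To make that per-edit metadata concrete, below is a minimal sketch that streams the fields named above (timestamp, user ID, comment) out of a MediaWiki XML export. The {*} namespace wildcard needs Python 3.8+; the function name is illustrative, and this is not the project's wikiScraper.py:

    import xml.etree.ElementTree as ET

    def iter_edits(dump_path):
        """Yield (title, timestamp, user_id, comment) for every revision
        in a MediaWiki XML export such as a stub-meta-history dump."""
        title = None
        for _, elem in ET.iterparse(dump_path, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]  # strip the XML namespace
            if tag == "title":
                title = elem.text
            elif tag == "revision":
                ts = elem.find("{*}timestamp")
                uid = elem.find("{*}contributor/{*}id")  # absent for IP edits
                comment = elem.find("{*}comment")
                yield (
                    title,
                    ts.text if ts is not None else None,
                    uid.text if uid is not None else None,
                    comment.text if comment is not None else None,
                )
                elem.clear()  # free the processed revision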





Pipeline Architecture🔗


Architectural Components🗜️

| Entity | Purpose | Type |
| --- | --- | --- |
| AWS S3 | Raw Data Storage | - |
| AWS EC2 | Spark Cluster, Decompressor | Master: 1 x m5a.large; Workers: 5 x m5a.large |
| AWS EC2 | TimescaleDB | 1 x m5.xlarge |
| AWS EC2 | Web App | 1 x t3.large |
| AWS EC2 | Airflow Scheduler | 1 x m5.large |

Challenges🤕

Data

  1. Splitting, keyword generation, and binning.
  2. Fuzzy pattern matching (see the sketch after this list).
  3. Data modeling.
  4. Query processing optimization.
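For challenge 2, one simple way to fuzzy-match GDELT actor names against Wikipedia article titles is the standard library's difflib; this is a hedged sketch, not necessarily the matcher the project uses:

    from difflib import SequenceMatcher

    def best_match(actor, titles, threshold=0.8):
        """Return the article title most similar to a GDELT actor name,
        or None if nothing clears the similarity threshold."""
        best, best_score = None, threshold
        for title in titles:
            score = SequenceMatcher(None, actor.lower(), title.lower()).ratio()
            if score > best_score:
                best, best_score = title, score
        return best

    # best_match("Federal Reserve Bank", ["Federal Reserve", "Fed Cup"])
    # -> "Federal Reserve"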

Architectural

  1. Database parameter optimization.
  2. PySpark tuning (see the sketch below).
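As one concrete example of that tuning, here is a hedged SparkSession sketch sized for the cluster in Architectural Components (five m5a.large workers, i.e. 2 vCPUs / 8 GB each); the exact values are assumptions, not the project's recorded configuration:

    from pyspark.sql import SparkSession

    # Assumed sizing for 5 x m5a.large workers (2 vCPU / 8 GB each);
    # leave headroom for the OS and Spark overhead on every node.
    spark = (
        SparkSession.builder
        .appName("unbiased-processor")
        .config("spark.executor.instances", "5")
        .config("spark.executor.cores", "2")
        .config("spark.executor.memory", "5g")
        .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        .config("spark.sql.shuffle.partitions", "40")
        .getOrCreate()
    )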

UI🖥




Directory Structure🗂️

/
│
├── assets
│     ├── logo.png
│     ├── pipeline.png
│     └── dataingestion
│
├──  src
│     │ 
│     ├── dataingestion
│     │     ├── scraper.py
│     │     ├── scraperModules
│     │     │      ├── __init__.py 
│     │     │      ├── linkGenerator.py
│     │     │      └── fileWriter.py
│     │     ├── lists
│     │     │      ├── current_urls.txt
│     │     │      └── historic_urls.txt
│     │     └── runScrapper.sh
│     │
│     ├── decompressor  
│     │     └── decompressor.sh
│     │
│     ├── processor
│     │     ├── dbWriter.py
│     │     ├── wikiScraper.py
│     │     ├── gdeltProc.py
│     │     ├── gdeltModules
│     │     │      ├── __init__.py
│     │     │      ├── eventsProcessor.py
│     │     │      ├── geographiesProcessor.py
│     │     │      ├── mentionsProcessor.py
│     │     │      └── typeCaster.py
│     │     ├── wikiModules
│     │     │      ├── __init__.py
│     │     │      ├── metaProcessor.py
│     │     │      └── tableProcessor.py
│     │     ├── gdelt_run.sh
│     │     └── wiki_run.sh
│     │
│     ├── frontend
│     │     ├── __init__.py
│     │     ├── application.py
│     │     ├── appModules
│     │     │      ├── __init__.py
│     │     │      ├── dbConnection.py
│     │     │      └── dataFetch.py
│     │     ├── requirements.txt
│     │     ├── queries
│     │     │      ├── articleQuery.sql
│     │     │      └── scoreQuery.sql
│     │     └── assets
│     │            ├── layout.css
│     │            ├── main.css
│     │            └── logo.png
│     │
│     └── airflow
│           └── dag.py
│
├── License.md
├── README.md
├── config.ini
└── .gitignore

Instructions📝

Setup

  1. Setup AWS Cluster

    Follow the instructions below, link by link, to set up a cluster and spin up the instances listed above in Architectural Components.

    a. https://blog.insightdatascience.com/simply-install-spark-cluster-mode-341843a52b88
    b. https://blog.insightdatascience.com/how-to-access-s3-data-from-spark-74e40e0b2231

  2. Setup TimescaleDB

    Follow the instructions from the official TimescaleDB blog:
    https://blog.timescale.com/tutorials/tutorial-installing-timescaledb-on-aws-c8602b767a98/

    Follow this video to set up a connection to the cluster:
    https://www.youtube.com/watch?v=5dYeYIWaXjc&feature=youtu.be

    Use this website to optimize database parameters:
    https://pgtune.leopard.in.ua/#/

    (A minimal hypertable sketch appears after these Setup steps.)

  3. Setup frontend framework

    Follow these guides from DigitalOcean:

    a. https://www.digitalocean.com/community/tutorials/how-to-install-nginx-on-ubuntu-18-04
    i. Be sure to allow SSH through ufw as well (sudo ufw allow OpenSSH) when you get to that step, or you will not be able to SSH into your instance!
    ii. If ufw status is listed as inactive, run sudo ufw enable to fix it.
    b. https://www.digitalocean.com/community/tutorials/how-to-serve-flask-applications-with-gunicorn-and-nginx-on-ubuntu-18-04
    i. The default port for Dash is 8050, not 5000.
    ii. These instructions apply to the underlying Flask app. To expose it, put server = app.server near the top of your main Dash script, then substitute server for app in the instructions; otherwise you will get errors saying the app is not callable. (A minimal sketch appears after Code Execution below.)
    iii. When you deploy the app, any file/symlink you made for your domain in /etc/nginx/sites-available/ and /etc/nginx/sites-enabled/ may conflict with the new files; remove the original files.

  4. Setup Airflow
    Set up Airflow per the instructions in this Medium post:
    https://blog.insightdatascience.com/scheduling-spark-jobs-with-airflow-4c66f3144660
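The hypertable sketch referenced in step 2: a minimal psycopg2 example that creates an events table and converts it with TimescaleDB's create_hypertable. The table name, columns, and connection parameters are illustrative assumptions, not the project's actual schema:

    import psycopg2

    conn = psycopg2.connect(
        host="your-timescaledb-host",  # the EC2 instance from step 2
        dbname="unbiased", user="postgres", password="..."
    )
    with conn, conn.cursor() as cur:
        cur.execute("""
            CREATE TABLE IF NOT EXISTS events (
                event_time  TIMESTAMPTZ NOT NULL,
                country     TEXT,
                article     TEXT,
                score       DOUBLE PRECISION
            );
        """)
        # TimescaleDB's core call: partition the table by time.
        cur.execute(
            "SELECT create_hypertable('events', 'event_time', if_not_exists => TRUE);"
        )
    conn.close()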

Code Execution

  1. Scraping
    cd src/dataingestion
    sh runScrapper.sh
  2. Decompression
    cd src/decompressor
    sh decompressor.sh
  3. Processor
    cd src/processor
    sh gdelt_run.sh
    sh wiki_run.sh
  4. Dashboard
    cd src/frontend
    python application.py
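As mentioned in Setup step 3, the Dash entry point must expose its underlying Flask server for Gunicorn. A minimal sketch of that pattern (the layout is a placeholder; this is not the repository's application.py):

    import dash
    from dash import html

    app = dash.Dash(__name__)
    server = app.server  # expose the Flask app so Gunicorn/nginx can serve it

    app.layout = html.Div("UNBIASED dashboard placeholder")

    if __name__ == "__main__":
        app.run_server(host="0.0.0.0", port=8050)

Gunicorn would then bind to application:server rather than application:app, as the DigitalOcean guide describes.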

Optimizations⚙️

  1. Unpigz (see the sketch after this list)
  2. Data Modeling
  3. Query optimization
  4. Database parameters
  5. Serializing
  6. Oversubscription
  7. Partitioning
  8. Spark-Submit
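For item 1, unpigz is the decompressor that ships with pigz. Here is a hedged Python sketch of driving it over a directory of .gz dumps; the repository presumably does the equivalent in src/decompressor/decompressor.sh, and the directory path below is illustrative:

    import subprocess
    from pathlib import Path

    def decompress_all(dump_dir):
        """Decompress every .gz file in dump_dir with unpigz, which is
        markedly faster than plain gunzip on large dumps."""
        for gz in sorted(Path(dump_dir).glob("*.gz")):
            subprocess.run(["unpigz", "-f", str(gz)], check=True)

    decompress_all("data/raw")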

License🔑

This project is licensed under the AGPL-3.0 License; see the License.md file for details.

© All product names, logos, and brands are property of their respective owners.
