
Reddit Data

As of Jan 2024, this project is no longer maintained, because the static-API approach does not scale with the data's ongoing growth in size.

This was causing a build time of 15 minutes on gh-pages. The project has been migrated to FastAPI so that submissions and comments can be queried more easily.

A demo dashboard can be seen here.


This repository contains the source code for a data crawler and a website meant to be deployed on a remote machine.

Currently, every second hour, it crawls the /r/AmItheAsshole subreddit of Reddit.

The data obtained is then inserted into an SQLite database with a Submission and a Comment table.
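
A minimal sketch of what that schema might look like, using Python's sqlite3 module (the two table names come from above; the column names and the database filename are assumptions inferred from the JSON samples later in this README):

import sqlite3

conn = sqlite3.connect("reddit_store.sqlite")  # filename is a placeholder
conn.executescript(
    """
    CREATE TABLE IF NOT EXISTS Submission (
        id            INTEGER PRIMARY KEY,
        submission_id TEXT UNIQUE,   -- Reddit base36 id, e.g. "yfm01l"
        title         TEXT,
        selftext      TEXT,
        permalink     TEXT,
        score         INTEGER,
        created_utc   INTEGER
    );

    CREATE TABLE IF NOT EXISTS Comment (
        id            INTEGER PRIMARY KEY,
        submission_id TEXT REFERENCES Submission (submission_id),
        body          TEXT,
        score         INTEGER,
        created_utc   INTEGER
    );
    """
)
conn.commit()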

Upon completion, it performs data preprocessing and transformation, which includes:

  • the creation of indexes, stored as JSON files, which are used by the front-end demo website
  • the creation of a JSON file for each submission and all comments linked to it, named after the submission's ID (a sketch of these first two steps follows this list)
  • sentiment analysis of all the comments for a specific submission
  • the generation of a search corpus to be used by Fuse.js
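
A rough sketch of the first two steps, assuming the schema sketched above (the output paths mirror the API endpoints shown below; this is not the project's actual code):

import json
import os
import sqlite3

conn = sqlite3.connect("reddit_store.sqlite")
conn.row_factory = sqlite3.Row
os.makedirs("api/indexes", exist_ok=True)
os.makedirs("api/submissions", exist_ok=True)

# indexes.json: one lightweight entry per submission, used by the front end.
rows = conn.execute("SELECT id, score, created_utc FROM Submission").fetchall()
indexes = [
    {"id": r["id"], "scores": r["score"], "created_utc": r["created_utc"]}
    for r in rows
]
with open("api/indexes/indexes.json", "w") as f:
    json.dump(indexes, f)

# One JSON file per submission, named after its ID, holding its replies.
for r in rows:
    replies = conn.execute(
        "SELECT body, score FROM Comment WHERE submission_id ="
        " (SELECT submission_id FROM Submission WHERE id = ?)",
        (r["id"],),
    ).fetchall()
    doc = {"id": r["id"], "replies": [dict(c) for c in replies]}
    with open(f"api/submissions/{r['id']}.json", "w") as f:
        json.dump(doc, f)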

The project is currently deployed on my Raspberry Pi, having moved away from the free AWS EC2 tier.

Dashboards


A demo dashboard can be seen here. It is built with:

  • ChartJS for charts
  • d3 word cloud for generating word clouds
  • NRC Word-Emotion Lexicon for sentiment analysis

Refer here for more information regarding the NRC Word-Emotion Lexicon.
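
To illustrate how lexicon-based emotion scoring works, here is a toy sketch; the lexicon dict below is a tiny stand-in for the real NRC data, not part of it:

from collections import Counter

# Stand-in lexicon: word -> associated emotions. The real NRC lexicon maps
# roughly 14,000 English words to these ten categories.
LEXICON = {
    "friend": ["positive", "trust"],
    "joke": ["positive", "joy", "surprise"],
    "cancel": ["negative", "sadness"],
}

def emotion_counts(comments):
    """Tally emotion hits across all words of all comments."""
    counts = Counter()
    for comment in comments:
        for word in comment.lower().split():
            counts.update(LEXICON.get(word, []))
    return counts

print(emotion_counts(["My friend made a joke", "so I cancel the check"]))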

API

The API for this project uses a combination of GitHub Pages and the Cloudflare CDN to serve a static, read-only API.

The easiest way to use the API is to make a simple GET request to one of the endpoints.

The base URL for the project's API is https://jianliew.me/reddit-store/api

curl https://jianliew.me/reddit-store/api/indexes/indexes.json

would retrieve the list of indexes:

[
    {
        "id": 662,
        "scores": 27442,
        "created_utc": 1666957642
    },
    {
        "id": 12,
        "scores": 17278,
        "created_utc": 1667216880
    },
    ......
]

These IDs can be used to query specific submissions using the submissions endpoint.

curl https://jianliew.me/reddit-store/api/submissions/662.json

would return:

{
    "id": 662,
    "submission_id": "yfm01l",
    "title": "AITA for cancelling a check of $12,000 that I wrote for my infertile friend for her next IVF cycle over a joke?",
    "selftext": "\n\n\nI (F35) am infertile. My ex husband and I tried everything to have kids ...",
    "created_utc": 1666957642,
    "permalink": "/r/AmItheAsshole/comments/yfm01l/aita_for_cancelling_a_check_of_12000_that_i_wrote/",
    "score": 27442,
    "replies": [..........]}
}

The Submission JSON contains all the top-level replies for that specific Submission and can be very large.
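
As an illustration, the indexes and submissions endpoints can be chained together from Python using the requests library (field names follow the samples above):

import requests

BASE = "https://jianliew.me/reddit-store/api"

# Pick the highest-scoring submission from the index.
indexes = requests.get(f"{BASE}/indexes/indexes.json", timeout=30).json()
top = max(indexes, key=lambda entry: entry["scores"])

# Fetch the full submission, including all of its top-level replies.
submission = requests.get(f"{BASE}/submissions/{top['id']}.json", timeout=30).json()
print(submission["title"], len(submission["replies"]))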

The results of the analytics are also available. These are obtained using rudimentary techniques and could be significantly improved upon.

curl https://jianliew.me/reddit-store/api/summary/662.json

which returns

{
    "id": 662,
    "afinn": 832,
    "emotion": {
        "negative": 1096,
        "fear": 448,
        "trust": 1134,
        "positive": 1547,
        "anticipation": 689,
        "joy": 1057,
        "anger": 724,
        "disgust": 418,
        "sadness": 505,
        "surprise": 432
    },
    "word_freq": {
        "nta": 326,
        "friend": 272,
        "money": 213,
        "joke": 201
        ..........
    },
    "counts": {
        "nta_count": 326,
        "yta_count": 1,
        "esh_count": 0,
        "info_count": 1,
        "nah_count": 1
    }
}
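
As a usage sketch, the counts block can be reduced to an overall crowd verdict in a few lines (the acronyms are the AITA judgement labels; this helper is illustrative, not part of the API):

import requests

BASE = "https://jianliew.me/reddit-store/api"

summary = requests.get(f"{BASE}/summary/662.json", timeout=30).json()

# counts tallies how often each judgement acronym appears in the comments.
counts = summary["counts"]
verdict = max(counts, key=counts.get)           # "nta_count" for this one
print(verdict.removesuffix("_count").upper())   # NTA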

Important


  1. Please note that the NRC Emotion Lexicon used to obtain the emotions is intended for research and academic use only. For more information, please refer to the author's website.

  2. If you do not want to deal with the API, you can contact me for the .sqlite database file. Its current size is roughly 50 MB, with 1,400 submissions and 140,000 comments. (A short inspection sketch follows this list.)
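
A file obtained this way can be inspected directly with Python's sqlite3 module (the filename is a placeholder; the table names follow the Submission and Comment tables described above):

import sqlite3

conn = sqlite3.connect("reddit_store.sqlite")
n_subs = conn.execute("SELECT COUNT(*) FROM Submission").fetchone()[0]
n_comments = conn.execute("SELECT COUNT(*) FROM Comment").fetchone()[0]
print(f"{n_subs} submissions, {n_comments} comments")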
