Paranormal subreddits ELT

Objective

This project extracts, loads, and transforms the top posts and comments from four subreddits: Ghoststories, Ghosts, Paranormal, and ParanormalEncounters. By analyzing the data, I aim to determine the hourly distribution of posts, the correlation between post scores and comment counts, the average number of comments per post, the maximum post score, the number of unique authors, and the most frequently used words in the posts' text. To access the data, I use PRAW, the Python Reddit API Wrapper, which simplifies interacting with Reddit's posts, comments, and subreddits.
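
For illustration only, fetching the top posts of a subreddit with PRAW looks roughly like the sketch below; the credential values are placeholders and the real query parameters live in the scripts under flows/:

    import praw

    # Placeholder credentials; the real ones come from a registered Reddit app.
    reddit = praw.Reddit(
        client_id="YOUR_CLIENT_ID",
        client_secret="YOUR_CLIENT_SECRET",
        user_agent="paranormal-elt by u/your_username",
    )

    for name in ["Ghoststories", "Ghosts", "Paranormal", "ParanormalEncounters"]:
        # Top posts of the day; the real flows pick their own time_filter and limit.
        for post in reddit.subreddit(name).top(time_filter="day", limit=10):
            print(post.id, post.title, post.score, post.num_comments)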

Architecture

Technologies used

Data pipeline

Most of the project's orchestration is handled by GitHub Actions workflows located in .github/workflows/ and Prefect flows located in flows/.

GitHub Actions is mainly used to run jobs (Python scripts and dbt commands) on a cron schedule, while Prefect is responsible for defining the flows and connecting to Google Cloud services securely through Blocks and Secrets.
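
For orientation, a Prefect flow in this setup can be pictured as the minimal sketch below; the task and flow names are invented here and do not correspond to the actual code in flows/:

    from prefect import flow, task

    @task
    def fetch_top_posts(subreddit_name: str) -> list[dict]:
        # Placeholder: the real task queries PRAW and returns the new posts.
        return []

    @task
    def write_to_gcs_and_bq(posts: list[dict]) -> None:
        # Placeholder: the real task writes to Google Cloud Storage and BigQuery,
        # using credentials loaded from Prefect Blocks and Secrets.
        ...

    @flow
    def reddit_posts_flow():
        for name in ["Ghoststories", "Ghosts", "Paranormal", "ParanormalEncounters"]:
            write_to_gcs_and_bq(fetch_top_posts(name))

    if __name__ == "__main__":
        reddit_posts_flow()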

To ensure data consistency, each query made through PRAW stores its results in both Google Cloud Storage and BigQuery. Posts and comments that are already stored in these locations are excluded to avoid duplication.
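
The deduplication step can be thought of as filtering new results against the IDs already in BigQuery, as in this sketch; the table and column names are illustrative, not the ones the project actually uses:

    from google.cloud import bigquery

    def filter_new_posts(posts: list[dict]) -> list[dict]:
        """Drop posts whose IDs are already present in the warehouse."""
        client = bigquery.Client()
        # Hypothetical table and column names, for illustration only.
        rows = client.query("SELECT id FROM `my_project.reddit.posts`").result()
        existing_ids = {row["id"] for row in rows}
        return [post for post in posts if post["id"] not in existing_ids]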

At four intervals throughout the day (3 am, 9 am, 3 pm, and 9 pm), I query the top posts of the selected subreddits.

At 3:50 am, I query the comments of each post to extract the most commonly used words and other details.

Finally, at 4:10 am, I use dbt to clean and prepare the data stored in BigQuery for analysis and then serve it in Google Looker Studio.

With all the posts and comments saved in Google Cloud Storage, I generate word clouds and graphs that display the most commonly used words found in the post titles, post text, and comment bodies, which are saved here:
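
The idea behind these word clouds can be sketched with the wordcloud library as below; the real plotting code is in flows/create_plots.py and the output path here is a placeholder:

    from wordcloud import WordCloud

    def make_wordcloud(texts: list[str], output_path: str) -> None:
        """Build a word cloud image from a list of titles, bodies, or comments."""
        combined = " ".join(texts)
        cloud = WordCloud(width=1200, height=600, background_color="white").generate(combined)
        cloud.to_file(output_path)  # e.g. "plots/titles_wordcloud.png" (placeholder path)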

Prerequisites for running the project

  • If you don't already have one, create a project in Google Cloud; you can follow the docs

  • Download Terraform, move into the terraform folder (cd terraform/), and run terraform init to initialize a working directory containing the Terraform configuration files

    • Create an environment variable in your terminal named TF_VAR_project that contains your Google Cloud project ID:
      $ export TF_VAR_project=your_project_id_from_google_cloud
    • Then build the infrastructure with the following commands:
      $ terraform plan -out init_infra.tfplan
      $ terraform apply init_infra.tfplan
  • Python >=3.9 and <3.11

  • To set up the dependencies required for running the Python scripts in this repository, follow the instructions in the documentation to install Poetry. Then, from the root directory of the repository, run poetry install. This command creates a new environment based on the dependencies specified in pyproject.toml, so you get the same library versions I use and can run the scripts easily

  • You will also have to register a Reddit application so that PRAW can authenticate with the Reddit API; the steps are described here: https://praw.readthedocs.io/en/stable/getting_started/quick_start.html#prerequisites

  • Use Prefect Cloud to configure the Secrets and Blocks used by the flow scripts; a short sketch of how they are loaded in code follows this list. Here is a guide for configuring Prefect Cloud, and here the concepts of Blocks and Secrets are explained

  • Create a project in dbt with the name dbt_reddit:

    $ poetry run dbt init dbt_reddit
  • Optional: if you want to run the pipeline in GitHub Actions, you will have to create the following secrets for your repo:

    • DBT_ENV_SECRET_GOOGLE_DATASET (the BigQuery dataset that holds your table)
    • DBT_ENV_SECRET_PROJECT_ID (your Google Cloud project ID)
    • EMAIL (the email linked to your GitHub account)
    • KEYFILE_CONTENTS (the contents of your Google service account JSON key file)
    • PREFECT_API_KEY (your API key for connecting to Prefect Cloud)
    • PREFECT_WORKSPACE (the name of your workspace in Prefect Cloud)
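
As a minimal sketch of how the Prefect Secrets and Blocks mentioned above are consumed inside a flow (the block names are assumptions, not the ones configured for this project):

    from prefect import flow
    from prefect.blocks.system import Secret
    from prefect_gcp import GcpCredentials

    @flow
    def example_flow():
        # Load values previously saved in Prefect Cloud; block names are placeholders.
        reddit_client_id = Secret.load("reddit-client-id").get()
        gcp_credentials = GcpCredentials.load("gcp-credentials")
        ...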

Run the scripts

It is mandatory to have completed the steps in the Prerequisites section.

  • To run the Prefect scripts and scrape the top posts and comments of the subreddits, run the following from the root directory of the repo:

    $ poetry run python flows/reddit_posts_scraper.py
    $ poetry run python flows/reddit_comments_scraper.py
  • To update the BigQuery tables with the contents of Google Cloud Storage, you will have to run:

    # for updating the BigQuery table of posts
    $ poetry run python flows/update_posts_in_bq.py
    # for updating the BigQuery table of comments
    $ poetry run python flows/update_comments_in_bq.py
  • To generate plots of the most commonly used words in the titles, text, and comments of the posts, run:

    $ poetry run python flows/create_plots.py
  • To run dbt models:

    $ cd dbt_reddit/
    $ poetry run dbt run

The dashboard

All of the graphs are created from data in BigQuery and are designed to update dynamically in response to user inputs through the tiles interface. The word cloud of the posts' body, however, is a static representation of the most frequent words and does not change in response to the tiles.

Link to the dashboard