
Data Engineering Project to Extract and Process Solana Reddit Data


seacevedo/Solana-Pipeline


Purpose

Solana is a newer blockchain protocol that has garnered interest in the technology space due to its unique proof-of-history consensus mechanism, which enables very fast transaction times. However, it has been plagued by numerous outages and scandals, most notably the bankruptcy of FTX, which had invested heavily in the Solana ecosystem. It is therefore worth tracking the development of, and community sentiment around, this growing ecosystem, and Solana makes a useful case study for understanding what allows a blockchain protocol to succeed and retain community and developer support. The aim of this project is to build an automated pipeline that gathers and synthesizes data from community discussion around Solana. Using the PRAW API, we extract posts and comments from the Solana subreddit; this data is then transformed and summarized in a dashboard to surface insights from the discussions of the burgeoning Solana community.
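As a rough illustration of the extraction step, here is a minimal PRAW sketch. The credential placeholders, the 100-post limit, and the field names (chosen to mirror the fact tables documented later) are assumptions for illustration, not the repo's exact flow code.

import praw

# Placeholder credentials from the Reddit app created in the replication steps below.
reddit = praw.Reddit(
    client_id="<client_id>",
    client_secret="<client_secret>",
    user_agent="solana-pipeline by u/<reddit_username>",
)

posts, comments = [], []
for submission in reddit.subreddit("solana").new(limit=100):
    posts.append({
        "submission_id": submission.id,
        "submission_author": str(submission.author),
        "submission_title": submission.title,
        "submission_text": submission.selftext,
        "submission_upvotes": submission.score,
        "submission_created_time": submission.created_utc,
        "submission_url": submission.url,
    })
    submission.comments.replace_more(limit=0)  # flatten "load more comments" placeholders
    for comment in submission.comments.list():
        comments.append({
            "comment_id": comment.id,
            "submission_id": submission.id,
            "comment_author": str(comment.author),
            "comment_text": comment.body,
            "comment_upvotes": comment.score,
            "comment_created_time": comment.created_utc,
        })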

Questions This Project Seeks to Answer

  • Who are the top posters on the subreddit?
  • What posts have the most engagement? (Most comments, upvotes, etc.)
  • What is the average sentiment over time?
  • Do the posts with the most upvotes always have the most comments? (A quick way to check this is sketched below.)
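The last question can be checked directly once the fact_engagement table (described below) is built. A minimal pandas sketch, assuming the table is pulled into a DataFrame with pandas-gbq and using placeholder project and dataset names:

import pandas as pd

# Hypothetical export of the fact_engagement table; project and dataset names are placeholders.
df = pd.read_gbq(
    "SELECT submission_upvotes, num_comments FROM solana_reddit.fact_engagement",
    project_id="<your-project-id>",
)

# Rank correlation between upvotes and comment counts: a value well below 1.0
# means the most-upvoted posts are not always the most-commented ones.
print(df["submission_upvotes"].corr(df["num_comments"], method="spearman"))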

Technology Stack

  • PRAW API to scrape posts and associated comments from the Solana subreddit.
  • Google Cloud to upload data to a Google Cloud Storage bucket and use BigQuery as our data warehouse. We will also set up a VM environment to host our Prefect deployment.
  • Terraform to manage our infrastructure as code.
  • Prefect will be used to orchestrate and monitor our pipeline.
  • spaCy is an NLP library that we will use to calculate sentiment for both subreddit posts and comments (see the sketch after this list).
  • DBT to transform our data in BigQuery and prepare it for visualization using SQL.
  • Looker Studio to visualize our transformed dataset.
  • Pandas to import and transform our dataset.
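spaCy on its own does not ship a sentiment scorer, so the pipeline needs an add-on component. The sketch below is one way to do it, assuming the spacytextblob extension (version 4 or later, where polarity lives under doc._.blob) and illustrative thresholds for the sentiment label; it is not necessarily the exact approach used in this repo.

import spacy
from spacytextblob.spacytextblob import SpacyTextBlob  # noqa: F401 -- registers the "spacytextblob" factory

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("spacytextblob")

def score_sentiment(text: str) -> dict:
    # Polarity is in [-1, 1]; the cutoffs below are illustrative, not the repo's.
    doc = nlp(text)
    polarity = doc._.blob.polarity
    if polarity > 0.05:
        label = "positive"
    elif polarity < -0.05:
        label = "negative"
    else:
        label = "neutral"
    return {"sentiment_score": polarity, "sentiment_label": label}

print(score_sentiment("Solana transactions are lightning fast"))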

Pipeline Architecture

[Pipeline architecture diagram]

  • Terraform is used to set up the environment that runs our pipeline. When run, the script creates our BigQuery dataset, our bucket, and the VM that hosts our Prefect deployment.
  • A Prefect agent runs on our VM compute environment and executes any pending deployments. Using the PRAW API, our flow extracts data from the Solana subreddit, calculates the sentiment of the text data, and loads it into a GCS bucket. The data from the bucket is then uploaded to BigQuery tables for further transformation with DBT (a sketch of this load step follows this list).
  • DBT further transforms the data by joining the comments and posts data with their corresponding sentiment data. Aggregations are also applied to calculate the number of comments for each post and the upvote/comment ratios.
  • Looker Studio is used to visualize the fact tables produced by the DBT transformations.
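A hedged sketch of the load step described above, written directly against the google-cloud-storage and google-cloud-bigquery clients rather than the Prefect blocks the repo actually uses; the file, bucket, and table names are placeholders:

import pandas as pd
from google.cloud import bigquery, storage

def load_to_gcs_and_bq(df: pd.DataFrame, bucket_name: str, table_id: str) -> None:
    # Stage the extracted data as a parquet file in the GCS bucket.
    local_path = "submissions.parquet"  # placeholder file name
    df.to_parquet(local_path)
    storage.Client().bucket(bucket_name).blob(f"staging/{local_path}").upload_from_filename(local_path)

    # Append the same data to a BigQuery staging table for DBT to pick up later.
    client = bigquery.Client()
    job = client.load_table_from_dataframe(
        df,
        table_id,  # e.g. "<project>.<dataset>.staging_submission" (placeholder)
        job_config=bigquery.LoadJobConfig(write_disposition="WRITE_APPEND"),
    )
    job.result()  # wait for the load job to finish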

Structure of the Fact Tables

fact_submission

Column Data Type Description
post_id STRING Surrogate key created by DBT from submission_id and submission_created_time
submission_id STRING Reddit submission ID
submission_author STRING Author of the post
submission_title STRING Title of the post
submission_text STRING Text of the post
sentiment_score NUMERIC Sentiment score of the post
sentiment_label STRING Whether the text is positive, negative, or neutral
positive_words STRING Positive words found in the post
negative_words STRING Negative words found in the post
submission_upvotes INTEGER Number of upvotes the post has
submission_downvotes INTEGER Number of downvotes the post has
submission_created_time TIMESTAMP Time the post was created, in UTC
submission_url STRING URL the post links to

fact_comment

Column Data Type Description
reply_id STRING Surrogate key created by DBT from comment_id and comment_created_time
comment_id STRING Reddit comment ID
submission_id STRING Reddit submission ID
comment_author STRING Author of the comment
comment_text STRING Text of the comment
sentiment_score NUMERIC Sentiment score of the comment
sentiment_label STRING Whether the text is positive, negative, or neutral
positive_words STRING Positive words found in the comment
negative_words STRING Negative words found in the comment
comment_upvotes INTEGER Number of upvotes the comment has
comment_downvotes INTEGER Number of downvotes the comment has
comment_created_time TIMESTAMP Time the comment was created, in UTC
comment_url STRING URL the comment links to

fact_engagement

Column Data Type Description
post_id STRING Surrogate key created by DBT from submission_id and submission_created_time
submission_id STRING Reddit submission ID
submission_author STRING Author of the post
submission_title STRING Title of the post
submission_upvotes INTEGER Number of upvotes the post has
submission_downvotes INTEGER Number of downvotes the post has
num_comments INTEGER Total number of comments the post has
upvote_comment_ratio FLOAT Number of upvotes divided by the number of comments; a metric to measure engagement
submission_created_time TIMESTAMP Time the post was created, in UTC

Lineage Graph

[dbt lineage graph]

  • staging.submission, staging.sentiments, and staging.comments are the tables into which we load all of our collected subreddit data.
  • stg_submission, stg_sentiments, and stg_comments are views that apply some type casting and create new surrogate keys. This guards against the API returning the same ID for two different posts.
  • fact_submission and fact_comments are tables that join stg_submission and stg_comments with stg_sentiments. They are partitioned by creation time and clustered by sentiment label.
  • fact_engagement is a table that joins fact_submission and fact_comments and applies aggregations to calculate the number of comments per post and the upvote/comment ratio (illustrated in the sketch after this list).
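The transformations themselves are DBT SQL models, but the engagement logic is simple enough to illustrate in pandas terms. The sketch below follows the fact-table columns documented above and is only an illustration of the aggregation, not the repo's DBT code:

import pandas as pd

def build_engagement(fact_submission: pd.DataFrame, fact_comment: pd.DataFrame) -> pd.DataFrame:
    # Count comments per submission, then join the counts back onto the posts.
    comment_counts = (
        fact_comment.groupby("submission_id").size().rename("num_comments").reset_index()
    )
    engagement = fact_submission.merge(comment_counts, on="submission_id", how="left")
    engagement["num_comments"] = engagement["num_comments"].fillna(0).astype(int)

    # Upvotes per comment; posts with no comments get a null ratio instead of a division error.
    engagement["upvote_comment_ratio"] = (
        engagement["submission_upvotes"]
        / engagement["num_comments"].where(engagement["num_comments"] > 0)
    )
    return engagement[
        ["post_id", "submission_id", "submission_author", "submission_title",
         "submission_upvotes", "submission_downvotes", "num_comments",
         "upvote_comment_ratio", "submission_created_time"]
    ]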

Dashboard Preview

Access the dashboard here

[Dashboard preview]

Replication Steps

Create a Reddit Account

  1. You must create a Reddit account in order to scrape subreddit data.
  2. Once you have an account, go here. Click on the create another app... button at the bottom of the page. Fill in the required fields and select web app. Click on create app.
  3. Under the web app label of your newly created app's information you should see an ID for your application. Record this ID and the secret key; you will need to parameterize your flow with these tokens in order to use the PRAW API.

Setup Google Cloud

  1. Create a Google Cloud account.
  2. Set up a new Google Cloud project.
  3. Create a new service account. Give the service account the Compute Admin, Service Account User, Storage Admin, Storage Object Admin, and BigQuery Admin roles.
  4. After the service account has been created, click on Manage Keys under the Actions menu. Click on the Add Key dropdown and then on Create new key. A prompt will ask whether to download the key as a JSON or P12 file. Choose the JSON format and click Create. Save your key file.
  5. Install the Google Cloud CLI. Assuming you have an Ubuntu Linux distribution or similar as your environment, follow the directions for Debian/Ubuntu. Make sure you log in by running gcloud init and choose the cloud project you created.
  6. Set the following environment variable to point to your downloaded service account key file:

export GOOGLE_APPLICATION_CREDENTIALS=<path/to/your/service-account-authkeys>.json

  7. Refresh the token/session and verify authentication: gcloud auth application-default login

  8. Make sure these APIs are enabled for your project:

https://console.cloud.google.com/apis/library/iam.googleapis.com
https://console.cloud.google.com/apis/library/iamcredentials.googleapis.com
https://console.cloud.google.com/apis/library/compute.googleapis.com

Setup VM Environment

  1. Clone the repo and cd into the Solana-Pipeline folder
  2. Install Terraform
  3. Make sure you have an SSH key generated to log into the VM.
  4. cd into the terraform directory and run terraform init, terraform plan, and terraform apply. You can remove the corresponding infrastructure with terraform destroy. The plan, apply, and destroy commands will prompt you to input the following variables:
Variable Description
GOOGLE_CLOUD_PROJECT_ID ID of the google cloud project
SERVICE_ACCOUNT_EMAIL Email of the service account you used to generate the key file
SSH_USER Username you used for the SSH Key
SSH_PUBLIC_KEY_PATH Path to the public SSH key
SETUP_SCRIPT_PATH Path of the setup.sh script within the Solana-Pipeline directory
SERVICE_ACCOUNT_FILE_PATH Path to the service account key file
GOOGLE_CLOUD_BUCKET_NAME Name of your GCS bucket
  5. Log in to your newly created VM using the following command: ssh -i /path/to/private/ssh/key username@vm_external_ip_address. Alternatively, follow this video to set up SSH in a VS Code environment, which allows port forwarding from your cloud VM to your local machine. Run cd /Solana-Pipeline to change into the /Solana-Pipeline directory, and log in as the super user with sudo su in order to edit files.
  6. Activate the newly created Python virtual environment with the command source solana-pipeline-env/bin/activate (you may have to wait a few minutes for the VM instance to finish running the setup.sh script and installing all necessary dependencies).
  7. Make sure you update the profiles.yml and schema.yml files in your dbt directory. Change the appropriate fields to the correct Google Cloud project ID, dataset name, location, and key file path.
  8. You should now have Prefect installed. Run the Prefect server locally with the prefect orion start command to monitor flows; this is needed to access the Orion UI. In another terminal, cd into the flows directory and run prefect deployment build main_flow.py:run_pipeline -n "solana-pipeline-deployment" --cron "0 0 * * *" -a to build a Prefect deployment that runs at 12:00 AM every day. Make sure you set up the following Prefect blocks before running (a sketch for registering them from Python follows this list):
Block Name Description
crypto-gcp-creds Block pertaining to your Google Cloud credentials. You need the JSON key file you downloaded earlier to set it up
crypto-reddit Block pertaining to the bucket you wish to load the data into
  9. You can then run the deployment with, for example, prefect deployment run run-pipeline/solana-pipeline-deployment --params '{"client_id":<client_id>, "client_secret":<client_secret>, "reddit_username":<reddit_username>, "bucket_name":"solana_reddit", "dbt_dir":"/Solana-Pipeline/solana_subreddit_dbt/", "subreddit":"solana", "subreddit_cap":100, "bq_dataset_location":"us-central1", "num_days":15}'. The deployment run will then be scheduled.
  10. Your scheduled deployment is picked up by a Prefect agent. Run prefect agent start -q "default" to start the agent and run your deployment.
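The two blocks can be registered from Python before building the deployment. The sketch below assumes the prefect-gcp collection is installed; the key file path is a placeholder and the block names match the table above:

from prefect_gcp.credentials import GcpCredentials
from prefect_gcp.cloud_storage import GcsBucket

# Credentials block backed by the service account key file downloaded earlier (path is a placeholder).
creds = GcpCredentials(service_account_file="/path/to/service-account-keyfile.json")
creds.save("crypto-gcp-creds", overwrite=True)

# Bucket block pointing at the GCS bucket the flow loads data into.
GcsBucket(bucket="solana_reddit", gcp_credentials=creds).save("crypto-reddit", overwrite=True)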

Next Steps

  • Add CI/CD tooling to automate the workflow and make it more production ready
  • Take advantage of systemd to run the agent when the VM starts up
  • Add Piperider to allow for data profiling and monitoring
  • Add Docker containers to aid with reproducibility of the presented pipeline
