ThomasNose/data-web-scrape: A YouTube web scraper for fun and intrigue

This code is designed to collect data from YouTube's homepage as an anonymous user. It forms part of an ETL pipeline intended to give some insight into trends, tags, views, titles, etc. over the course of a day.

Current plan:

Python code scrapes the data - Done
Put the data into a pandas DataFrame at ~30-60 minute time deltas (to allow the homepage to update) - Done-ish
Store the data as parquet - Done
Move the data to a PostgreSQL server - Done
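
The ~30-60 minute spacing in the plan could be enforced with a small check before each scrape. The following is only a minimal sketch using the standard library; `MIN_DELTA` and `should_scrape` are hypothetical names and are not taken from the repository.

```python
from datetime import datetime, timedelta, timezone

# Assumed minimum gap between scrapes; the plan mentions roughly 30-60 minutes.
MIN_DELTA = timedelta(minutes=30)


def should_scrape(last_run, now, min_delta=MIN_DELTA):
    """Return True if no scrape has run yet or enough time has passed."""
    if last_run is None:
        return True
    return now - last_run >= min_delta


if __name__ == "__main__":
    last = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    # Ten minutes after the last run is too soon; 45 minutes is enough.
    print(should_scrape(last, last + timedelta(minutes=10)))
    print(should_scrape(last, last + timedelta(minutes=45)))
```

A scheduler would normally own this timing, so a check like this is mainly useful when the scraper is run by hand or from a simple loop.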

This will hopefully be orchestrated by Airflow (or other scheduling software) so that the runs are automated.

INSTALLATION INSTRUCTIONS

Installation can be completed with the script "setup.sh":

run setup.sh
source .venv/bin/activate
pip install -r requirements.txt

SCRIPTS AND WHAT THEY DO

youtube.py collects the HTML from the YouTube homepage as an anonymous user and organises it into clearly named variables
dataframe.py takes the data from youtube.py and puts it into a dictionary. The data is then stored as parquet files
dataframe-to-db.py loops over all the parquet files (one per run of web scraping), creating temporary CSV files which are then concatenated into one large DataFrame. Once all files have been added to that single DataFrame, the data is pushed to a local PostgreSQL database, where all existing values are replaced to avoid duplication. (planning for time deltas)
These are the 3 main Python scripts in utils/
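
The combine-and-replace step described for dataframe-to-db.py can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's code: the function names, the `videos` table, and the columns are hypothetical, and an in-memory SQLite database stands in for the local PostgreSQL server so the example runs without a database install. In the real pipeline the per-run frames would come from the parquet files (e.g. via `pd.read_parquet`).

```python
import sqlite3

import pandas as pd


def combine_runs(frames):
    """Concatenate per-run frames and drop rows that repeat exactly."""
    combined = pd.concat(frames, ignore_index=True)
    return combined.drop_duplicates().reset_index(drop=True)


def replace_table(df, con, table="videos"):
    """Overwrite the table so reloading the same data never duplicates rows."""
    df.to_sql(table, con, if_exists="replace", index=False)


if __name__ == "__main__":
    # Two made-up scraping runs that share one identical row.
    run_1 = pd.DataFrame({"title": ["a", "b"], "views": [10, 20]})
    run_2 = pd.DataFrame({"title": ["b", "c"], "views": [20, 30]})

    con = sqlite3.connect(":memory:")  # stand-in for PostgreSQL
    replace_table(combine_runs([run_1, run_2]), con)
    print(con.execute("SELECT COUNT(*) FROM videos").fetchone()[0])
```

Replacing the whole table is the simplest way to avoid duplicates, at the cost of rewriting every row on each load. Once time deltas are tracked, appending only rows with a new scrape timestamp may be a better fit.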

There are files in local/ which serve as basic code examples for scraping particular data from particular parts of YouTube, e.g. channel.py handles a YouTube channel homepage, home.py handles the YouTube homepage, etc.

channel, home, and video rely on .html files to function
test.py is meant for ad hoc testing

ISSUES

The initial scrape function sometimes stops looping despite being coded to run until data is gathered (YouTube might have a timeout/request limit)
Unable to get the likes/dislikes of a YouTube video easily
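
One possible way to make the scrape loop more predictable is to bound the number of attempts and wait longer between each one. The sketch below assumes the scrape can be wrapped in a zero-argument callable; `fetch_with_retry` and its parameters are hypothetical and not part of the repository, and whether YouTube actually rate-limits these requests is unconfirmed.

```python
import time


def fetch_with_retry(fetch, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it returns data or max_attempts is reached.

    fetch is any zero-argument callable that returns the scraped data.
    An empty result or a raised exception counts as a failed attempt.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            data = fetch()
            if data:
                return data
        except Exception as exc:  # broad on purpose for a sketch
            last_error = exc
        if attempt < max_attempts - 1:
            # Exponential backoff: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError(f"no data after {max_attempts} attempts") from last_error
```

Raising after the last attempt makes a stalled scrape visible to whatever schedules the run, instead of the loop ending silently.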

