This is an educational project and proof-of-concept.
Apache Airflow is a platform to programmatically author, schedule and monitor workflows using Python.
The pipeline does the following:
- Extract data from Twitter: date, user, content, source and location of all #ChatGPT tweets since 2023-01-01 (see file dags/extract.py)
- Transform data: remove all non-ASCII characters, remove line-breaks, etc.
- Load data into a Postgres database.
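The transform step described above can be sketched as a small cleaning function. This is a hypothetical helper (`clean_tweet` is not necessarily the name used in the repository), assuming the transform task does something along these lines:

```python
import re

def clean_tweet(text: str) -> str:
    """Strip non-ASCII characters and collapse line breaks into spaces,
    as the transform step does before loading into Postgres."""
    # Drop every non-ASCII character (emoji, curly quotes, ...).
    ascii_only = text.encode("ascii", errors="ignore").decode("ascii")
    # Replace line breaks (and any run of whitespace) with a single space.
    return re.sub(r"\s+", " ", ascii_only).strip()

print(clean_tweet("ChatGPT is wild \U0001F916\nsecond line"))
# ChatGPT is wild second line
```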
Pipeline task graph:
airflow dags show etl_twitter | sed 1d | graph-easy --as=boxart
etl_twitter
╭──────────────╮ ╭──────────────╮ ╭────────────────╮ ╭───────────╮
│ create_table │ ──▶ │ extract_data │ ──▶ │ transform_data │ ──▶ │ load_data │
╰──────────────╯ ╰──────────────╯ ╰────────────────╯ ╰───────────╯
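The four tasks shown above could be wired up roughly as follows. This is a minimal sketch of the DAG definition, not the repository's actual code — the task bodies are stubbed out to illustrate only the dependency chain:

```python
# dags/etl_twitter.py -- sketch of the task graph shown above.
# The real extract/transform/load logic lives in dags/extract.py etc.;
# here the callables are no-op stubs.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def _noop(**_):
    pass

with DAG(
    dag_id="etl_twitter",
    start_date=datetime(2023, 1, 1),
    schedule=None,  # triggered manually via `airflow dags trigger`
    catchup=False,
) as dag:
    create_table = PythonOperator(task_id="create_table", python_callable=_noop)
    extract_data = PythonOperator(task_id="extract_data", python_callable=_noop)
    transform_data = PythonOperator(task_id="transform_data", python_callable=_noop)
    load_data = PythonOperator(task_id="load_data", python_callable=_noop)

    # create_table -> extract_data -> transform_data -> load_data
    create_table >> extract_data >> transform_data >> load_data
```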
brew install postgresql@14
brew services run postgresql@14
createdb -h localhost -p 5432 -U <USER> twitter
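The create_table task presumably issues a CREATE TABLE matching the extracted fields (date, user, content, source, location). The schema below is an assumption for illustration, not the repository's actual DDL:

```python
# Hypothetical DDL for the target table -- column names mirror the
# extracted fields listed above; the types are a guess.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS tweets (
    date      TIMESTAMPTZ,
    "user"    TEXT,      -- "user" is a reserved word in Postgres, hence the quotes
    content   TEXT,
    source    TEXT,
    location  TEXT
);
"""

print(CREATE_TABLE_SQL)
```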
git clone https://github.com/thomd/twitter-etl-pipeline-with-airflow.git
cd twitter-etl-pipeline-with-airflow
pyenv shell 3.10.9
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
export AIRFLOW_HOME=$(pwd)
URL="https://raw.githubusercontent.com/apache/airflow/constraints-2.5.3/constraints-3.10.txt"
pip install "apache-airflow==2.5.3" --constraint "${URL}"
pip install psycopg2-binary apache-airflow-providers-postgres
export SQLALCHEMY_SILENCE_UBER_WARNING=1
export AIRFLOW__CORE__LOAD_EXAMPLES=False
airflow db init
pip install snscrape pandas
airflow connections add --conn-type postgres --conn-host localhost --conn-schema twitter --conn-login <USER> pg_connection
airflow scheduler -D
airflow dags list
airflow dags unpause etl_twitter
airflow dags trigger etl_twitter
airflow dags list-runs -d etl_twitter
airflow users create -u airflow -p airflow -r Admin -f John -l Doe -e john.doe@airflow.apache.org
airflow webserver -p 8080 -D