twitter-timeline-scraper

Lightweight scripts for scraping the algorithmically-curated Twitter timeline

Audit Your Timeline Algorithm in 5 Minutes

Clone the repository via git clone https://github.com/comp-journalism/twitter-timeline-scraper.git
Install required packages via pip install -r requirements.txt
Download a chromedriver that matches your Chrome version
Add your environment variables in scrape_user_timeline.py

path_to_chromedriver is the location of the download in step 3
data_path is the folder where you want to save the collected data (two csv files)

run python scrape_user_timeline.py, log in to Twitter, and enjoy!

Checklist for EC2

(see https://gist.github.com/ziadoz/3e8ab7e944d02fe872c3454d17af31a5 for guidance)

sudo timedatectl set-timezone America/Chicago (to set the time zone)
Get the tools:

sudo apt update
sudo apt upgrade
sudo apt install unzip
sudo apt install python3-pip
pip3 install selenium pandas lxml
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo apt install ./google-chrome-stable_current_amd64.deb

dpkg -s google-chrome-stable | grep Version (to check version)
wget https://chromedriver.storage.googleapis.com/80.0.3987.106/chromedriver_linux64.zip (with version from previous command)
Install chrome driver:

unzip ~/chromedriver_linux64.zip -d ~/
rm ~/chromedriver_linux64.zip
sudo mv -f ~/chromedriver /usr/local/bin/chromedriver
sudo chown root:root /usr/local/bin/chromedriver
sudo chmod 0755 /usr/local/bin/chromedriver

Selenium (check versions):

SELENIUM_STANDALONE_VERSION=3.141.59
SELENIUM_SUBDIR=$(echo "$SELENIUM_STANDALONE_VERSION" | cut -d"." -f-2)
wget -N https://selenium-release.storage.googleapis.com/$SELENIUM_SUBDIR/selenium-server-standalone-$SELENIUM_STANDALONE_VERSION.jar -P ~/
sudo mv -f ~/selenium-server-standalone-$SELENIUM_STANDALONE_VERSION.jar /usr/local/bin/selenium-server-standalone.jar
sudo chown root:root /usr/local/bin/selenium-server-standalone.jar
sudo chmod 0755 /usr/local/bin/selenium-server-standalone.jar

to get around Twitter's apparent selenium detection: https://stackoverflow.com/questions/33225947/can-a-website-detect-when-you-are-using-selenium-with-chromedriver

Name		Name	Last commit message	Last commit date
Latest commit History 46 Commits
LICENSE		LICENSE
README.md		README.md
requirements.txt		requirements.txt
scrape_user_timeline.py		scrape_user_timeline.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LICENSE

LICENSE

README.md

README.md

requirements.txt

requirements.txt

scrape_user_timeline.py

scrape_user_timeline.py

Repository files navigation

twitter-timeline-scraper

Audit Your Timeline Algorithm in 5 Minutes

Checklist for EC2

About

Languages

License

comp-journalism/twitter-timeline-scraper

Folders and files

Latest commit

History

Repository files navigation

twitter-timeline-scraper

Audit Your Timeline Algorithm in 5 Minutes

Checklist for EC2

About

Topics

Resources

License

Stars

Watchers

Forks

Languages