
Code to download a user's tweet from Twitter bypassing the API limit of 3200


analyticsbot/user-tweet-download


user-tweet-download

Narendra Modi Twitter page

Timeline of Mr. Narendra Modi, Prime Minister of India

Donald Trump home page snapshot

Timeline of Mr. Donald Trump, President of the USA

User Tweet Download

Note: Please use at your own discretion. I used this code to pull data for two Twitter handles for research purposes. Twitter provides an API to download tweets.

Download User tweets

This code can be used to download a user's tweets from Twitter.com. It can help to get past the limit of 3200 tweets imposed by the Twitter API. The code is provided both as a Jupyter notebook and as a Python file.

Requirements

  • requests
  • tweepy
  • selenium
  • pandas

How to Run?

  • Install the necessary dependencies
  • Create a folder named after the Twitter username, or any other suitable name
  • Copy all files into the same directory
  • Edit the config.py file (the variables are described in the tables below)
  • Run the program: either download_tweets_user.ipynb or download_tweets_user.py

What happens when the program is run?

The code first uses the API to fetch the most recent 3200 tweets. It then uses Selenium to fetch any remaining tweets, distributing the work across the worker threads based on the tweet dates.
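The date-based distribution described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function names `split_date_range` and `search_query` are hypothetical, and the use of the `from:`, `since:` and `until:` search operators (with `until:` treated as exclusive) is an assumption about how the Selenium step might query Twitter.

```python
from datetime import date, timedelta


def split_date_range(start, end, num_workers):
    """Split the inclusive range [start, end] into contiguous,
    non-overlapping (start, end) pairs, at most one per worker."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    total_days = (end - start).days + 1
    # Never create more chunks than there are days to cover.
    num_chunks = max(1, min(num_workers, total_days))
    base, extra = divmod(total_days, num_chunks)
    ranges = []
    cursor = start
    for i in range(num_chunks):
        # The first `extra` chunks get one additional day each.
        length = base + (1 if i < extra else 0)
        chunk_end = cursor + timedelta(days=length - 1)
        ranges.append((cursor, chunk_end))
        cursor = chunk_end + timedelta(days=1)
    return ranges


def search_query(screen_name, start, end):
    """Build a search string for one chunk (hypothetical helper).
    Assumes `until:` is exclusive, so one day is added to `end`."""
    until = end + timedelta(days=1)
    return f"from:{screen_name} since:{start.isoformat()} until:{until.isoformat()}"


if __name__ == "__main__":
    for chunk_start, chunk_end in split_date_range(date(2020, 1, 1), date(2020, 1, 10), 3):
        print(search_query("example_user", chunk_start, chunk_end))
```

Each worker would then load the search results for its own chunk, so no two browsers scroll through the same dates.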

What datapoints are provided?

  • Tweet text - denoted by text
  • Number of replies to the tweet - denoted by replies_count
  • Number of retweets to the tweet - denoted by retweet_count
  • Number of times this tweet has been favorited - denoted by favorite_count
  • Url of the tweet - denoted by tweet_url
  • Creation date/time of the tweet - denoted by created_date
  • If a video was attached to the tweet, what is the url - denoted by video_url
  • If a video was attached to the tweet, how many times it is viewed - denoted by video_views
  • The twitter username - denoted by screen_name
  • The language of the tweet - denoted by language
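As an illustration of the fields listed above, one downloaded tweet could be modelled as a small record and written out as CSV. This is a sketch under assumptions: the `TweetRecord` class and `to_csv` helper are hypothetical, the field types are guesses, CSV is only one possible output format, and the values in the usage example are placeholders rather than real data.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class TweetRecord:
    """One row per tweet, using the field names from the list above."""
    text: str
    replies_count: int
    retweet_count: int
    favorite_count: int
    tweet_url: str
    created_date: str
    video_url: Optional[str]    # None when no video is attached
    video_views: Optional[int]  # None when no video is attached
    screen_name: str
    language: str


def to_csv(records):
    """Serialize records to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(TweetRecord)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder values for illustration only.
    row = TweetRecord(
        text="example text",
        replies_count=0,
        retweet_count=0,
        favorite_count=0,
        tweet_url="https://twitter.com/example_user/status/1",
        created_date="2020-01-01 00:00:00",
        video_url=None,
        video_views=None,
        screen_name="example_user",
        language="en",
    )
    print(to_csv([row]))
```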

Tradeoffs

Either Chrome or Firefox can be used for the Selenium part of the download. Adding more threads can make the process faster, but can also cause problems such as being throttled by Twitter or too many browsers consuming a lot of RAM. The number of threads needs to be tuned for the workload, as explained below.

Default

| Variable Name | Description | Default value |
| --- | --- | --- |
| DATE_IN_PAST | Download tweets going back to this date, if available. The date should be no earlier than the creation date of the account; if it is earlier, the creation date is used | 01-01-2020 |
| DAYS_IN_PAST | Download tweets going back this many days. This is similar to DATE_IN_PAST. The earlier of DATE_IN_PAST and DAYS_IN_PAST is used | 5 |
| NUM_TWEETS_TO_DOWNLOAD | Number of tweets to download for the user. The earlier of NUM_TWEETS_TO_DOWNLOAD, DATE_IN_PAST and DAYS_IN_PAST is used when there is a conflict. If this value is less than 3200, tweets are downloaded with the API only. Keep this value above 3200 to download all available tweets | 100 |
| OUTPUT_FILE_NAME_SUFFIX | Suffix to add to the output file name | None |
| TIME_SLEEP | Time to sleep between page loads in Selenium. This helps avoid detection by the server and the resulting throttling of connection requests. Choose the value with the user's total number of tweets and the time available for the download in mind | 5 |
| TIME_SLEEP_BROWSER_CLOSE | Time to sleep when the Selenium browser is closed and reopened, which is done to delete any possible cookies. The same considerations as for TIME_SLEEP apply | 2 |
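The rule "the earlier of DATE_IN_PAST and DAYS_IN_PAST is used", together with the fallback to the account creation date, can be sketched as below. The function name `resolve_cutoff` is hypothetical, and the day-month-year reading of the `01-01-2020` default is an assumption, since the README does not state the date format.

```python
from datetime import date, datetime, timedelta

# Assumption: DATE_IN_PAST is written as day-month-year.
DATE_FORMAT = "%d-%m-%Y"


def resolve_cutoff(date_in_past, days_in_past, account_created, today):
    """Return the oldest date for which tweets should be downloaded."""
    by_date = datetime.strptime(date_in_past, DATE_FORMAT).date()
    by_days = today - timedelta(days=days_in_past)
    # The earlier of the two settings is used.
    cutoff = min(by_date, by_days)
    # A cut-off before the account existed falls back to the creation date.
    return max(cutoff, account_created)


if __name__ == "__main__":
    print(resolve_cutoff("01-01-2020", 5, date(2015, 1, 1), date.today()))
```

Passing `today` in explicitly, rather than calling `date.today()` inside the function, keeps the calculation easy to check with fixed dates.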

Twitter

| Variable Name | Description | Default value |
| --- | --- | --- |
| TWITTER_USER_NAME | The Twitter username, without quotes | |
| CONSUMER_KEY | The consumer key of the Twitter developer API | |
| CONSUMER_SECRET | The consumer secret of the Twitter developer API | |
| ACCESS_TOKEN | The access token of the Twitter developer API | |
| ACCESS_TOKEN_SECRET | The access token secret of the Twitter developer API | |
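For orientation, the API step that uses these credentials might look roughly like the sketch below. It is not the repository's code: `api_share` and `fetch_recent_tweets` are hypothetical names, and the tweepy calls (`OAuthHandler`, `API`, `Cursor` over `user_timeline`) follow the tweepy 3.x style referenced in this README, so newer tweepy releases or changes in Twitter API access may require adjustments.

```python
API_TIMELINE_LIMIT = 3200  # the API cap mentioned in this README


def api_share(num_tweets_to_download):
    """Number of tweets the API step can cover; the rest is left to Selenium."""
    return min(num_tweets_to_download, API_TIMELINE_LIMIT)


def fetch_recent_tweets(screen_name, num_tweets, consumer_key, consumer_secret,
                        access_token, access_token_secret):
    """Fetch the most recent tweets of a user through the API (sketch)."""
    import tweepy  # imported here so api_share() works without tweepy installed

    auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth, wait_on_rate_limit=True)

    # Cursor pages through the timeline, 200 tweets per request.
    cursor = tweepy.Cursor(api.user_timeline, screen_name=screen_name, count=200)
    rows = []
    for status in cursor.items(api_share(num_tweets)):
        rows.append({
            "text": status.text,
            "retweet_count": status.retweet_count,
            "favorite_count": status.favorite_count,
            "tweet_url": f"https://twitter.com/{screen_name}/status/{status.id}",
            "created_date": status.created_at.isoformat(),
            "screen_name": status.user.screen_name,
            "language": status.lang,
        })
    return rows
```

With the default NUM_TWEETS_TO_DOWNLOAD of 100, `api_share` returns 100 and the Selenium step is not needed; for values above 3200 it returns 3200 and the remainder has to come from the browser.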

Chrome

| Variable Name | Description | Default value |
| --- | --- | --- |
| CHROME_GECKODRIVER_LOCATION | The location of an already downloaded chromedriver from https://chromedriver.chromium.org/downloads. If not set, the driver is downloaded from the web based on the operating system | None |
| USE_CHROME | Use Chrome to download the tweets via Selenium (bool) | 0 |
| NUM_THREADS_CHROME | Number of threads to use. Each thread has its own Chrome browser. The value should depend on the number of tweets to download, the urgency, and the capacity of the system. If USE_CHROME is True and NUM_THREADS_CHROME is 0, NUM_THREADS_CHROME defaults to 1 | 1 |
| linux64 | Edit the URL if your system is Linux based | https://chromedriver.storage.googleapis.com/83.0.4103.14/chromedriver_linux64.zip |
| windows | Edit the URL if your system is Windows based | https://chromedriver.storage.googleapis.com/83.0.4103.14/chromedriver_win32.zip |
| macos | Edit the URL if your system is Mac based | https://chromedriver.storage.googleapis.com/83.0.4103.14/chromedriver_mac64.zip |
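Choosing the driver URL "based on the operating system" could be done along these lines. This is only a sketch: the `chromedriver_url` function and the mapping from `platform.system()` values to the `linux64`, `windows` and `macos` keys are assumptions, not the repository's implementation. The URLs are the defaults from the table above.

```python
import platform

# Default download URLs, keyed as in the table above.
CHROME_DRIVER_URLS = {
    "linux64": "https://chromedriver.storage.googleapis.com/83.0.4103.14/chromedriver_linux64.zip",
    "windows": "https://chromedriver.storage.googleapis.com/83.0.4103.14/chromedriver_win32.zip",
    "macos": "https://chromedriver.storage.googleapis.com/83.0.4103.14/chromedriver_mac64.zip",
}

# Assumed mapping from platform.system() values to the keys above.
SYSTEM_TO_KEY = {"Linux": "linux64", "Windows": "windows", "Darwin": "macos"}


def chromedriver_url(system=None):
    """Return the chromedriver URL for the given (or current) operating system."""
    system = system or platform.system()
    if system not in SYSTEM_TO_KEY:
        raise ValueError(f"No chromedriver URL configured for {system!r}")
    return CHROME_DRIVER_URLS[SYSTEM_TO_KEY[system]]


if __name__ == "__main__":
    print(chromedriver_url())
```

The chromedriver version in the URL has to match the installed Chrome version, which is why the table asks you to edit the URL for your system.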

Firefox

| Variable Name | Description | Default value |
| --- | --- | --- |
| FIREFOX_GECKODRIVER_LOCATION | The location of an already downloaded geckodriver from https://github.com/mozilla/geckodriver/releases. If not set, the driver is downloaded from the web based on the operating system | None |
| USE_FIREFOX | Use Firefox to download the tweets via Selenium (bool) | 1 |
| NUM_THREADS_FIREFOX | Number of threads to use. Each thread has its own Firefox browser. The value should depend on the number of tweets to download, the urgency, and the capacity of the system. If USE_FIREFOX is True and NUM_THREADS_FIREFOX is 0, NUM_THREADS_FIREFOX defaults to 1 | 1 |
| macos | Edit the URL if your system is Mac based | https://github.com/mozilla/geckodriver/releases/download/v0.26.0/geckodriver-v0.26.0-macos.tar.gz |
| linux32 | Edit the URL if your system is 32-bit Linux | https://github.com/mozilla/geckodriver/releases/download/v0.26.0/geckodriver-v0.26.0-linux32.tar.gz |
| linux64 | Edit the URL if your system is 64-bit Linux | https://github.com/mozilla/geckodriver/releases/download/v0.26.0/geckodriver-v0.26.0-linux64.tar.gz |
| windows32 | Edit the URL if your system is 32-bit Windows | https://github.com/mozilla/geckodriver/releases/download/v0.26.0/geckodriver-v0.26.0-win32.zip |
| windows64 | Edit the URL if your system is 64-bit Windows | https://github.com/mozilla/geckodriver/releases/download/v0.26.0/geckodriver-v0.26.0-win64.zip |
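The thread defaults for both browsers follow the same rule, which can be sketched as a small helper. The name `effective_threads` is hypothetical, and returning 0 for a disabled browser is an assumption; the README only states what happens when the browser is enabled and the thread count is 0.

```python
def effective_threads(use_browser, num_threads):
    """Number of browser threads to start for one browser type.

    If the browser is enabled and num_threads is 0, one thread is used.
    Assumption: a disabled browser gets no threads at all.
    """
    if not use_browser:
        return 0
    return max(1, num_threads)


if __name__ == "__main__":
    # Defaults from the tables: USE_CHROME = 0, USE_FIREFOX = 1, one thread each.
    chrome = effective_threads(use_browser=False, num_threads=1)
    firefox = effective_threads(use_browser=True, num_threads=1)
    print(f"Chrome threads: {chrome}, Firefox threads: {firefox}")
```

Since every thread opens its own browser, the sum over both browsers gives a rough idea of how many browser windows, and how much RAM, a run will need.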

Discuss

Feel free to post any questions, comments, or bug reports. The Twitter UI changes from time to time, so the Selenium part might break.


References

Twitter Search:

Regular Syntax: https://twitter.com/search-advanced

Advanced Syntax: https://help.twitter.com/en/using-twitter/advanced-tweetdeck-features

Tweepy API

API Reference: http://docs.tweepy.org/en/latest/api.html

Cursor Tutorial: http://docs.tweepy.org/en/latest/cursor_tutorial.html
