Twitter Bird Watcher: A Twitter Profile Archival Tool

TBWatcher snapshots a profile page when given a URL (or an exported .js list from the official Twitter exporter). It saves UTF-8 JSON text files and image snapshots of each Twitter post!

This script is intended for archival use only.

Quick Highlights

⚡ Multi-threaded!
🗄️ Neatly stores metadata in JSON format for each specified Twitter profile.
📸 Snapshots tweets, thread replies, and responses.
♻️ Marks potential tweets that are self-retweeted.
🚩 Removes Tweet Ads.
🖥️ Allows for manual login (use at your own risk).

Usage

# Install the requirements. Once only.
python -m pip install -r requirements.txt

# Take a snapshot from a given profile URL.
python bin/watcher.py --url www.twitter.com/<profile>

# Take a snapshot of profile tweets and their replies
python bin/watcher.py --url www.twitter.com/<profile> -d 2

# For more help use:
python bin/watcher.py --help

Tested on Python 3.10.
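
To archive several profiles in one go, the commands above can be wrapped in a small driver. The following is a minimal sketch, not part of TBWatcher: the helper names `build_command` and `snapshot_profiles` are hypothetical, and it assumes the script is run from the repository root using only the `--url` and `-d` flags shown above.

```python
import subprocess
import sys


def build_command(profile, depth=None):
    """Build the argument list for one watcher.py run (hypothetical helper)."""
    cmd = [sys.executable, "bin/watcher.py", "--url", f"www.twitter.com/{profile}"]
    if depth is not None:
        # -d is the depth flag from the usage examples above.
        cmd += ["-d", str(depth)]
    return cmd


def snapshot_profiles(profiles, depth=None):
    """Run the watcher once per profile, stopping at the first failure."""
    for profile in profiles:
        subprocess.run(build_command(profile, depth), check=True)
```

Calling `snapshot_profiles(["<profile_a>", "<profile_b>"], depth=2)` would run the second usage command once per profile.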

Output

TBWatcher generates the following in the snapshots folder (assuming --depth 2):

└───snapshots
    └───<user_id>           # Username
        │   metadata.json   # profile metadata
        │   profile.png     # snapshot of profile page
        │   tweets.json     # text format of all tweets on profile page
        │
        └───<prof_tweet_id_0>
            │   <prof_tweet_id_0>.png  # Snapshot
            │   tweets.json            # Responses to <prof_tweet_id_0>
            │
            ├───<response_tweet_id_0>
            │       <response_tweet_id_0>.png # Snapshot
            │
            └───<response_tweet_id_1>
                    <response_tweet_id_1>.png # Snapshot
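
Given the layout above, the archived text can be read back with a short traversal. This is a rough sketch under the assumption that every `tweets.json` sits exactly where the tree shows it; `iter_tweet_files` is a hypothetical helper name, not something TBWatcher provides.

```python
import json
from pathlib import Path


def iter_tweet_files(snapshots_dir, user_id):
    """Yield (path, entries) for every tweets.json under one profile folder."""
    profile_dir = Path(snapshots_dir) / user_id
    # rglob also finds the per-tweet tweets.json files that hold responses.
    for path in sorted(profile_dir.rglob("tweets.json")):
        with path.open(encoding="utf-8") as fh:
            yield path, json.load(fh)
```

The first file yielded for a profile holds the profile-page tweets or the responses to one tweet, depending on sort order, so callers that care about the difference can check `path.parent.name`.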

Detailed Highlights

Multi-Threading

By default, multi-threading is enabled, with the thread count proportional to the number of cores on your computer. Each thread spawns its own window. Resist the urge to resize the windows, as it can mess up the renders. You can still move the windows around.

If you find yourself out of memory, consider lowering the number of threads.

Self Boosted Tweet Detection

A self-boosted tweet is a tweet that its original author retweets. These tweets are marked with potential_boost as true in tweets.json. The script detects them by matching exact metadata, e.g. duplicate posts.
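
The idea of matching exact metadata can be illustrated with a small sketch. This is not TBWatcher's actual implementation: the function name `mark_potential_boosts`, the choice of compared fields, and the decision to flag only the later duplicate are all assumptions made for illustration.

```python
def mark_potential_boosts(tweets, keys=("handle", "timestamp", "tweet_text")):
    """Flag entries whose selected metadata repeats an earlier entry.

    The compared keys are an assumption; the real script may use others.
    """
    seen = set()
    for tweet in tweets:
        fingerprint = tuple(tweet.get(key) for key in keys)
        # Only the repeated occurrence is flagged, not the first one.
        tweet["potential_boost"] = fingerprint in seen
        seen.add(fingerprint)
    return tweets
```

Because the match is exact, two posts that differ in any compared field are treated as distinct, which is why the flag only marks potential boosts.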

Schemas

Assume all data is UTF-8 compliant.

Input File

The Twitter exporter generates these files (.js) from the list of users you are following:

window.* = [
    {
        "following": {
            "accountId": <id>,
            "userLink": <url>
        }
        ...
    }
]

You can rename the file as JSON or specify it via input flags to parse it. The window.* = prefix, which Twitter generates by default, is removed automatically by the script. You can also remove it manually to parse the file as JSON directly.
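
As a rough sketch of the prefix removal described above (not the script's own code), the assignment can be stripped with a regular expression before handing the rest to a JSON parser. The function name `load_following` and the regular expression are assumptions; only the `following` and `userLink` keys come from the schema above.

```python
import json
import re

# Matches a leading "window.<anything> = " assignment, if present.
_WINDOW_PREFIX = re.compile(r"^\s*window\.[^=]*=\s*")


def load_following(text):
    """Return the userLink of each followed account in an exported file."""
    payload = _WINDOW_PREFIX.sub("", text, count=1)
    entries = json.loads(payload)
    return [entry["following"]["userLink"] for entry in entries if "following" in entry]
```

Since the substitution is a no-op when the prefix is absent, the same function accepts files that were already cleaned by hand.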

tweets.json

[
    {
        "id": int,
        "tag_text": str,
        "name": str,
        "handle": str,
        "timestamp": str,
        "tweet_text": str,
        "retweet_count": str,
        "like_count": str,
        "reply_count": str,
        "potential_boost":  bool,
        "parent_id": str | null
    }
]

id is the index assigned by Twitter. Invalid string entries will be marked as "NULL".
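
When post-processing tweets.json, the "NULL" marker and the potential_boost flag are the two fields most likely to need handling. The helpers below are a hypothetical sketch built only on the schema above; `clean_tweets` and `original_tweets` are not part of TBWatcher.

```python
def clean_tweets(tweets):
    """Return copies of the entries with the "NULL" marker replaced by None."""
    return [
        {key: (None if value == "NULL" else value) for key, value in tweet.items()}
        for tweet in tweets
    ]


def original_tweets(tweets):
    """Keep only entries that were not flagged as potential self-boosts."""
    return [tweet for tweet in tweets if not tweet.get("potential_boost")]
```

Both functions take the list produced by `json.load` on a tweets.json file. Note that the count fields are strings in the schema, so any numeric conversion is left to the caller.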

metadata.json

{
    "bio": str,
    "name": str,
    "username": str,
    "location": str,
    "website": str,
    "join_date": str,
    "following": str,
    "followers": str
}

Invalid string entries will be marked as "NULL".

Troubleshoot

TBWatcher terminates early?

It is possible that your images are taking some time to load. Consider using -s to adjust the load time. Alternatively, your scrolling height may be too low or too high. Consider using --scroll-algorithm to change the type of algorithm, then pass a value to the algorithm with --scroll-value.

"--help" has more information as to what --scroll-value encodes.

TBWatcher does not scrape anything, or tweets are cut off?

Try running with --debug and see if there are any "Unable to locate element" errors. If so, your render window size may be a bit too small. Under the hood, we use Chrome to render tweets, which requires a sufficiently large browser window.

Try to modify --window-size such that each tweet is clearly rendered.

Out of memory issues?

Each thread spawns a unique Chrome window. Try reducing the number of threads with -t / --multi-threading.

Contributing

Interested in contributing? Take a look at our CONTRIBUTING.md.

Future Updates and Goals

Support Running Multiple Sessions to Resume Per-Profile Fetching
Save and Expand Post Attachments

