Twitter Conversation Crawler

An easy implementation of crawling conversational tweet threads. You may use it for building dialog datasets for training chatbots based on machine learning.

This requires BEARER_TOKEN of Twitter Developer API v2.

Crawl

Modify LANG = "en" in search.py, if you want tweets in other languages like Japanese ("ja").

export BEARER_TOKEN='_____YOUR_BEARER_TOKEN_HERE_____'
python -u search.py > tweets.jsonl

Keep the script running for crawling. Each line is the json of a crawled result based on conversation_id.

Extract Conversation Chains

After some crawling, you can extract the data as conversational chain. Each line is the json of a list of tweet dicts.

python extract.py < tweets.jsonl > convs.jsonl

Data Examples

a line of `tweets.jsonl` (from `search.py`)

{
  "data": [
    {
      "author_id": "2879237869",
      "conversation_id": "1362605931436019724",
      "created_at": "2021-02-19T11:57:29.000Z",
      "entities": {
        "mentions": [
          {
            "end": 9,
            "start": 0,
            "username": "xcanamem"
          },
          {
            "end": 23,
            "start": 10,
            "username": "xbeniha_trpg"
          }
        ]
      },
      "id": "1362733036882649088",
      "in_reply_to_user_id": "1273448559900110848",
      "lang": "ja",
      "possibly_sensitive": false,
      "referenced_tweets": [
        {
          "id": "1362731511473086465",
          "type": "replied_to"
        }
      ],
      "reply_settings": "following",
      "source": "Twitter for Android",
      "text": "@xcanamem @xbeniha_trpg わーい。ありがとうございます！
    },
    {
      "author_id": "1273448559900110848",
      "conversation_id": "1362605931436019724",
      "created_at": "2021-02-19T11:51:25.000Z",
      ...
      "text": "@xbeniha_trpg3 @xKanraKor 愛しています！"
    },
    ...
  ],
  "includes": {
    "users": [
      {
        "created_at": "2014-10-27T07:40:58.000Z",
        "description": "TRPG専用垢。",
        "id": "2879237869",
        "name": "あいうえお太郎",
        "protected": false,
        "public_metrics": {
          "followers_count": 44,
          "following_count": 52,
          "listed_count": 1,
          "tweet_count": 5287
        },
        "username": "xKanraKoro"
      },
      ...
    ]
  },
  "meta": {
    "newest_id": "1362733036882649088",
    "next_token": "b26v89c19zqg8o3fosns33kwp7sfdsafsacdscscdsafsa",
    "oldest_id": "1362728096110022663",
    "result_count": 10
  }
}

a line of `convs.jsonl` (from `extract.py`)

[
  {
    "author_id": "2879237869",
    "conversation_id": "1362605931436019724",
    "created_at": "2021-02-19T11:37:51.000Z",
    ...
    "text": "@xcanamemm @xbeniha_trpg3 任せました！"
  },
  {
    "author_id": "1340661951769104384",
    "conversation_id": "1362605931436019724",
    "created_at": "2021-02-19T11:38:45.000Z",
    ...
    "text": "@xKanraKoro @xcanamemm 了解です！！！"
  },
  ...
]

Velocity of Crawling

The twitter API has a MONTHLY TWEET CAP USAGE (typically, 500000/month).
- You can check the remaining MONTHLY TWEET CAP USAGE on the Dashboard.
- (And also, it has the caps of the number of requests per 15-minute. See the doc.)
One loop usually accesses around 100 tweets with variances.
So, we can run 7 loops per hour at maximum (500000/30/24/100 = 6.94...).
The default velocity is set as 5 loops per hour (12 min interval). If you wanna change, modify search.py.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
LICENSE		LICENSE
README.md		README.md
extract.py		extract.py
search.py		search.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LICENSE

LICENSE

README.md

README.md

extract.py

extract.py

search.py

search.py

Repository files navigation

Twitter Conversation Crawler

Crawl

Extract Conversation Chains

Data Examples

a line of `tweets.jsonl` (from `search.py`)

a line of `convs.jsonl` (from `extract.py`)

Velocity of Crawling

About

Releases

Packages

Languages

License

soskek/twitter_conversation_crawler

Folders and files

Latest commit

History

Repository files navigation

Twitter Conversation Crawler

Crawl

Extract Conversation Chains

Data Examples

a line of tweets.jsonl (from search.py)

a line of convs.jsonl (from extract.py)

Velocity of Crawling

About

Resources

License

Stars

Watchers

Forks

Languages

a line of `tweets.jsonl` (from `search.py`)

a line of `convs.jsonl` (from `extract.py`)