How to retrieve only a single Tweet, not subtweets #99

cguess · 2023-01-30T22:53:25Z

Running the sample code for getting a tweet by a single ID actually scrapes dozens of tweets, takes 30+ seconds, and returns a massive list. Is it possible to just get the first tweet object and then stop, so the run is more time-efficient?

I've tried messing around with the context, but I can't seem to wrap my head around how it's actually used in this project. I may just be being dense, though, and any clarification would be greatly appreciated.

oneroyalace · 2023-03-05T01:59:24Z

@cguess is this what you were looking for?

search_tweets_task = stweet.SearchTweetsTask(from_username=tweet_author, replies_filter=RepliesFilter.ONLY_ORIGINAL)

stweet/stweet/search_runner/search_tweets_task.py

Line 29 in fe34e98

replies_filter: Optional[RepliesFilter]

stweet/stweet/search_runner/replies_filter.py

Lines 6 to 10 in fe34e98

    
class RepliesFilter(enum.Enum):
    """Domain RepliesFilter enum class."""

    ONLY_REPLIES = 1
    ONLY_ORIGINAL = 2

ataniz · 2023-04-28T12:14:09Z

@oneroyalace are you aware of any way to apply such filters to TweetsByIdTask? I have not found any options in the code. I tried playing with the context, but it seems to work only as a counter for stats and doesn't seem to provide any options for limiting the number of replies or requests.

junyilou · 2023-05-07T13:51:53Z

I've done some quick research, and it seems the TweetsByIdContext object is exactly what you're looking for. Here:

stweet/stweet/tweets_by_ids_runner/tweets_by_id_runner.py

Lines 71 to 77 in fe34e98

    
parsed_list = get_all_tweets_from_json(response.text)
cursors = [it for it in parsed_list if isinstance(it, Cursor)]
cursor = cursors[0] if len(cursors) > 0 else None
user_tweet_raw = [it for it in parsed_list if isinstance(it, UserTweetRaw)]
self.tweets_by_id_context.add_downloaded_tweets_count_in_request(len(user_tweet_raw))
self.tweets_by_id_context.cursor = cursor
self._process_new_tweets_to_output(user_tweet_raw)

parsed_list is a list of UserTweetRaw and Cursor objects: the Cursor objects tell the runner where to go next, and the UserTweetRaw objects are the downloaded tweets. The runner decides whether the scrape is finished by checking whether the cursor attribute of the context is None.

If you only want the exact tweet you asked for, you can force the cursor to be None, like this:

parsed_list = get_all_tweets_from_json(response.text)
# cursors = [it for it in parsed_list if isinstance(it, Cursor)]
# cursor = cursors[0] if len(cursors) > 0 else None
cursor = None # force the cursor to be None
user_tweet_raw = [it for it in parsed_list if isinstance(it, UserTweetRaw)]
self.tweets_by_id_context.add_downloaded_tweets_count_in_request(len(user_tweet_raw))
self.tweets_by_id_context.cursor = cursor
self._process_new_tweets_to_output(user_tweet_raw)

With this edit, when the while loop in the run method calls _is_end_of_scrapping, ctx.cursor is None, so the loop condition is no longer satisfied. The loop ends, and the scrape ends with it.
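The termination logic described above can be modelled with a toy loop. This is only a rough sketch, not stweet's actual code: ToyContext, ToyRunner, and the canned pages list are hypothetical names invented for illustration, and the end-of-scraping check is an assumption about how such a cursor-driven runner behaves.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyContext:
    """Hypothetical stand-in for the runner's context (not stweet's class)."""
    cursor: Optional[str] = None
    requests_count: int = 0


class ToyRunner:
    """Hypothetical runner that pages through canned responses."""

    def __init__(self, pages: List[Optional[str]], force_single_request: bool) -> None:
        # pages[i] is the cursor returned by request i (None means no more pages).
        self.pages = pages
        self.force_single_request = force_single_request
        self.context = ToyContext()

    def _is_end_of_scraping(self) -> bool:
        # Assumed rule: finished once a request was made and no cursor remains.
        return self.context.requests_count > 0 and self.context.cursor is None

    def _execute_next_request(self) -> None:
        cursor = self.pages[self.context.requests_count]
        if self.force_single_request:
            cursor = None  # same idea as forcing the cursor to None above
        self.context.requests_count += 1
        self.context.cursor = cursor

    def run(self) -> int:
        # Returns the number of requests that were made.
        while not self._is_end_of_scraping():
            self._execute_next_request()
        return self.context.requests_count


pages = ["page-2", "page-3", None]
normal_requests = ToyRunner(pages, force_single_request=False).run()
single_request = ToyRunner(pages, force_single_request=True).run()
print(normal_requests, single_request)
```

In this toy model the unmodified runner follows every cursor, while the forced variant stops after its first request, which is where the speed-up comes from.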

This still results in a "one-level" scrape, so if the tweet has replies, some (or all) of them will still be included. You can then pick out the desired tweet by checking the "id_str" in the raw dictionary. Even so, this dramatically increases the speed, especially for tweets with a large number of interactions.
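The "id_str" filtering step could look roughly like the sketch below. It assumes each raw tweet has already been decoded into a Python dict, and it makes no claim about where "id_str" sits in the real payload: the helper names contains_id_str and pick_tweet and the sample layout (including the "legacy" key) are made up for illustration. Because it searches nested values, a reply that embeds the requested tweet would also match, so you may need to tighten it for your data.

```python
from typing import Any, Dict, List, Optional


def contains_id_str(node: Any, tweet_id: str) -> bool:
    """Return True if any nested dict in `node` has "id_str" equal to tweet_id."""
    if isinstance(node, dict):
        if node.get("id_str") == tweet_id:
            return True
        return any(contains_id_str(value, tweet_id) for value in node.values())
    if isinstance(node, list):
        return any(contains_id_str(item, tweet_id) for item in node)
    return False


def pick_tweet(raw_tweets: List[Dict[str, Any]], tweet_id: str) -> Optional[Dict[str, Any]]:
    """Return the first decoded tweet dict that carries the requested id."""
    for raw in raw_tweets:
        if contains_id_str(raw, tweet_id):
            return raw
    return None


# Made-up sample data; the real raw layout may differ.
sample = [
    {"legacy": {"id_str": "111", "full_text": "original"}},
    {"legacy": {"id_str": "222", "full_text": "a reply"}},
]
match = pick_tweet(sample, "111")
print(match)
```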

junyilou · 2023-05-07T14:03:06Z

I've actually found a much easier approach that doesn't require changing the source code.

from typing import Any

from stweet.tweets_by_ids_runner.tweets_by_id_context import TweetsByIdContext

class DummyContext(TweetsByIdContext):
    def __setattr__(self, name: str, value: Any) -> None:
        # Whatever the runner tries to store in `cursor`, store None instead.
        if name == "cursor":
            value = None
        return super().__setattr__(name, value)

Create a DummyContext instance and use it in place of a TweetsByIdContext instance. The only difference is that whenever the runner tries to set the cursor attribute, the __setattr__ method silently replaces the value with None.

Example Usage:

stweet.TweetsByIdRunner(
    tweets_by_id_task=task,
    raw_data_outputs=[output],
    tweets_by_ids_context=DummyContext(),
).run()
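To see the attribute override in isolation, without installing stweet or hitting the network, here is a small sketch. BaseContext is a hypothetical stand-in for TweetsByIdContext, written only for this demo; the real class has its own fields and constructor.

```python
from typing import Any, Optional


class BaseContext:
    """Hypothetical stand-in for TweetsByIdContext, used only for this demo."""

    def __init__(self) -> None:
        self.cursor: Optional[str] = None
        self.requests_count = 0


class DummyContext(BaseContext):
    def __setattr__(self, name: str, value: Any) -> None:
        # Discard whatever the caller tries to store in `cursor`.
        if name == "cursor":
            value = None
        super().__setattr__(name, value)


ctx = DummyContext()
ctx.cursor = "next-page-token"  # silently replaced with None
ctx.requests_count = 3          # other attributes are stored unchanged
print(ctx.cursor, ctx.requests_count)
```

Only assignments to cursor are intercepted, so the counters the runner keeps on the context should continue to work as before.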

ataniz · 2023-05-16T11:22:07Z

Many thanks, @junyilou! I have had success with the strategy you provided. The results still need filtering, as you foresaw, but speed is unharmed.
