How to retrieve only a single Tweet, not subtweets #99

Open
cguess opened this issue Jan 30, 2023 · 5 comments

cguess commented Jan 30, 2023

Running the sample code for getting a tweet by a single id actually scrapes dozens of tweets, takes 30+ seconds, and returns a massive list. Is it possible to get just the first tweet object and then stop, so the run is more time-efficient?

I've tried messing around with the context, but I can't wrap my head around how it's actually used in this project. I may just be being dense, so any clarification would be greatly appreciated.

oneroyalace commented Mar 5, 2023

@cguess is this what you were looking for?

search_tweets_task = stweet.SearchTweetsTask(from_username=tweet_author, replies_filter=RepliesFilter.ONLY_ORIGINAL)

replies_filter: Optional[RepliesFilter]

class RepliesFilter(enum.Enum):
    """Domain RepliesFilter enum class."""
    ONLY_REPLIES = 1
    ONLY_ORIGINAL = 2

ataniz commented Apr 28, 2023

@oneroyalace are you aware of any ways to apply such filters for TweetsByIdTask? I have not found any options in the code and tried playing with the context but it seems to work as a counter for stats and doesn't seem to provide any options for limiting the number of replies/requests.

junyilou commented May 7, 2023

@oneroyalace are you aware of any ways to apply such filters for TweetsByIdTask? I have not found any options in the code and tried playing with the context but it seems to work as a counter for stats and doesn't seem to provide any options for limiting the number of replies/requests.

I've done some quick research, and it seems the TweetsByIdContext object is exactly what you're looking for. Here:

parsed_list = get_all_tweets_from_json(response.text)
cursors = [it for it in parsed_list if isinstance(it, Cursor)]
cursor = cursors[0] if len(cursors) > 0 else None
user_tweet_raw = [it for it in parsed_list if isinstance(it, UserTweetRaw)]
self.tweets_by_id_context.add_downloaded_tweets_count_in_request(len(user_tweet_raw))
self.tweets_by_id_context.cursor = cursor
self._process_new_tweets_to_output(user_tweet_raw)

parsed_list is a list of UserTweetRaw and Cursor objects: the Cursor objects tell the runner where to go next, and the UserTweetRaw objects are the downloaded tweets. The runner decides whether the scrape is finished by checking whether the cursor attribute of the context is None.

If you only want the exact tweet you asked for, you can force the cursor to be None, like this:

parsed_list = get_all_tweets_from_json(response.text)
# cursors = [it for it in parsed_list if isinstance(it, Cursor)]
# cursor = cursors[0] if len(cursors) > 0 else None
cursor = None # force the cursor to be None
user_tweet_raw = [it for it in parsed_list if isinstance(it, UserTweetRaw)]
self.tweets_by_id_context.add_downloaded_tweets_count_in_request(len(user_tweet_raw))
self.tweets_by_id_context.cursor = cursor
self._process_new_tweets_to_output(user_tweet_raw)

With this edit, when the while loop in the run method checks _is_end_of_scrapping, ctx.cursor is None, so the condition for continuing is no longer satisfied. The loop ends, and with it the scrape.
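To make the mechanism concrete, here is a toy sketch of a cursor-driven loop. It does not use stweet: ToyContext, toy_scrape, and the page layout are all hypothetical stand-ins, and the real runner's loop condition may differ in detail. It only illustrates why forcing the cursor to None stops the run after the first request.

```python
from typing import List, Optional


class ToyContext:
    """Hypothetical stand-in for a scraping context; not stweet's real class."""

    def __init__(self) -> None:
        self.cursor: Optional[str] = None
        self.requests_count = 0


def toy_scrape(pages: List[dict], ctx: ToyContext, force_single_page: bool = False) -> List[str]:
    """Collect tweets page by page, following each page's cursor until it is None."""
    by_id = {page["id"]: page for page in pages}
    collected: List[str] = []
    page = pages[0]
    while True:
        ctx.requests_count += 1
        collected.extend(page["tweets"])
        # Forcing the cursor to None mimics the edit described above.
        ctx.cursor = None if force_single_page else page["next"]
        if ctx.cursor is None:  # no cursor left, so the scrape is finished
            break
        page = by_id[ctx.cursor]
    return collected


# Made-up pages: the first one holds the requested tweet plus one reply.
pages = [
    {"id": "p0", "tweets": ["target", "reply-1"], "next": "p1"},
    {"id": "p1", "tweets": ["reply-2", "reply-3"], "next": "p2"},
    {"id": "p2", "tweets": ["reply-4"], "next": None},
]

full_ctx = ToyContext()
full = toy_scrape(pages, full_ctx)

single_ctx = ToyContext()
single = toy_scrape(pages, single_ctx, force_single_page=True)

print(full_ctx.requests_count, len(full))
print(single_ctx.requests_count, len(single))
```

In the toy data, the unforced run makes one request per page, while the forced run stops after the first page and still returns the reply that shares that page with the target.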

This still results in a "one-level" scrape, so if the tweet has replies, some (or all) of them will still be included. You can then pick out the desired tweet by checking "id_str" in the raw dictionary. Even so, this dramatically increases the speed, especially for tweets with a large number of interactions.
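A minimal sketch of that filtering step, assuming each downloaded tweet has already been parsed into a plain dict. The sample dicts, the "legacy" nesting, and the helper names iter_dicts and contains_tweet_id are hypothetical; the layout of the real raw output should be checked before relying on this.

```python
from typing import Any, Iterator


def iter_dicts(node: Any) -> Iterator[dict]:
    """Yield every dict found anywhere inside a nested JSON-like structure."""
    if isinstance(node, dict):
        yield node
        for value in node.values():
            yield from iter_dicts(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_dicts(item)


def contains_tweet_id(raw: dict, tweet_id: str) -> bool:
    """Return True if any nested dict carries an "id_str" equal to tweet_id."""
    return any(d.get("id_str") == tweet_id for d in iter_dicts(raw))


# Hypothetical raw dictionaries: the requested tweet and one reply to it.
raw_tweets = [
    {"legacy": {"id_str": "100", "full_text": "original"}},
    {"legacy": {"id_str": "101", "in_reply_to_status_id_str": "100", "full_text": "a reply"}},
]

matches = [raw for raw in raw_tweets if contains_tweet_id(raw, "100")]
print(len(matches))
```

The search is recursive because the position of "id_str" inside the raw dictionary is an assumption here. Note that it matches only the "id_str" key itself, so a reply that merely points at the target through "in_reply_to_status_id_str" is not selected.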

junyilou commented May 7, 2023

@oneroyalace are you aware of any ways to apply such filters for TweetsByIdTask? I have not found any options in the code and tried playing with the context but it seems to work as a counter for stats and doesn't seem to provide any options for limiting the number of replies/requests.

I've actually found a much easier approach that doesn't require changing the source code.

from typing import Any

from stweet.tweets_by_ids_runner.tweets_by_id_context import TweetsByIdContext

class DummyContext(TweetsByIdContext):
    def __setattr__(self, __name: str, __value: Any) -> None:
        # Whatever the runner assigns to "cursor", store None instead.
        if __name == "cursor":
            __value = None
        return super().__setattr__(__name, __value)

Create a DummyContext instance and pass it wherever a TweetsByIdContext instance is expected. The only difference is that whenever the runner tries to set the cursor attribute, the __setattr__ method silently replaces the value with None.

Example Usage:

stweet.TweetsByIdRunner(
    tweets_by_id_task=task,
    raw_data_outputs=[output],
    tweets_by_ids_context=DummyContext(),
).run()
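The __setattr__ trick can be checked in isolation, without stweet installed. In this sketch BaseContext is a hypothetical stand-in for TweetsByIdContext, and its two attributes are assumptions chosen for the demo rather than the library's real fields.

```python
from typing import Any, Optional


class BaseContext:
    """Hypothetical stand-in for TweetsByIdContext, used only for this demo."""

    def __init__(self) -> None:
        self.cursor: Optional[str] = None
        self.requests_count = 0


class DummyContext(BaseContext):
    def __setattr__(self, name: str, value: Any) -> None:
        # Discard any value assigned to "cursor" and store None instead.
        if name == "cursor":
            value = None
        super().__setattr__(name, value)


ctx = DummyContext()
ctx.cursor = "next-page-token"  # what a runner would do after a request
ctx.requests_count += 1         # other attributes behave normally

print(ctx.cursor, ctx.requests_count)
```

Only assignments to cursor are intercepted, so counters and any other state the runner keeps on the context are stored as usual.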

ataniz commented May 16, 2023

Many thanks @junyilou ! I have had success with the strategy you provided. Results still need filtering as you have foreseen, but speed is unharmed.
