Content of tweet includes non written mentions #992

Open
enzoferey opened this issue Jun 30, 2023 · 4 comments
Labels
bug Something isn't working module:twitter

enzoferey commented Jun 30, 2023

Describe the bug

When scraping the following tweet, the content returned starts with "@GitHubCopilot @tabnine @Replit @vercel Have you tried them ?" instead of just "Have you tried them ?" as expected.

How to reproduce

Use the TwitterTweetScraper and pass the tweet id 1674020720458776576.

Expected behaviour

There should be no non-written mentions at the beginning of the content.

Screenshots and recordings

No response

Operating system

macOS 13.4.1

Python version: output of python3 --version

3.9

snscrape version: output of snscrape --version

0.7.0.20230622

Scraper

TwitterTweetScraper

How are you using snscrape?

Module (import snscrape.modules.something in Python code)

Backtrace

No response

Log output

No response

Dump of locals

No response

Additional context

No response

enzoferey added the bug label Jun 30, 2023

JustAnotherArchivist (Owner) commented

These mentions are technically part of the tweet text. This is exactly what Twitter returns:

...['tweet_results']['result']['legacy']['full_text'] = '@GitHubCopilot @tabnine @Replit @vercel Have you tried them ? What’s your opinion ? We read you 👀'

There is however also a display_text_range field. That should probably be taken into account for the renderedContent.
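As a rough illustration of how display_text_range could be applied, here is a minimal sketch. It assumes the field is a [start, end] pair of code point indices into full_text, sitting next to it in the legacy object. The helper name visible_text is hypothetical, and the range value below is made up for illustration, since the actual value for this tweet is not shown in this thread.

```python
def visible_text(legacy):
    """Return the part of full_text that display_text_range marks as displayed."""
    text = legacy["full_text"]
    # Fall back to the whole text when the range is missing.
    start, end = legacy.get("display_text_range", (0, len(text)))
    # Python str indexes by code point, assumed here to match the range's units.
    return text[start:end]


full_text = (
    "@GitHubCopilot @tabnine @Replit @vercel "
    "Have you tried them ? What’s your opinion ? We read you 👀"
)
legacy = {
    "full_text": full_text,
    # Hypothetical range: starts right after the leading mentions.
    "display_text_range": [full_text.index("Have"), len(full_text)],
}

print(visible_text(legacy))
```

With a range like this, the leading mentions are dropped while the rest of the text is left untouched.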

enzoferey commented Jun 30, 2023

Thanks for pointing it out @JustAnotherArchivist 🙏🏻

I did not realize that all accounts mentioned in a tweet are internally included in the text of its replies (since you get notified about replies, it makes sense 😄).

This might be a good opportunity for me to ask as well about the differences between content, renderedContent, and rawContent?

JustAnotherArchivist (Owner) commented

Forget that content exists; it's a deprecated alias from the early days that will be removed eventually. (It emits a warning if you try to use it.)

rawContent is the exact tweet text Twitter returns, while renderedContent is (roughly) the text as it would be rendered on Twitter's web interface. The only difference there currently is the replacement of links, so it doesn't exactly match. For example, replies start with a mention of the replied-to user, which gets rendered separately on the web interface.
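To make the link replacement concrete, here is a simplified sketch of the idea, not snscrape's actual implementation. The Link class, its field names, the render function, and the sample URLs are all hypothetical stand-ins: the shortened URL found in the raw text is swapped for the text that would be displayed.

```python
from dataclasses import dataclass


@dataclass
class Link:
    # Hypothetical stand-in for a link entity; not snscrape's actual class.
    tcourl: str  # shortened URL as it appears in the raw text
    text: str    # display text that would be shown instead


def render(raw, links):
    """Replace each shortened URL in the raw text with its display text."""
    rendered = raw
    for link in links:
        rendered = rendered.replace(link.tcourl, link.text)
    return rendered


raw = "Docs are here: https://t.co/abc123"
links = [Link(tcourl="https://t.co/abc123", text="example.com/docs")]

print(render(raw, links))
```

Note that nothing in this sketch touches leading mentions, which is why a reply's rendered text would still start with them.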

enzoferey (Author) commented

By link replacement, you mean the https://t.co ones appearing instead of the originals, right? I’m using Puppeteer to navigate those and get the actual URLs.

So as far as I understand, I should be using renderedContent, and there needs to be a fix so that it does not include mentions on replies. Is this right?
