Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[refactor] fix Profile object #167

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

yimingStar
Copy link

Uable to get the original page (400 error code), get mobile twitter page indead.

  1. Change the User-Agent
  2. Fix the selector queries (Missing getting birthday, favorite and is_verified)
    UnboundLocalError: local variable 'html' referenced before assignment #166

@yeachan153
Copy link

Hey, just passing and I don't really know the code enough to do a PR. Just to let you know I tried your branch and it did not work for me, although the issue is not related with the user agent I think.

image

self.location = None

# TODO unfixed
self.birthday = html.find(".ProfileHeaderCard-birthdateText")[0].text

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not too familiar with the code (and appreciate it's not the issue you are fixing), but I would assume to make the code work it would be better to put this in a try except IndexError block?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi, thank you for the response!
The error is an error due to the empty list, I have added an exception catch for all elements, please try again.
I did not use specific IndexError is because the original code did not specify Errors, so I think it will be better to use the same style.

Once again thank you for disclosing the problem!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    profile = Profile(f"{user_id}")
    print(profile.to_dict())

You could try using the code by the fixed version above, I use "BarackObama" as user_id to gets the output below:

  • output
{'name': 'Barack Obama', 'username': 'BarackObama', 'birthday': None, 'biography': 'Dad, husband, President, citizen.', 'location': 'Washington, DC', 'website': 'obama.org', 'profile_photo': 'https://pbs.twimg.com/profile_images/822547732376207360/5g0FC8XX_normal.jpg', 'banner_photo': None, 'likes_count': None, 'tweets_count': 15921, 'followers_count': 122418592, 'following_count': 601466, 'is_verified': True, 'is_private': False, 'user_id': None}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New version is working! Thanks a lot!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yeachan153 Thank you too ~ could you please confirm the review request?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yeachan153 Thank you too ~ could you please confirm the review request?

Seems valid. Could you please merge it? It's been quite a while, everything is better than the current version that is not working with profiles at all.

Copy link

@yeachan153 yeachan153 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last time I checked, the code is working! That's really nice!

I am still a bit concerned about the bare try except blocks; while keeping in the style of the code can give unexpected results in the future. I.e. if the HTML changes, then user attributes can be assigned None without raising an explicit error and could lead to confusing situations and raise cases where mistakes are caught much later because we assume that everything is working as intended.

I am not sure what a perfect solution would be. Maybe errors should be logged even if the scraper continues, so we are aware what went wrong even if the final result is outputted? Just a suggestion.

But nice work on getting everything to work :-)

Copy link
Contributor

@aldnav aldnav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix @yimingStar

Seems that soon enough we'd be needing to delegate these properties to a general function because it follows a pattern (find from selector, get first, get text, return value else default to None), and any deviation from that would have to become it's own function.
Generally it's simple and works as it is right now.

I left a suggestion. Let me know what you think.

Comment on lines 124 to 136
stats_table = None
stats = None
try:
stats_table = html.find('table.profile-stats')[0]
stats = stats_table.find('td div.statnum')
if not stats:
self.tweets_count = None
self.following_count = None
self.followers_count = None
except:
self.tweets_count = None
self.following_count = None
self.followers_count = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
stats_table = None
stats = None
try:
stats_table = html.find('table.profile-stats')[0]
stats = stats_table.find('td div.statnum')
if not stats:
self.tweets_count = None
self.following_count = None
self.followers_count = None
except:
self.tweets_count = None
self.following_count = None
self.followers_count = None
try:
stats_table = html.find('table.profile-stats')[0]
stats = stats_table.find('td div.statnum')
except:
stats_table = None
stats = None

How about this?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aldnav Thank you for your suggestion. Your suggestion is definitely better!
I just make an adjustment in this refactor.
The fixed is in this comment f13ec7f

@yimingStar
Copy link
Author

yimingStar commented Sep 29, 2020

@yeachan153 Genuinely agree with your suggestion for the purpose of letting the user notice the missing web elements.
So I added the exception function into this Profile class, which will print out the missing attributes.

The fixed is in this commit c7bbf55

Copy link
Contributor

@aldnav aldnav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix works. Left a minor comment. Overall LGTM. Thanks @yimingStar 👍

@@ -39,7 +40,7 @@ def __init__(self, username):
page = session.get(f"https://twitter.com/{username}", headers=headers)
self.username = username
self.__parse_profile(page)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpicking, extra white spaces.

@@ -49,77 +50,128 @@ def __parse_profile(self, page):
)
except ParserError:
pass

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You shouldn't ignore a fatal error.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that was in the original package, so it's not related to this PR. Still, it should be ignored

@HJoentgen
Copy link

HJoentgen commented Mar 16, 2021

Is this solution still working?
When I use the new profile.py, I still get the same "local variable 'html' referenced before assignment" error, because of a 400 response.

So this:

from twitter_scraper import Profile
profile = Profile(f"BarackObama")
print(profile.to_dict())

leads to:

Unable to get is_private in html, exception - local variable 'html' referenced before assignment
...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants