Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add headers compliant with Wikimedia's user-agent policy #44

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

marccarre
Copy link

Fixes #43.

Changelog

  1. Add optional user_agent argument to Client's constructor.
  2. If not provided, default user_agent to a value compliant with Wikimedia's user-agent policy.
  3. Use user_agent as the value for the User-Agent HTTP header.

Copy link
Owner

@dahlia dahlia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution! I left my thoughts on your parch. Could you adjust few lines?

wikidata/client.py Outdated Show resolved Hide resolved
@@ -114,6 +117,13 @@ def __init__(self,
pass
opener = urllib.request._opener # type: ignore
assert isinstance(opener, urllib.request.OpenerDirector)
if not user_agent:
Copy link
Owner

@dahlia dahlia Jul 18, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Client() constructor already takes opener: Optional[urllib.request.OpenerDirector], and OpenerDirector allows to override its default User-Agent header. According to the docs:

OpenerDirector automatically adds a User-Agent header to every Request. To change this:

import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
opener.open('http://www.example.com/')

Therefore, IMHO it'd be better to remove user_agent parameter (as a user can customize its User-Agent by passing their own instance of OpenerDirector1) and instead check if it's using the default opener or a passed custom opener:

Suggested change
if not user_agent:
if self._using_default_opener:

Footnotes

  1. Of course, such usage might be difficult to guess. Might be good to have few lines of example code in the docs.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you'd rather go with this approach, then feel free to close this PR as I'm not sure where you'd like this documented. 🙏🏻

My view on this is that even if it was left to the end-user to configure their own OpenerDirector, (1) this feels like the library doesn't help do things the "right way" for wikidata.org nor the end-user, and (2) this feels like you'd at least want to validate the configured User-Agent against some regex and otherwise raise a ValueError or so.
But it is only my view. 🙂

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this feels like the library doesn't help do things the "right way" for wikidata.org nor the end-user

Fair enough, but it seems to make no difference to me if Client() constructor has user_agent parameter, as it's just a free-form text to the end-user. Rather, my intention was to incline end-users to stick with the provided default User-Agent string which complies with Wikidata's official guidance.

Please let me know if I misread your point. 😅

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point is that if users need to create a custom OpenerDirectory to set a compliant header, it is good that the wikidata library allows this custom object to be injected 👍🏻 , but it could be even better (it could improve usability) if it was providing a compliant default OpenerDirectory/header in first place.

If by:

the provided default User-Agent

you mean the default from urllib, i.e. [('User-agent', 'Python-urllib/3.9')], then it does not comply to Wikidata's official guidance, which is why there is issue #43.

If by:

the provided default User-Agent

you mean some default that the wikidata library (this repo's library) would provide, then it looks like we both share the intent of helping end-users to use a header which complies with Wikidata's policy, but I haven't understood how you'd like to provide this default to end-users in practice. 😅

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean the latter.

End-users would probably be unaware of Wikidata's official guidance. Therefore, even if Client() has an option to customize their User-Agent, their User-Agent strings are still prone to ignore Wikidata's policy.

IMHO it's better not to provide any easier way to customize User-Agent header, and instead to stir up end-users to stick with our predefined User-Agent string which complies with Wikidata's policy. I guess the most of end-users have no demand to customize the header.

Copy link
Author

@marccarre marccarre Jul 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I agree with you! 💯 👍🏻
This is why, in this PR, I chose to default User-Agent to:

f'wikidata-based-bot/{__version__} (https://github.com/dahlia/wikidata) python/{python_version}'

unless specified otherwise by the end-user, i.e.:

    def __init__(self,
                # [...]
                 user_agent: Optional[str] = None) -> None:
        # [...]
        if not user_agent:
            python_version = f'{sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}'
            user_agent = f'wikidata-based-bot/{__version__} (https://github.com/dahlia/wikidata) python/{python_version}'
        # See also: https://meta.wikimedia.org/wiki/User-Agent_policy
        opener.addheaders = {
            'User-Agent': user_agent,
        }

This way, by default, the provided User-Agent is always compliant with Wikidata's policy.
Are you happy with this approach? 🤔

  1. If so, is there anything else you'd like in this PR?

  2. If not, how would you rather approach it? Like this? (I.e. no way to provide user_agent?)

    def __init__(self,
                 # [...]
                 repr_string: Optional[str] = None) -> None:
        # [...]
        python_version = f'{sys.version_info.major}.{sys.version_info.minor}.{sys.version_info.micro}'
        user_agent = f'wikidata-based-bot/{__version__} (https://github.com/dahlia/wikidata) python/{python_version}'
        # See also: https://meta.wikimedia.org/wiki/User-Agent_policy
        opener.addheaders = {
            'User-Agent': user_agent,
        }

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the second approach! IMHO we don't need to provide such customization. If end-users request such feature in the future, I'm going to consider to provide it by that time. 🤔

Prior to this commit, when sending several requests in a row,
some requests could fail with `429/Too many requests` as the client
was not providing an `User-Agent` header compliant with Wikimedia's
user-agent policy.

See also: https://meta.wikimedia.org/wiki/User-Agent_policy
@marccarre marccarre force-pushed the issues/43-compliant-user-agent-policy branch from c803821 to 79dde4b Compare July 22, 2022 10:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Empty response (JSONDecodeError) when sending many requests in a row
2 participants