New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Basic public suffix list support. #1082
Conversation
Codecov ReportAttention: Patch coverage is
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## main #1082 +/- ##
==========================================
- Coverage 93.92% 93.88% -0.05%
==========================================
Files 143 144 +1
Lines 13435 13527 +92
Branches 2606 2627 +21
==========================================
+ Hits 12619 12700 +81
- Misses 480 488 +8
- Partials 336 339 +3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Hello! Out of curiosity, why this needs to be in dnspython? I mean, there are at least two existing libraries for Python: I wonder if there is a way improve dnspython interoperability with these? Or what is even the problem? |
I don't know that it "needs" to be in dnspython, but it goes along well with other things dnspython provides. Of the examples you cite, one hasn't been updated since 2019, and both of them use split(".") to convert a name into labels. This branch is based on code I've used for several years, and operates on dns.name.Name objects. |
I agree that split(".") is not how things should be done, definitely not. I'm just trying to find out if there is a good way to improve things for everyone and not just dnspython users, that's all. |
I did a DNS-based implementation of the PSL as a Python module (https://github.com/sse-secure-systems/psl-dns). Docs and demo are at https://publicsuffix.zone/. This is publicly queryable, i.e. doesn't require applications to bring the PSL along. Not sure if it aligns with what you have in mind for dnspython's PSL support, but using it might avoid duplicate work and in particular update problems. |
I think your solution is probably better if you don't have to make too many queries and don't mind the information leak, as it would not have up-to-date problems. I had automatic updating in an earlier draft, but I got rid of it. For my typical PSL use cases, I would probably be making a rude number of queries and/or not want the information leak, so I think there's still value in this code too. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with the other comments, in that I'm not sure that this needs to be in dnspython. There are no users of it here, and I'm not sure what any in-tree users would be doing.
That said, I don't object to it being added if there are reasons. I think the implementation looks fine, and only have a few minor comments.
if not _have_httpx: | ||
raise ValueError( | ||
"download_if_needed is True but httpx is not available" | ||
) | ||
if not os.path.isfile(filename): | ||
response = httpx.request("GET", url) | ||
if response.status_code == 200: | ||
with open(filename, "w") as f: | ||
f.write(response.text) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there some reason this couldn't use urllib.request()
, or is it doing something that requires httpx
?
node = _ExceptionNode(n) | ||
else: | ||
node = _ExactNode(n) | ||
if self.suffixes.has_key(n): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably should be n in self.suffixes
, as has_key
usage has been deprecated for years (assuming that works; NameDict
probably should be defining __contains__
, but MutableMapping
does).
Also, is it possible that some alternate PSL might contain the root domain, which would conflict with the automatically added root domain earlier?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The root is not a public suffix, and the PSL algorithm stops at the top level (with all TLDs considered public suffixes).
My interpretation of that is that the root is outside of the reach of the PSL algorithm, and its appearance in this PR's code is just a data structure convenience.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that depends on whether the intent here is to support loading the official public suffix list, or other lists which are compatible. If the latter, there is a possibility that it could contain the root, as I don't see anything in the spec that explicitly forbids it.
I don't see anything in the linked algorithm that prevents there being a rule at the root. Also, according to the algorithm:
If no rules match, the prevailing rule is "*".
So, maybe the PR should be adding a default wildcard rule.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the latter, there is a possibility that it could contain the root, as I don't see anything in the spec that explicitly forbids it.
But there's no point to that, because all TLDs are public suffixes by definition (see your other quote above), so you can never climb up to the root as a TLD public suffix will always be in the way.
The code here handles that with the allow_unlisted_gtlds
parameter which defaults to True
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure what you mean. If there's a rule at .
, then there will never be the case that no rules match, because that rule will always match at the end. So, the text I quoted above would be irrelevant in this case.
I'm also not sure why any of this really matters. I simply suggested avoiding a misleading error. There's nothing in the spec that prevents a rule for .
, so having a rule for .
should not result in an error. Even if you think it should result in an error, it should not be a redefinition error, as it's not a redefinition.
Wondering out loud: Maybe best course of action - for the overall ecosystem - would be to add label-aware API to an existing library? I agree that |
I'll just set this one aside for now. |
No description provided.