-
Notifications
You must be signed in to change notification settings - Fork 10.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Option to include all tags and attrs in LinkExtractor with specified exclusions #6321
Comments
I would like to try and work on this if that's ok. |
I am trying to solve this issue. |
Its great challenging problem, I love to work on it |
@Laerte, I'd like to tackle this issue! As it's my first contribution to the project, any pointers to get me started would be much appreciated. |
Hi @Noman654 seems that we already have a open PR for it: |
#6327 is about the first part, making The deny part could be implemented separately by someone else, I think. There could be conflicts, but they should be easy to resolve. I do think a boolean |
Hi, I had a question and suggestion about this enhancement. Question: Suggestion:
Please let me know your thoughts on this. Or the thought process behind having all the options in a single class. @Gallaecio , the idea to keep a boolean |
I think the argument for complexity has merit, but I don’t think a class parallel to No strong opinion, though. As long as the user API is simple enough, I am open to any proposal. |
@Gallaecio Could you please expand on how the If we do not wish to create two separate classes then we would have to include two extra parameters deny_tags and deny_attrs as mentioned originally in the issue. We have different conditions while filtering,
Just trying to break down the problem and make it easier to navigate. |
I’m not sure. I just think there must be an alternative to having 2 separate link extractor classes based on whether the filtering is inclusive or exclusive.
That makes sense. But if we want maximum flexibility, we also need the ability to allow/deny specific combinations of tags / attributes. And it complicates things, with no real-life use case to justify that complexity. But I feel like I am rambling now. I have no concrete idea of how to best address the issue, to be honest. I’m more than happy to review any PR for it, though. |
Summary
Add the option to the LinkExtractor class to consider all tags and attributes (e.g. if you pass
None
then consider all tags/attributes), anddeny_tags
anddeny_attrs
arguments or similar so you can additionally consider all tags and attributes with the exception of those explicitly passed.Motivation
It allows adopting a strategy of extracting all links by default and then specifically excluding the tags and attributes you don't want considered. Currently, it seems the user has to figure out all the specific tags and attributes where they're desired links appear and explicitly pass them to
tags
andattrs
to have them considered.Describe alternatives you've considered
For including all tags, you could use the Selector class instead of LinkExtractor and select all e.g.
href
attributes regardless of which tag they appear in, e.g.response.xpath('//@href')
. Using Selector results in losing the various convenient arguments in LinkExtractor and requires manually processing them with regex etc instead, and it requires manually converting relative links into absolute links when you want to use regexes that match the entire URL whereas LinkExtractor already handles that automatically.Additional context
Any additional information about the feature request here.
The text was updated successfully, but these errors were encountered: