New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Option to include all tags and attrs in LinkExtractor with specified exclusions #6321
Comments
I would like to try and work on this if that's ok. |
I am trying to solve this issue. |
Its great challenging problem, I love to work on it |
@Laerte, I'd like to tackle this issue! As it's my first contribution to the project, any pointers to get me started would be much appreciated. |
Hi @Noman654 seems that we already have a open PR for it: |
#6327 is about the first part, making The deny part could be implemented separately by someone else, I think. There could be conflicts, but they should be easy to resolve. I do think a boolean |
Summary
Add the option to the LinkExtractor class to consider all tags and attributes (e.g. if you pass
None
then consider all tags/attributes), anddeny_tags
anddeny_attrs
arguments or similar so you can additionally consider all tags and attributes with the exception of those explicitly passed.Motivation
It allows adopting a strategy of extracting all links by default and then specifically excluding the tags and attributes you don't want considered. Currently, it seems the user has to figure out all the specific tags and attributes where they're desired links appear and explicitly pass them to
tags
andattrs
to have them considered.Describe alternatives you've considered
For including all tags, you could use the Selector class instead of LinkExtractor and select all e.g.
href
attributes regardless of which tag they appear in, e.g.response.xpath('//@href')
. Using Selector results in losing the various convenient arguments in LinkExtractor and requires manually processing them with regex etc instead, and it requires manually converting relative links into absolute links when you want to use regexes that match the entire URL whereas LinkExtractor already handles that automatically.Additional context
Any additional information about the feature request here.
The text was updated successfully, but these errors were encountered: