Option to include all tags and attrs in LinkExtractor with specified exclusions #6321

Open
User087 opened this issue Apr 27, 2024 · 6 comments


User087 commented Apr 27, 2024

Summary

Add an option to the LinkExtractor class to consider all tags and attributes (e.g. passing None for tags or attrs would mean all tags or all attributes), plus deny_tags and deny_attrs arguments (or similar) so that you can consider all tags and attributes except those explicitly passed.

Motivation

This allows a strategy of extracting all links by default and then excluding the specific tags and attributes you don't want considered. Currently, it seems the user has to figure out every tag and attribute where their desired links appear and pass them explicitly to tags and attrs to have them considered.

Describe alternatives you've considered

To include all tags, you could use the Selector class instead of LinkExtractor and select, for example, every href attribute regardless of the tag it appears in: response.xpath('//@href'). However, using Selector means losing the various convenient arguments of LinkExtractor and processing the results manually with regexes instead. It also requires manually converting relative links into absolute links when you want regexes that match the entire URL, whereas LinkExtractor already handles that automatically.



PJ1256 commented May 1, 2024

I would like to try and work on this if that's ok.

PredictiveManish commented

I am trying to solve this issue.


parthvichare commented May 3, 2024

It's a great, challenging problem. I'd love to work on it.

Laerte closed this as completed May 3, 2024
Laerte reopened this May 3, 2024

Noman654 commented May 4, 2024

@Laerte, I'd like to tackle this issue! As it's my first contribution to the project, any pointers to get me started would be much appreciated.

Laerte (Member) commented May 4, 2024

Hi @Noman654, it seems that we already have an open PR for it:

Gallaecio (Member) commented

#6327 is about the first part, making None include all.

The deny part could be implemented separately by someone else, I think. There could be conflicts, but they should be easy to resolve. I do think a boolean reverse_filter parameter would be better than 2 new parameters to implement that behavior, though.
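One possible reading of the reverse_filter idea, sketched in plain Python rather than against Scrapy's code. The `make_matcher` helper and its exact semantics are hypothetical; nothing here is an existing Scrapy API, and the real implementation may differ.

```python
from typing import Callable, Iterable, Optional


def make_matcher(
    names: Optional[Iterable[str]], reverse_filter: bool = False
) -> Callable[[str], bool]:
    """Build a predicate for tag or attribute names (hypothetical helper).

    names=None           -> accept everything (the "None includes all" part).
    reverse_filter=False -> names is an allow list.
    reverse_filter=True  -> names is a deny list.
    """
    if names is None:
        return lambda name: True
    listed = frozenset(names)
    if reverse_filter:
        return lambda name: name not in listed
    return lambda name: name in listed


# Allow-list behavior, as tags=("a", "area") works today.
scan_tag = make_matcher(("a", "area"))
# Same argument reinterpreted as "everything except these".
scan_tag_reversed = make_matcher(("script", "style"), reverse_filter=True)

print(scan_tag("a"), scan_tag("img"))
print(scan_tag_reversed("img"), scan_tag_reversed("script"))
```

Under this reading a single boolean flips the meaning of the existing arguments, so no separate deny_tags and deny_attrs parameters are needed. The trade-off is that tags and attrs would be reversed together unless a flag were added for each.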
