Option to include all tags and attrs in LinkExtractor with specified exclusions #6321

User087 · 2024-04-27T12:09:30Z

Summary

Add an option to the LinkExtractor class to consider all tags and attributes (e.g. passing None would mean all tags/attributes), plus deny_tags and deny_attrs arguments or similar, so that you can consider all tags and attributes except those explicitly passed.
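To make the requested semantics concrete, here is a minimal standard-library sketch. It is not Scrapy code: `extract_links` and its `tags`, `attrs`, `deny_tags` and `deny_attrs` parameters are hypothetical names that only mirror the proposal, and treating `None` as "consider everything" is the assumption being illustrated.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _AttrCollector(HTMLParser):
    """Collects (tag, attribute, value) triples from every start tag."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is not None:
                self.found.append((tag, name, value))


def extract_links(html, base_url, tags=None, attrs=None,
                  deny_tags=(), deny_attrs=()):
    """Hypothetical semantics: None means "consider all tags/attributes"."""
    collector = _AttrCollector()
    collector.feed(html)
    links = []
    for tag, name, value in collector.found:
        # Allow lists only apply when they are not None.
        if tags is not None and tag not in tags:
            continue
        if attrs is not None and name not in attrs:
            continue
        # Deny lists are applied on top of whatever is allowed.
        if tag in deny_tags or name in deny_attrs:
            continue
        links.append(urljoin(base_url, value))
    return links


html = (
    '<a href="/a">A</a><link href="/style.css">'
    '<img src="/logo.png"><div data-url="/d"></div>'
)
# Every tag is considered, but only href attributes, and <link> is excluded.
print(extract_links(html, "https://example.com/",
                    tags=None, attrs=("href",), deny_tags=("link",)))
```

Note that in this sketch `attrs=None` treats every attribute value as a link, so in practice it would probably be combined with `deny_attrs` or an explicit `attrs` tuple.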

Motivation

It allows adopting a strategy of extracting all links by default and then specifically excluding the tags and attributes you don't want considered. Currently, it seems the user has to figure out all the specific tags and attributes where their desired links appear and explicitly pass them to tags and attrs to have them considered.

Describe alternatives you've considered

For including all tags, you could use the Selector class instead of LinkExtractor and select, for example, all href attributes regardless of which tag they appear in: response.xpath('//@href'). However, using Selector means losing the various convenient arguments of LinkExtractor and processing the values manually with regexes etc. instead. It also requires manually converting relative links into absolute links when you want to use regexes that match the entire URL, whereas LinkExtractor already handles that automatically.
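The manual work this workaround involves looks roughly like the sketch below. It uses only the standard library as a stand-in for the Selector call (in a Scrapy callback the equivalent steps would be `response.xpath('//@href').getall()` followed by `response.urljoin(...)`); `HrefCollector` and `hrefs_matching` are illustrative names, not part of any library.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collects every href value, whatever tag it appears on."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def hrefs_matching(html, base_url, pattern):
    """Return absolute URLs from all href attributes that match `pattern`."""
    collector = HrefCollector()
    collector.feed(html)
    # Relative links must be made absolute by hand before the regex,
    # since the pattern is meant to match the entire URL.
    absolute = [urljoin(base_url, href) for href in collector.hrefs]
    regex = re.compile(pattern)
    return [url for url in absolute if regex.search(url)]


html = (
    '<a href="docs/intro.html">Intro</a>'
    '<link href="/static/site.css">'
    '<area href="https://other.example/docs/">'
)
print(hrefs_matching(html, "https://example.com/guide/",
                     r"^https://example\.com/.*\.html$"))
```

Everything after the collection step (joining, regex filtering, and anything like deduplication or domain filtering) is what LinkExtractor arguments would otherwise provide.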

Additional context

Any additional information about the feature request here.


PJ1256 · 2024-05-01T18:49:56Z

I would like to try and work on this if that's ok.

PredictiveManish · 2024-05-03T06:22:34Z

I am trying to solve this issue.

parthvichare · 2024-05-03T13:55:42Z

It's a great, challenging problem. I'd love to work on it.

Noman654 · 2024-05-04T18:39:30Z

@Laerte, I'd like to tackle this issue! As it's my first contribution to the project, any pointers to get me started would be much appreciated.

Laerte · 2024-05-04T19:48:05Z

Hi @Noman654, it seems that we already have an open PR for it:

Issue #6321: Link extractor all tags and attributes option #6327

Gallaecio · 2024-05-06T09:47:13Z

#6327 is about the first part, making None include all.

The deny part could be implemented separately by someone else, I think. There could be conflicts, but they should be easy to resolve. I do think a boolean reverse_filter parameter would be better than 2 new parameters to implement that behavior, though.
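One way to read the reverse_filter idea is as a single flag that flips the existing tags/attrs tuples from an allow list into a deny list. The sketch below is only an interpretation: `is_considered` is a hypothetical helper, and having `None` always mean "no filtering" regardless of the flag is an assumption, not something settled in this thread.

```python
def is_considered(name, names, reverse_filter=False):
    """Decide whether a tag or attribute name passes the filter.

    names=None means no filtering at all (assumption).
    reverse_filter=False treats `names` as an allow list;
    reverse_filter=True treats the same tuple as a deny list.
    """
    if names is None:
        return True
    return (name in names) != reverse_filter


# Allow-list behaviour: only the listed tags are considered.
print(is_considered("a", ("a", "area")))
print(is_considered("link", ("a", "area")))

# Deny-list behaviour: everything except the listed tags is considered.
print(is_considered("link", ("link",), reverse_filter=True))
print(is_considered("img", ("link",), reverse_filter=True))
```

Compared with separate deny_tags and deny_attrs arguments, a single flag keeps the signature smaller, but it cannot express an allow list for tags together with a deny list for attributes.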

Gallaecio added enhancement good first issue labels Apr 29, 2024

PJ1256 mentioned this issue May 2, 2024

Issue #6321: Link extractor all tags and attributes option #6327

Open

Laerte closed this as completed May 3, 2024

Laerte reopened this May 3, 2024
