Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Option to include all tags and attrs in LinkExtractor with specified exclusions #6321

Open
User087 opened this issue Apr 27, 2024 · 10 comments

Comments

@User087
Copy link

User087 commented Apr 27, 2024

Summary

Add the option to the LinkExtractor class to consider all tags and attributes (e.g. if you pass None then consider all tags/attributes), and deny_tags and deny_attrs arguments or similar so you can additionally consider all tags and attributes with the exception of those explicitly passed.

Motivation

It allows adopting a strategy of extracting all links by default and then specifically excluding the tags and attributes you don't want considered. Currently, it seems the user has to figure out all the specific tags and attributes where they're desired links appear and explicitly pass them to tags and attrs to have them considered.

Describe alternatives you've considered

For including all tags, you could use the Selector class instead of LinkExtractor and select all e.g. href attributes regardless of which tag they appear in, e.g. response.xpath('//@href'). Using Selector results in losing the various convenient arguments in LinkExtractor and requires manually processing them with regex etc instead, and it requires manually converting relative links into absolute links when you want to use regexes that match the entire URL whereas LinkExtractor already handles that automatically.

Additional context

Any additional information about the feature request here.

@PJ1256
Copy link

PJ1256 commented May 1, 2024

I would like to try and work on this if that's ok.

@PredictiveManish
Copy link

I am trying to solve this issue.

@parthvichare
Copy link

parthvichare commented May 3, 2024

Its great challenging problem, I love to work on it

@Laerte Laerte closed this as completed May 3, 2024
@Laerte Laerte reopened this May 3, 2024
@Noman654
Copy link

Noman654 commented May 4, 2024

@Laerte, I'd like to tackle this issue! As it's my first contribution to the project, any pointers to get me started would be much appreciated.

@Laerte
Copy link
Member

Laerte commented May 4, 2024

Hi @Noman654 seems that we already have a open PR for it:

@Gallaecio
Copy link
Member

#6327 is about the first part, making None include all.

The deny part could be implemented separately by someone else, I think. There could be conflicts, but they should be easy to resolve. I do think a boolean reverse_filter parameter would be better than 2 new parameters to implement that behavior, though.

@ShubhGohil
Copy link

Hi, I had a question and suggestion about this enhancement.

Question:
Wouldn't making the LinkExtractor extract all the tags and attributes as its default behaviour, and at the same time having the options to select and omit specific tags and attributes make it a bit complex?

Suggestion:
Rather can we have separate classes for specific jobs.

  1. One class extracts all the links and attributes as its default behavior and gives the option to filter out the tags and attributes not required.
  2. Other class which accepts a list of tags and attributes to be extracted.

Please let me know your thoughts on this. Or the thought process behind having all the options in a single class.

@Gallaecio , the idea to keep a boolean reverse_filter would increase the conditional statements in the class but can be implemented.

@Gallaecio
Copy link
Member

I think the argument for complexity has merit, but I don’t think a class parallel to LinkExtractor would be the way to keep complexity manageable. Maybe we could create a LinkExtractorFilter class, allow passing as instance of that as the value of the filter parameter, and move all filtering-related complexity to that class.

No strong opinion, though. As long as the user API is simple enough, I am open to any proposal.

@ShubhGohil
Copy link

ShubhGohil commented Jun 7, 2024

@Gallaecio Could you please expand on how the LinkExtractorFilter class would work?

If we do not wish to create two separate classes then we would have to include two extra parameters deny_tags and deny_attrs as mentioned originally in the issue.

We have different conditions while filtering,

  1. if we want to include some tags and include some attributes.

The above condition is the current default behaviour

  1. if we want to extract all the tags and all the attributes
  2. if we want to exclude some tags and exclude some attribute.

The above two conditions can be infused by introducing a boolean filter parameter. Condition 2 would become the default behaviour in this case

  1. if we want to include some tags and exclude some attributes from within those elements
  2. if we want to exclude some tags and include some attributes from the elements extracted

The above two conditions would need the use of two different parameters deny_tags and deny_attrs

Just trying to break down the problem and make it easier to navigate.

@Gallaecio
Copy link
Member

Could you please expand on how the LinkExtractorFilter class would work?

I’m not sure. I just think there must be an alternative to having 2 separate link extractor classes based on whether the filtering is inclusive or exclusive.

if we want to include some tags and exclude some attributes from within those elements
if we want to exclude some tags and include some attributes from the elements extracted

That makes sense. But if we want maximum flexibility, we also need the ability to allow/deny specific combinations of tags / attributes. And it complicates things, with no real-life use case to justify that complexity.

But I feel like I am rambling now. I have no concrete idea of how to best address the issue, to be honest. I’m more than happy to review any PR for it, though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants