NUTCH-2522 Bidirectional URL exemption filter #290

Open
wants to merge 9 commits into
base: master

Conversation

Contributor

@okedoki okedoki commented Mar 6, 2018

No description provided.

Contributor

@sebastian-nagel sebastian-nagel left a comment

Clean PR with correct code format and documentation! Thanks, @okedoki!

Afaics, the implementation does the following:

  1. take the lowercased host part of both from and to URL
  2. match all regex rules defined in the rules files and remove the matched part
  3. finally, if from and to host are equal return true => URL is accepted ("exempted" from ignore external host exclusion)

Is this correct?
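
The three steps above can be sketched roughly as follows. This is only an illustration of the behaviour as described in this comment, not the code from the PR: the class name `HostExemptionSketch`, the helper `reduceHost`, and the example rule `^www\.` are all hypothetical.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical sketch of the described logic; not the plugin's actual code.
class HostExemptionSketch {

  /** Step 1 and 2: lowercase the host, then remove every part matched by a rule. */
  public static String reduceHost(String url, List<Pattern> rules) {
    String host = URI.create(url).getHost();
    if (host == null) {
      return null; // no host part, nothing to compare
    }
    host = host.toLowerCase(Locale.ROOT);
    for (Pattern rule : rules) {
      host = rule.matcher(host).replaceAll("");
    }
    return host;
  }

  /** Step 3: the link is exempted if both reduced hosts are equal. */
  public static boolean isExempted(String fromUrl, String toUrl, List<Pattern> rules) {
    String from = reduceHost(fromUrl, rules);
    String to = reduceHost(toUrl, rules);
    return from != null && from.equals(to);
  }

  public static void main(String[] args) {
    // Assumed example rule: strip a leading "www." (dot escaped, anchored).
    List<Pattern> rules = Arrays.asList(Pattern.compile("^www\\."));
    System.out.println(isExempted("http://www.example.com/a", "http://example.com/b", rules)); // true
    System.out.println(isExempted("http://example.com/a", "http://example.org/b", rules));     // false
  }
}
```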

Wouldn't a different rule file format be more suitable?

  • the leading +/- is not used
  • I don't know whether this makes sense, but the rules could also define the replacement string, possibly including references to captured groups; cf. the file format used by urlnormalizer-regex
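
To illustrate the second point: a rule could pair a pattern with a replacement string that refers to a captured group. The sketch below assumes such pattern/replacement pairs; the class `SubstitutionRuleSketch`, the method `apply`, and the example rule are hypothetical and only show how `$1` would be used with `java.util.regex`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical sketch: rules as (pattern, replacement) pairs applied in order.
class SubstitutionRuleSketch {

  /** Applies every rule in insertion order to the given host. */
  public static String apply(String host, Map<Pattern, String> rules) {
    for (Map.Entry<Pattern, String> rule : rules.entrySet()) {
      host = rule.getKey().matcher(host).replaceAll(rule.getValue());
    }
    return host;
  }

  /** Assumed example rule: drop a "www"/"m" prefix with optional digits, keep group 1. */
  public static Map<Pattern, String> exampleRules() {
    Map<Pattern, String> rules = new LinkedHashMap<>();
    rules.put(Pattern.compile("^(?:www|m)\\d*\\.(.+)$"), "$1");
    return rules;
  }

  public static void main(String[] args) {
    Map<Pattern, String> rules = exampleRules();
    System.out.println(apply("www2.example.com", rules)); // example.com
    System.out.println(apply("m.example.com", rules));    // example.com
    System.out.println(apply("mail.example.com", rules)); // mail.example.com (no match)
  }
}
```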

@@ -0,0 +1,33 @@
# Licensed to the Apache Software Foundation (ASF) under one or more

Configuration files should be added as *.template; they are "instantiated" (copied) during the first compilation. Users can then modify the content without conflicts or undesired overwrites.


# Example 1:
#----------
# To exempt urls ending with image extensions, uncomment the below line

Description does not fit the following line/rule.

# Format :
#--------
# The format is same same as `regex-urlfilter.txt`.
# Each non-comment, non-blank line contains a regular expression

The description does not match the implementation.

# Example 1:
#----------
# To exempt urls ending with image extensions, uncomment the below line
-(www.)

Why does the rule start with +/-? The dot is not escaped, so the rule would also apply to wwwfinder, matching wwwf.
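
A small check of the unescaped-dot problem, using the regex part of the rule (without the leading sign). The class `UnescapedDotSketch` and the method `strip` are hypothetical, and the escaped, anchored pattern `^www\.` is only one possible fix, not necessarily the one the plugin should adopt.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: compare the unescaped rule with an escaped, anchored one.
class UnescapedDotSketch {

  /** Removes every match of the regex from the host. */
  public static String strip(String regex, String host) {
    return Pattern.compile(regex).matcher(host).replaceAll("");
  }

  public static void main(String[] args) {
    // Unescaped dot matches any character, so "wwwf" is removed.
    System.out.println(strip("(www.)", "wwwfinder.example.com"));   // inder.example.com
    // Escaped dot and start anchor leave the host untouched.
    System.out.println(strip("^www\\.", "wwwfinder.example.com"));  // wwwfinder.example.com
    System.out.println(strip("^www\\.", "www.example.com"));        // example.com
  }
}
```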

@okedoki
Contributor Author

okedoki commented Mar 16, 2018

@sebastian-nagel
Thanks for the suggestion to use urlnormalizer-regex. I rewrote the plugin based on this approach (it now makes sense to refactor urlnormalizer-regex and this plugin to share the same code base).

The usage you describe is correct: at the moment we apply the same regex to both the input and the output URL and check whether the results match each other.

In the future this can be improved by using two separate regexes for input and output.

@lewismc lewismc changed the title from "NUTCH-2522" to "NUTCH-2522 Bidirectional URL exemption filter" on Jan 14, 2021