NUTCH-1806 Delegate processing of URL domains to crawler-commons #816

Status: Open (4 commits to be merged into base: master)

sebastian-nagel
Contributor

and NUTCH-1942 Remove TopLevelDomain

  • use methods from crawler-commons' EffectiveTldFinder in URLUtil, replacing classes and methods from the "org.apache.nutch.util.domain" package

  • adapt and extend unit tests

    • add tests for URLUtil.getTopLevelDomainName(url)
    • reflect changes to the public suffix list since 2014 ("xyz" is now a public suffix / ICANN suffix)
    • adapt to minor API changes
      • URLUtil.getDomainName(url) returns the host name in case no valid public suffix is found
      • for Unicode suffixes and TLDs, the methods URLUtil.getDomainSuffix(url) and URLUtil.getTopLevelDomainName(url), respectively, now return the ASCII representation
    • add unit tests for host names with trailing dot ("www.apache.org.")
    • add unit test for URLs without host/domain (cf. NUTCH-2450)
  • update and complete Javadoc

  • update DomainStatistics, TLDIndexingFilter and domain URL filters to use the updated methods in URLUtil

  • remove the class TLDScoringFilter. Its configuration is bound to domain-suffixes.xml, which was no longer maintained and is now removed

  • remove package org.apache.nutch.util.domain

  • move DomainStatistics to org.apache.nutch.util

  • remove configuration files of domain utils
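
The edge cases named in the list above (ASCII representation of Unicode suffixes, host names with a trailing dot, URLs without a host) can be illustrated with the JDK alone. This is a minimal sketch, not the code from this PR: the class `HostHandlingSketch` and its helper methods are hypothetical names, and it uses only `java.net.IDN` and `java.net.URI` rather than URLUtil or crawler-commons.

```java
import java.net.IDN;
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical illustration class, not part of Nutch or crawler-commons.
class HostHandlingSketch {

  /** Converts a Unicode suffix or TLD to its ASCII (Punycode) form. */
  static String toAsciiSuffix(String suffix) {
    return IDN.toASCII(suffix);
  }

  /** Drops a single trailing dot, e.g. "www.apache.org." becomes "www.apache.org". */
  static String stripTrailingDot(String host) {
    return host.endsWith(".") ? host.substring(0, host.length() - 1) : host;
  }

  /** Returns the host of a URL, or null if the URL has no host or cannot be parsed. */
  static String hostOrNull(String url) {
    try {
      return new URI(url).getHost();
    } catch (URISyntaxException e) {
      return null;
    }
  }

  public static void main(String[] args) {
    // "\u0440\u0444" is the Cyrillic TLD of Russia, written with escapes
    // so the source file stays pure ASCII.
    System.out.println(toAsciiSuffix("\u0440\u0444"));

    String host = hostOrNull("https://www.apache.org./index.html");
    System.out.println(stripTrailingDot(host));

    // A file: URL carries no host component.
    System.out.println(hostOrNull("file:/path/file.html"));
  }
}
```

Whether URLUtil itself normalizes a trailing dot before or after the suffix lookup is not stated in this description; the sketch only shows what such inputs look like at the JDK level.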

- add unit test for URLs without host/domain (cf. NUTCH-2450)
- add unit tests for host names with trailing dot ("www.apache.org.")
- use methods from crawler-commons' EffectiveTldFinder in URLUtil,
  replacing classes and methods from the org.apache.nutch.util.domain
  package
- adapt and extend unit tests
  - add tests for URLUtil.getTopLevelDomainName(url)
  - changes to the public suffix list since 2014
    ("xyz" is now a public suffix / ICANN suffix)
  - minor API changes
    - URLUtil.getDomainName(url) returns the host name
      in case no valid public suffix is found
    - for Unicode suffixes and TLDs, the methods
      URLUtil.getDomainSuffix(url) and
      URLUtil.getTopLevelDomainName(url), respectively, now return
      the ASCII representation
- complete Javadoc
NUTCH-1942 Remove TopLevelDomain
- update DomainStatistics, TLDIndexingFilter and domain URL filters
  to use the updated methods in URLUtil
- remove TLDScoringFilter
- remove package org.apache.nutch.util.domain
- move DomainStatistics to org.apache.nutch.util
- remove configuration files of domain utils
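
The fallback noted in the commit messages above (URLUtil.getDomainName(url) returns the host name when no valid public suffix is found) can be sketched with a toy suffix lookup. Everything here is an assumption for illustration: the class `DomainFallbackSketch`, the method `domainName`, and the four-entry suffix set are hypothetical stand-ins, and the sketch ignores the wildcard and exception rules of the real public suffix list that crawler-commons' EffectiveTldFinder handles.

```java
import java.util.Arrays;
import java.util.Set;

// Hypothetical illustration class, not the implementation in URLUtil.
class DomainFallbackSketch {

  // Tiny stand-in for the public suffix list (assumption for this sketch).
  static final Set<String> SUFFIXES = Set.of("org", "com", "co.uk", "xyz");

  /**
   * Returns the registrable domain (one label plus the longest matching
   * suffix), or the unchanged host name if no suffix matches.
   */
  static String domainName(String host) {
    String[] labels = host.split("\\.");
    // Try the longest candidate suffix first, then shorter ones.
    for (int i = 1; i < labels.length; i++) {
      String candidate =
          String.join(".", Arrays.copyOfRange(labels, i, labels.length));
      if (SUFFIXES.contains(candidate)) {
        return labels[i - 1] + "." + candidate;
      }
    }
    // No valid suffix found: fall back to the host name itself.
    return host;
  }

  public static void main(String[] args) {
    System.out.println(domainName("www.apache.org"));
    System.out.println(domainName("www.example.co.uk"));
    System.out.println(domainName("localhost"));
  }
}
```

Returning the host name instead of null means callers that group or filter by domain still get a usable key for hosts such as `localhost` or internal names without a registered suffix.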