Releases: crawler-commons/crawler-commons
crawler-commons-1.4
Important Changes
- Java 11 is now required to run or build crawler-commons
- the robots.txt parser (SimpleRobotRulesParser) is now compliant with RFC 9309
Full List of Changes
- [Robots.txt] Implement Robots Exclusion Protocol (REP) IETF Draft: port unit tests (sebastian-nagel, Richard Zowalla) #245, #360
- [Robots.txt] Close groups of rules as defined in RFC 9309 (kkrugler, garyillyes, jnioche, sebastian-nagel) #114, #390, #430
- [Robots.txt] Empty disallow statement not to clear other rules (sebastian-nagel, jnioche) #422, #424
- [Robots.txt] SimpleRobotRulesParser main() to follow five redirects (sebastian-nagel, jnioche) #428
- [Robots.txt] Add more spelling variants and typos of robots.txt directives (sebastian-nagel, jnioche) #425
- [Robots.txt] Document effect of rules merging in combination with multiple agent names (sebastian-nagel, Richard Zowalla) #423, #426
- [Robots.txt] Pass empty collection of agent names to select rules for any robot (wildcard user-agent name) (sebastian-nagel, Richard Zowalla) #427
- [Robots.txt] Rename default user-agent / robot name in unit tests (sebastian-nagel, Richard Zowalla) #429
- [Robots.txt] Add unit tests based on examples in RFC 9309 (sebastian-nagel, Richard Zowalla) #420
- [BasicNormalizer] Query parameters normalization in BasicURLNormalizer (aecio, sebastian-nagel, Richard Zowalla) #308, #421
- [Robots.txt] Deduplicate robots rules before matching (sebastian-nagel, jnioche) #416
- [Robots.txt] SimpleRobotRulesParser main to use the new API method (sebastian-nagel, jnioche) #413
- Generate JaCoCo reports when testing (jnioche) #409, #412
- Push Code Coverage to Coveralls (Richard Zowalla, jnioche) #414
- [Robots.txt] Path analysis bug with URL decoding if allow/disallow path contains escaped wildcard characters (tkalistratov, sebastian-nagel, Richard Zowalla) #195, #408
- [Robots.txt] Handle allow/disallow directives containing unescaped Unicode characters (sebastian-nagel, Richard Zowalla, aecio) #389, #401
- [Robots.txt] Improve readability of robots.txt unit tests (sebastian-nagel, Richard Zowalla) #383
- Upgrade project to use Java 11 (Avi Hayun, Richard Zowalla, aecio, sebastian-nagel) #320, #376
- [Robots.txt] RFC compliance: matching user-agent names when selecting rule blocks (sebastian-nagel, Richard Zowalla) #362
- [Robots.txt] Matching user-agent names does not conform to robots.txt RFC (YossiTamari, sebastian-nagel) #192
- [Robots.txt] Improve robots check draft rfc compliance (Eduardo Jimenez) #351
- Upgrade dependencies (dependabot) #379, #384, #394, #399, #404, #419
- Upgrade Maven plugins (dependabot) #377, #381, #386, #396, #397, #398, #400, #402, #403, #405, #406, #407, #415, #418
- Javadoc: ensure Javascript search is working (sebastian-nagel, Richard Zowalla, aecio) #378, #380
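The RFC 9309 compliance noted for this release includes the rule-precedence convention: the most specific (longest) matching rule decides, and an allow rule wins a tie. The following is a minimal sketch of that convention only, not the SimpleRobotRulesParser code. The class `LongestMatchSketch` and its `Rule` type are hypothetical, and the sketch assumes plain ASCII path prefixes, ignoring wildcards (`*`, `$`) and percent-encoding.

```java
import java.util.List;

/** Illustrative sketch of RFC 9309 rule precedence; not the crawler-commons implementation. */
final class LongestMatchSketch {

    /** One allow/disallow line of a robots.txt group (hypothetical type). */
    static final class Rule {
        final boolean allow;
        final String pathPrefix;

        Rule(boolean allow, String pathPrefix) {
            this.allow = allow;
            this.pathPrefix = pathPrefix;
        }
    }

    /**
     * Longest matching prefix wins; on equal length, allow wins;
     * if no rule matches, the path is allowed.
     */
    static boolean isAllowed(List<Rule> rules, String path) {
        Rule best = null;
        for (Rule rule : rules) {
            // Assumption of this sketch: empty patterns match nothing.
            if (rule.pathPrefix.isEmpty() || !path.startsWith(rule.pathPrefix)) {
                continue;
            }
            boolean longer = best == null
                    || rule.pathPrefix.length() > best.pathPrefix.length();
            boolean tieWonByAllow = best != null
                    && rule.pathPrefix.length() == best.pathPrefix.length()
                    && rule.allow;
            if (longer || tieWonByAllow) {
                best = rule;
            }
        }
        return best == null || best.allow;
    }

    public static void main(String[] args) {
        List<Rule> rules = List.of(
                new Rule(false, "/shop/"),
                new Rule(true, "/shop/public/"));
        System.out.println(isAllowed(rules, "/shop/public/item")); // true
        System.out.println(isAllowed(rules, "/shop/cart"));        // false
    }
}
```

RFC 9309 measures rule length in octets. `String.length()` is only equivalent for ASCII paths, which is why the sketch restricts itself to them.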
crawler-commons-1.3
- [Sitemaps] Disable support for DTDs in XML sitemaps and feeds by default (Kenneth Wong) #371
- Migrate Continuous Integration from Travis to GitHub Actions (Valery Yatsynovich) #333
- Upgrade dependencies (dependabot, Richard Zowalla) #334, #339, #345, #346, #347, #350, #354, #361, #369
- Upgrade Maven plugins (dependabot, Richard Zowalla, sebastian-nagel) #328, #329, #330, #331, #335, #336, #337, #338, #340, #341, #343, #356, #363, #364, #366, #373, #374
- Update pom.xml to address Maven warnings and deprecations (sebastian-nagel, Richard Zowalla, Avi Hayun) #342
- Enable Dependabot (Valery Yatsynovich) #327
- Remove test dependency on mockito-core (Richard Zowalla) #367
- Drop provided dependency on servlet-api (Richard Zowalla) #368
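Disabling DTD support, as done for sitemaps and feeds in this release, is commonly achieved by rejecting any DOCTYPE declaration at the parser level. The sketch below shows one way to do that with the JDK's built-in SAX parser and the Xerces feature `disallow-doctype-decl`. It is an assumption about the general technique, not a copy of the SiteMapParser code, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Illustrative sketch: reject XML documents that contain a DOCTYPE declaration. */
final class NoDtdSitemapSketch {

    private static final String DISALLOW_DOCTYPE =
            "http://apache.org/xml/features/disallow-doctype-decl";

    /**
     * Returns true if the XML parses with DTDs disabled, false on any
     * parse error (including a DOCTYPE declaration being present).
     */
    static boolean parsesWithDtdsDisabled(String xml) {
        SAXParser parser;
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setFeature(DISALLOW_DOCTYPE, true);
            parser = factory.newSAXParser();
        } catch (ParserConfigurationException | SAXException e) {
            throw new IllegalStateException("Parser does not support the feature", e);
        }
        try {
            // DefaultHandler rethrows fatal errors as SAXParseException.
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String plain = "<urlset><url><loc>https://example.com/</loc></url></urlset>";
        String withDtd = "<!DOCTYPE urlset [<!ENTITY x \"y\">]>"
                + "<urlset><url><loc>&x;</loc></url></urlset>";
        System.out.println(parsesWithDtdsDisabled(plain));
        System.out.println(parsesWithDtdsDisabled(withDtd));
    }
}
```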
crawler-commons-1.2
- [Sitemaps] Avoid calling java.net.URL::equals in equals method of sitemaps and extensions (sebastian-nagel) #322
- [URLs] Provide a builder class to configure the URL normalizer (aecio) #321, #324
- [URLs] Make normalization of IDNs configurable (to ASCII or Unicode) via builder (aecio, sebastian-nagel) #324
- [Sitemaps] Fix XXE vulnerability in Sitemap parser (kovyrin) #323
- [URLs] Sorting the Query Parameters (aecio) #246, #309
- [URLs] Optionally remove common irrelevant query parameters (aecio) #309
- [Sitemaps] Allow normalizing URLs in sitemaps (murderinc, sebastian-nagel) #305
- Normalize CHANGES.txt (Avi Hayun) #270
- Readme.MD Overhaul of TOC, Installation, License (Avi Hayun) #311
- [URLs] Normalize URL without a scheme (Avi Hayun, sebastian-nagel) #271
- [Domains] EffectiveTldFinder: upgrade public suffix list / Download latest effective_tld_names.dat during Maven build (Richard Zowalla) #295, #302
- [URLs] Decode percent-encoded host names (sebastian-nagel) #303
- [Sitemaps] Document options strict and allowPartial in SiteMapParser constructors (sebastian-nagel) #267
- [Robots.txt] Maximum values (crawl-delay and warnings): document and make visible (sebastian-nagel, Avi Hayun) #276
- [Sitemaps] Replace priority "NaN" by default value (sebastian-nagel) #296
- [Sitemaps] Adding duration to the map generated by VideoAttributes.asMap (evanhalley) #300
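Two of the URL normalization changes in this release, sorting query parameters and optionally removing irrelevant ones, can be illustrated with a small standalone sketch. It does not use the BasicURLNormalizer API. The class `QueryNormalizationSketch`, its method, and the parameter name `utm_source` are assumptions chosen for illustration, and the sketch assumes URLs without a fragment.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.stream.Collectors;

/** Illustrative sketch of query parameter normalization; not the crawler-commons code. */
final class QueryNormalizationSketch {

    /**
     * Drops parameters whose name is in namesToDrop and sorts the rest
     * lexicographically by their "name=value" string.
     */
    static String normalizeQuery(String url, Set<String> namesToDrop) {
        int q = url.indexOf('?');
        if (q < 0) {
            return url; // no query part, nothing to normalize
        }
        String base = url.substring(0, q);
        String sorted = Arrays.stream(url.substring(q + 1).split("&"))
                .filter(param -> !param.isEmpty())
                .filter(param -> !namesToDrop.contains(nameOf(param)))
                .sorted()
                .collect(Collectors.joining("&"));
        return sorted.isEmpty() ? base : base + "?" + sorted;
    }

    /** Returns the part of a parameter before '=', or the whole parameter if there is none. */
    private static String nameOf(String param) {
        int eq = param.indexOf('=');
        return eq < 0 ? param : param.substring(0, eq);
    }

    public static void main(String[] args) {
        String url = "https://example.com/p?b=2&utm_source=x&a=1";
        // prints https://example.com/p?a=1&b=2
        System.out.println(normalizeQuery(url, Set.of("utm_source")));
    }
}
```

Sorting makes URLs that differ only in parameter order compare equal, which helps a crawler avoid fetching the same page twice.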
crawler-commons 1.1
crawler-commons-1.1 [maven-release-plugin] copy for tag crawler-commons-1.1
crawler-commons 1.0
crawler-commons-1.0 [maven-release-plugin] copy for tag crawler-commons-1.0
Release 0.10
- Add JAX-B dependencies to POM (jnioche) #207
- [Sitemaps] Add method to parse and iterate sitemap SiteMapParser#walkSiteMap(URL,Consumer) (Luc Boruta) #190
- [Sitemaps] Sitemap file location to ignore query part of URL (sebastian-nagel) #202
- [RSS sitemaps] Link extraction from RSS feeds fails on XML entities (sebastian-nagel) #204
- [RSS sitemaps] Resolve relative links in RSS feeds (sebastian-nagel) #203
- [RSS sitemaps] Extract links from elements (sebastian-nagel) #201
- [Sitemaps] Limit on "bad url" log messages (sebastian-nagel) #145
- EffectiveTldFinder to parse Internationalized Domain Names (sebastian-nagel) #179
- Add main() to EffectiveTldFinder (sebastian-nagel) #187
- Handle new suffixes in PaidLevelDomain (kkrugler) #183
- Remove Tika dependency (kkrugler) #199
- Improve MIME detection for sitemaps (sebastian-nagel) #200
- Make RobotRules accessible (aecio via kkrugler) #134
- SimpleRobotRulesParser: Expose MAX_WARNINGS and MAX_CRAWL_DELAY (aecio via kkrugler) #194
- Added main to SimpleRobotRulesParser for testing (sebastian-nagel) #193
- Allow for legacy URIs when checking sitemap namespaces (sebastian-nagel) #211
Release 0.9
- [Sitemaps] Removed DOM-based sitemap parser (jnioche) #177
- Incorrect domains returned by EffectiveTldFinder (sebastian-nagel) #172
- [Sitemaps] Add namespace aware DOM/SAX parsing for XML Sitemaps (Marko Milicevic, jnioche, sebastian-nagel) #176
- Upgraded Tika 1.16 (jnioche) #175
- [Sitemaps] Sitemap SAX parsing mangles target URLs (jnioche, sebastian-nagel) #169
- [Sitemaps] RSS parser ignores pubDate of link (MichealKum via kkrugler) #166
Release 0.8
- Upgrade to JDK 1.8 (lewismc) #126
- [Sitemaps] SitemapParser methods now protected (michaellavelle) #124
- [Sitemaps] Faster parsing of dates (jnioche) #117
- Upgraded Tika 1.13 (jnioche) #113
- Fix license headers (jnioche) #108
- Rename package crawlercommons.url (jnioche) #107
- Sitemap url is not extracted if user agent matches earlier in file (srwilson, kkrugler) #112
- Deprecate HTTP fetcher support (kkrugler) #92
- Added URLFilter interface + BasicURLNormalizer (jnioche) #106
- Updated tld names from publicsuffix.org (jnioche) #100
- Upgraded http-client to version 4.5.1 (aecio via kkrugler) #84
- Upgraded Tika 1.10 (jnioche) #89
- [Sitemaps] Upgrade Valid / Legal / Strict SitemapUrls (Avi Hayun) #82
- [Sitemaps] Upgrade Valid / Legal / Strict SitemapUrls (Avi Hayun) #60
- Simplify pom file (jnioche, lewismc) #77
- Upgrade javac.src.version and javac.target.version to 1.7 or 1.8 (lewismc) #93
- [Sitemaps] Not able to detect RSS feeds (yogendrasoni via kkrugler) #87
- [Robots] Added javadoc comments to the SimpleRobotRulesParser class (MichaelRoeder, kkrugler) #95
Release 0.7
- Upgrade to JDK 1.8 (lewismc) #126
- [Sitemaps] SitemapParser methods now protected (michaellavelle) #124
- [Sitemaps] Faster parsing of dates (jnioche) #117
- Upgraded Tika 1.13 (jnioche) #113
- Fix license headers (jnioche) #108
- Rename package crawlercommons.url (jnioche) #107
- Sitemap url is not extracted if user agent matches earlier in file (srwilson, kkrugler) #112
- Deprecate HTTP fetcher support (kkrugler) #92
- Added URLFilter interface + BasicURLNormalizer (jnioche) #106
- Updated tld names from publicsuffix.org (jnioche) #100
- Upgraded http-client to version 4.5.1 (aecio via kkrugler) #84
- Upgraded Tika 1.10 (jnioche) #89
- [Sitemaps] Upgrade Valid / Legal / Strict SitemapUrls (Avi Hayun) #82
- [Sitemaps] Upgrade Valid / Legal / Strict SitemapUrls (Avi Hayun) #60
- Simplify pom file (jnioche, lewismc) #77
- Upgrade javac.src.version and javac.target.version to 1.7 or 1.8 (lewismc) #93
- [Sitemaps] Not able to detect RSS feeds (yogendrasoni via kkrugler) #87
- [Robots] Added javadoc comments to the SimpleRobotRulesParser class (MichaelRoeder, kkrugler) #95
crawler-commons-0.6
Release 0.6 (27/05/2015)
- Issue 75: [Sitemaps] more robust parsing of XML elements (jnioche, kkrugler)
- Issue 76: maven-java-formatter-plugin (jnioche)
- Issue 73: Switch groupID in pom from com.google.code.crawler-commons to crawler-commons (jnioche)
- Issue 71: Upgrade to Tika 1.8 (jnioche)
- Issue 68: [Robots] Path matching should be case-sensitive (kkrugler)
- Issue 67: [Sitemaps] Parsing of lastMod date should use time portion (kkrugler)
- Issue 59: [Robots] Let SimpleRobotRules and its members implement the Serializable interface (kkrugler)
- Issue 65: [Sitemaps] Make SiteMapTool simpler by removing the Recursive flag (Avi Hayun)
- Issue 64: Upgraded to Tika 1.7 (jnioche)
- Issue 32: [Robots] Resolve relative URL for sitemaps (jnioche)
- Issue 62: [Sitemaps] Add new parseSiteMap method (jnioche)
- Issue 57: [Sitemaps] SiteMap should contain a list of SitemapUrls instead of a table of them (Avi Hayun)
- Issue 51: Upgrade httpclient to the latest version (Avi Hayun)
- Issue 61: [Sitemaps] Sitemap Parser changes the processed flag unnecessarily (Avi Hayun)
- Issue 56: [Sitemaps] SiteMap.setBaseUrl(...) causes the domain name to be lowercased, which shouldn't happen (Avi Hayun)
- Issue 50: Add Fetch Report to FetchedResult (lewismc, avraham2)
- Issue 55: [Sitemaps] SitemapUrl "setPriority(String str)" should check for proper value (Avi Hayun)