SAX-based sitemap parser #497

Merged
merged 18 commits into from Mar 19, 2024

Member

@ikreymer ikreymer commented Mar 17, 2024

Adds a new SAX-based sitemap parser, inspired by: https://www.npmjs.com/package/sitemap-stream-parser

Supports:

  • recursively parsing sitemap indexes, using p-queue to process N at a time (currently 5)
  • from and to filter dates, to only include URLs between the given dates
  • async parsing, continue parsing in the background after 100 URLs
  • timeout for initial fetch / first 100 URLs set to 30 seconds to avoid slowing down the crawl
  • save/load state integration: mark in redis whether sitemaps have already been parsed, and serialize that flag to the save state, to avoid reparsing. (Will reparse if parsing did not fully finish)
  • Aware of pageLimit: don't add URLs past the page limit, and interrupt further parsing when at the limit.
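
As a rough illustration of the from/to filter and the pageLimit cutoff listed above, here is a minimal sketch. It is not the code in this PR: `SitemapEntry`, `FilterOpts`, `inDateRange` and `collectUrls` are hypothetical names, and keeping entries that have no usable `lastmod` is an assumption of the sketch rather than confirmed behavior.

```typescript
// Hypothetical shape of one parsed <url> entry; the real parser's types may differ.
interface SitemapEntry {
  loc: string;
  lastmod?: string; // date string taken from <lastmod>, if present
}

// Hypothetical options object for this sketch.
interface FilterOpts {
  from?: Date;
  to?: Date;
  pageLimit?: number; // 0 or undefined is treated as "no limit" (assumption)
}

// Returns true if the entry falls inside the [from, to] window.
// Entries without a parseable lastmod are kept (assumption of this sketch).
function inDateRange(entry: SitemapEntry, opts: FilterOpts): boolean {
  if (!entry.lastmod) return true;
  const time = Date.parse(entry.lastmod);
  if (Number.isNaN(time)) return true;
  if (opts.from && time < opts.from.getTime()) return false;
  if (opts.to && time > opts.to.getTime()) return false;
  return true;
}

// Collects URLs until the page limit is reached, then stops iterating,
// which stands in for interrupting the parse once the limit is hit.
function collectUrls(
  entries: Iterable<SitemapEntry>,
  opts: FilterOpts,
): string[] {
  const urls: string[] = [];
  for (const entry of entries) {
    if (opts.pageLimit && urls.length >= opts.pageLimit) break;
    if (inDateRange(entry, opts)) urls.push(entry.loc);
  }
  return urls;
}
```

In a streaming parser the same checks would run per entry as each `<url>` element closes, so the stream can be closed as soon as the limit is reached instead of reading the rest of the file.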

Fixes #496

TODO: Still need tests

continue fetching sitemaps async
include nested sitemaps queued count in logging
store if sitemap parsing was finished in redis, include in save/load, don't reparse
if fully parsed
@ikreymer ikreymer marked this pull request as draft March 17, 2024 16:34
@ikreymer ikreymer requested a review from tw4l March 17, 2024 16:34
…mit is hit.

when at limit, don't report any errors, close xml stream and pass end event to the root
@ikreymer ikreymer marked this pull request as ready for review March 18, 2024 07:22
@tw4l
Contributor

tw4l commented Mar 18, 2024

Proper review coming, but in the meantime, this might also be a nice time to add a section on sitemap parsing to the crawler docs (happy to push that myself).

Contributor

@tw4l tw4l left a comment

Nicely done! Left a few nitpicky comments, but I've tested this now on a good number of sitemaps, including some sitemap indices/nested sitemaps, and it's working great!

I pushed a commit with a docs update. Once we get some tests here, I think it's good to go :)

Review comments (resolved, now outdated): three on src/util/sitemapper.ts, one on src/util/state.ts
tw4l and others added 10 commits March 18, 2024 12:55
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
if just --sitemap/--useSitemap given, then
- first try parsing <seed>/robots.txt
- then try parsing <seed>/sitemap.xml
if sitemap url specified, then:
- fetch and detect content type, and parse as either xml or robots.txt based on extension
and content-type
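
The detection step described in this commit message could look roughly like the sketch below. This is only an illustration under stated assumptions, not the merged implementation: `detectKind` and `sitemapsFromRobots` are hypothetical helper names, and the exact set of extensions and content types checked here is assumed.

```typescript
type SitemapKind = "xml" | "robots";

// Decide how to parse a fetched sitemap URL, using the path extension first
// and the Content-Type header as a fallback. Returns null when neither helps.
// The specific extensions and MIME types below are assumptions of this sketch.
function detectKind(url: string, contentType: string | null): SitemapKind | null {
  // Drop any query string or fragment before checking the extension.
  const [path = ""] = url.toLowerCase().split(/[?#]/);
  // Drop parameters such as "; charset=utf-8" from the header value.
  const [mime = ""] = (contentType ?? "").toLowerCase().split(";");
  const type = mime.trim();

  if (path.endsWith(".xml") || type === "application/xml" || type === "text/xml") {
    return "xml";
  }
  if (path.endsWith("robots.txt") || type === "text/plain") {
    return "robots";
  }
  return null;
}

// Extract sitemap URLs from the "Sitemap:" lines of a robots.txt body.
function sitemapsFromRobots(text: string): string[] {
  const urls: string[] = [];
  for (const line of text.split(/\r?\n/)) {
    const match = line.match(/^\s*sitemap:\s*(\S+)/i);
    const found = match?.[1];
    if (found) urls.push(found);
  }
  return urls;
}
```

With helpers like these, the seed-only case would try `<seed>/robots.txt` first and fall back to `<seed>/sitemap.xml` when no `Sitemap:` lines are found, while an explicit sitemap URL would go through `detectKind` after the fetch.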
@ikreymer ikreymer merged commit 5605353 into main Mar 19, 2024
4 checks passed
@ikreymer ikreymer deleted the sax-sitemap-parser branch March 19, 2024 02:14
Linked issue (may be closed by this pull request): Implement more efficient sitemap parsing.
2 participants