SAX-based sitemap parser #497

ikreymer · 2024-03-17T04:00:34Z

Adds a new SAX-based sitemap parser, inspired by: https://www.npmjs.com/package/sitemap-stream-parser

Supports:

- Recursive parsing of sitemap indexes, using p-queue to process N at a time (currently 5)
- from and to filter dates, to include only URLs between the given dates
- Async parsing: parsing continues in the background after the first 100 URLs
- Timeout for the initial fetch / first 100 URLs, set to 30 seconds to avoid slowing down the crawl
- Save/load state integration: marks in redis whether sitemaps have already been parsed, and serializes this to the save state, to avoid reparsing. (Will reparse if parsing did not fully finish.)
- Aware of pageLimit: doesn't add URLs past the page limit, and interrupts further parsing when at the limit.
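
To illustrate how the date filter and the page limit could interact, here is a minimal sketch. It is not the code from this PR: the `SitemapEntry` and `FilterOpts` shapes and the `selectEntries` name are hypothetical, and keeping entries that have no `<lastmod>` is an assumption rather than confirmed behavior.

```typescript
// Hypothetical shape of one parsed <url> entry (not the PR's actual type).
interface SitemapEntry {
  url: string;
  lastmod?: string; // raw <lastmod> text, if the sitemap provided one
}

// Hypothetical options mirroring the from/to dates and pageLimit described above.
interface FilterOpts {
  fromDate?: Date;
  toDate?: Date;
  pageLimit?: number; // 0 or undefined is treated as "no limit"
}

// Returns the URLs that would be queued, stopping as soon as pageLimit is reached.
function selectEntries(
  entries: Iterable<SitemapEntry>,
  opts: FilterOpts,
): string[] {
  const accepted: string[] = [];
  for (const entry of entries) {
    if (opts.pageLimit && accepted.length >= opts.pageLimit) {
      break; // at the limit: stop consuming further entries
    }
    if (entry.lastmod) {
      const modified = new Date(entry.lastmod).getTime();
      // Assumption: an unparseable date is treated like a missing one.
      if (!isNaN(modified)) {
        if (opts.fromDate && modified < opts.fromDate.getTime()) continue;
        if (opts.toDate && modified > opts.toDate.getTime()) continue;
      }
    }
    // Assumption: entries without a usable lastmod are kept.
    accepted.push(entry.url);
  }
  return accepted;
}
```

Because the function takes any iterable and breaks out of the loop at the limit, a streaming producer would not be read past the last accepted URL, which is the same idea as interrupting the parse when the limit is hit.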

Fixes #496

TODO: Still need tests


tw4l · 2024-03-18T14:18:00Z

Proper review coming, but in the meantime, this might also be a nice time to add a section on sitemap parsing to the crawler docs (happy to push that myself).

tw4l

Nicely done! Left a few nitpicky comments, but I've tested this now on a good number of sitemaps, including some sitemap indices/nested sitemaps, and it's working great!

I pushed a commit with a docs update. Once we get some tests here, I think it's good to go :)


ikreymer added 5 commits March 16, 2024 16:24

sax based sitemap parser

5075cbe

single sitemap with callbacks

daff9ef

support sitemap index parsing, nested sitemap parsing

9553d42

better error handling

remove old sitemapper

aa02067

refactor to use single queue for nested sitemaps

f98f338

continue fetching sitemaps async; include nested sitemaps queued count in logging; store if sitemap parsing was finished in redis, include in save/load, don't reparse if fully parsed

ikreymer marked this pull request as draft March 17, 2024 16:34

ikreymer requested a review from tw4l March 17, 2024 16:34

ikreymer added 2 commits March 17, 2024 21:32

support passing in pageLimit, interrupting additional parsing when li…

af6e65d

…mit is hit. when at limit, don't report any errors, close xml stream and pass end event to the root

logging tweaks

3b61f7b

ikreymer marked this pull request as ready for review March 18, 2024 07:22

Add details on sitemap parsing to docs user guide

8470687

tw4l approved these changes Mar 18, 2024

Review comments (all outdated and resolved):

src/util/sitemapper.ts (3 comments)

src/util/state.ts (1 comment)

tw4l and others added 10 commits March 18, 2024 12:55

Improve user guide docs

e2024f5

simplify, remove nested sitemap, just handle in a single object

c39f4bd

support parsing sitemap from robots.txt - if sitemap url

7e9fe50

ends in /robots.txt, parse as text

Update src/util/sitemapper.ts

17d4a65

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>

Update src/util/sitemapper.ts

2fe9e19

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>

refactor sitemap detection:

621f620

if just --sitemap/--useSitemap given, then:
- first try parsing <seed>/robots.txt
- then try parsing <seed>/sitemap.xml

if sitemap url specified, then:
- fetch and detect content type, and parse as either xml or robots.txt based on extension and content-type
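
The detection step described in this commit message can be sketched roughly as follows. This is an illustration only: `detectSitemapKind` and `extractSitemapUrls` are hypothetical names, and treating a `text/plain` content type as robots.txt is an assumption about how "based on extension and content-type" might be decided.

```typescript
type SitemapKind = "robots" | "xml";

// Decide how to parse a fetched sitemap resource from its URL and Content-Type.
function detectSitemapKind(
  url: string,
  contentType: string | null,
): SitemapKind {
  // Drop any query string or fragment before checking the path suffix.
  const path = url.split(/[?#]/)[0];
  // Keep only the media type, e.g. "text/plain" from "text/plain; charset=utf-8".
  const mediaType = (contentType || "").split(";")[0].trim().toLowerCase();
  if (path.endsWith("/robots.txt") || mediaType === "text/plain") {
    return "robots";
  }
  return "xml";
}

// Collect the URLs from "Sitemap:" lines in a robots.txt body (case-insensitive).
function extractSitemapUrls(robotsText: string): string[] {
  const urls: string[] = [];
  for (const line of robotsText.split(/\r?\n/)) {
    const match = /^\s*sitemap\s*:\s*(\S+)/i.exec(line);
    if (match) {
      urls.push(match[1]);
    }
  }
  return urls;
}
```

With helpers like these, the seed-only case would call the robots.txt path first and fall back to `<seed>/sitemap.xml` when no `Sitemap:` lines are found.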

add last-modified date check to sitemap parser as well

a21979d

support retries to fetch() if Retry-After header provided
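
As a rough illustration of honoring Retry-After, the sketch below converts the header value into a delay in milliseconds. The header may carry either a number of seconds or an HTTP date, and both forms are handled. `parseRetryAfter` is a hypothetical helper, not the function added in this commit, and returning `null` to mean "do not retry" is an assumption.

```typescript
// Convert a Retry-After header value into a delay in milliseconds.
// Returns null when the header is missing or cannot be interpreted.
function parseRetryAfter(
  value: string | null,
  now: number = Date.now(),
): number | null {
  if (!value) {
    return null;
  }
  const trimmed = value.trim();
  // Form 1: delay in whole seconds, e.g. "120".
  if (/^\d+$/.test(trimmed)) {
    return parseInt(trimmed, 10) * 1000;
  }
  // Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
  const when = Date.parse(trimmed);
  if (isNaN(when)) {
    return null;
  }
  // A date in the past means the retry can happen immediately.
  return Math.max(0, when - now);
}
```

A caller would typically wait for the returned delay, retry the fetch, and cap the number of attempts so that a misbehaving server cannot stall sitemap parsing.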

199fd4e

tests: add sitemap-parse-text for testing auto-detection, with limits…

cb18d64

… and specific URL

store sitemapDoneKey

2497189

ikreymer merged commit 5605353 into main Mar 19, 2024
4 checks passed

ikreymer deleted the sax-sitemap-parser branch March 19, 2024 02:14
