Implement more efficient sitemap parsing. #496

Closed
ikreymer opened this issue Mar 17, 2024 · 0 comments · Fixed by #497

@ikreymer (Member)

The current sitemap parser reads the entire XML document into memory before parsing it, and it doesn't handle sitemap indexes / nested sitemaps.
There is an alternative, https://www.npmjs.com/package/sitemap-stream-parser, which uses SAX-based parsing and is much more efficient, but that parser has not been updated in a while. We should implement our own SAX-based parser.
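
For illustration, a minimal sketch of what SAX-based streaming parsing can look like with the `sax` package (the function name `parseSitemapUrls` and the callback shape are hypothetical, and this only extracts `<loc>` values, without any of the index or filtering logic):

```ts
import sax from "sax";
import { Readable } from "node:stream";

// Stream a sitemap and invoke onUrl for each <loc> entry, without
// buffering the full XML document in memory.
async function parseSitemapUrls(url: string, onUrl: (loc: string) => void) {
  const resp = await fetch(url);
  if (!resp.ok || !resp.body) {
    throw new Error(`failed to fetch sitemap: ${resp.status}`);
  }

  const saxStream = sax.createStream(false, { lowercase: true, trim: true });

  let inLoc = false;
  let loc = "";

  saxStream.on("opentag", (node) => {
    if (node.name === "loc") {
      inLoc = true;
      loc = "";
    }
  });

  saxStream.on("text", (text) => {
    if (inLoc) {
      loc += text;
    }
  });

  saxStream.on("closetag", (name) => {
    if (name === "loc") {
      inLoc = false;
      onUrl(loc.trim());
    }
  });

  await new Promise<void>((resolve, reject) => {
    saxStream.on("end", () => resolve());
    saxStream.on("error", reject);
    Readable.fromWeb(resp.body as any).pipe(saxStream);
  });
}
```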

@ikreymer ikreymer self-assigned this Mar 17, 2024
ikreymer added a commit that referenced this issue Mar 19, 2024
Adds a new SAX-based sitemap parser, inspired by:
https://www.npmjs.com/package/sitemap-stream-parser

Supports:
- recursively parsing sitemap indexes, using p-queue to process N at a
time (currently 5); see the sketch after this commit message
- `fromDate` and `toDate` date filters, to only include URLs between the given
dates, also applied when filtering nested sitemap lists
- async parsing: continue parsing in the background after the first 100 URLs
- timeout for the initial fetch / first 100 URLs set to 30 seconds to avoid
slowing down the crawl
- save/load state integration: mark in Redis whether sitemaps have already been
parsed and serialize that to the saved state, to avoid reparsing. (Will
reparse if parsing did not fully finish)
- aware of `pageLimit`: don't add URLs past the page limit, and interrupt
further parsing when at the limit
- robots.txt `sitemap:` parsing, checking URL extension and MIME type
- automatic detection of sitemaps for a seed URL if no sitemap URL is provided: first check robots.txt,
then /sitemap.xml
- tests: test for full sitemap autodetect, sitemap with limit, and sitemap from a specific URL

Fixes #496 

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
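
As a rough sketch of the recursive index handling described above (not the actual crawler code), p-queue can drive the recursion with a concurrency of 5; `ParseOne` stands in for whatever parses a single sitemap, e.g. the SAX sketch earlier, and reports nested `<sitemapindex>` entries:

```ts
import PQueue from "p-queue";

// Hypothetical signature: parse one sitemap, calling onNested for every
// nested sitemap found in a <sitemapindex>.
type ParseOne = (url: string, onNested: (nestedUrl: string) => void) => Promise<void>;

async function parseSitemapTree(rootUrl: string, parseOne: ParseOne): Promise<void> {
  const queue = new PQueue({ concurrency: 5 }); // process up to 5 sitemaps at a time
  const seen = new Set<string>();

  const enqueue = (url: string) => {
    if (seen.has(url)) {
      return; // guard against loops between sitemap indexes
    }
    seen.add(url);
    queue.add(() => parseOne(url, enqueue));
  };

  enqueue(rootUrl);
  await queue.onIdle(); // resolves once every queued sitemap has been processed
}
```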
ikreymer added a commit that referenced this issue Mar 26, 2024
- support parsing sitemap URLs that end in .gz with gzip decompression (see the sketch below)
- ignore extraHops for sitemap-found URLs by setting them past the extraHops limit
(otherwise, all sitemap URLs would be treated as links from the seed page)
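
A minimal sketch of the .gz handling, assuming a simple check on the URL extension and Node's built-in zlib (names are illustrative; the real code may also consider Content-Encoding or the content type):

```ts
import { Readable } from "node:stream";
import { createGunzip } from "node:zlib";

// Open a sitemap as a readable XML stream, gunzipping when the URL ends in .gz
async function openSitemapStream(url: string): Promise<Readable> {
  const resp = await fetch(url);
  if (!resp.ok || !resp.body) {
    throw new Error(`failed to fetch sitemap ${url}: ${resp.status}`);
  }
  const body = Readable.fromWeb(resp.body as any);
  return url.endsWith(".gz") ? body.pipe(createGunzip()) : body;
}
```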
ikreymer added a commit that referenced this issue Mar 26, 2024
sitemap fixes, follow up to #496
- support parsing sitemap URLs that end in .gz with gzip decompression
- support both `application/xml` and `text/xml` as valid sitemap
content-types (add tests for both)
- ignore extraHops for sitemap-found URLs by setting them past the extraHops
limit (otherwise, all sitemap URLs would be treated as links from the seed
page)
ikreymer added a commit that referenced this issue Mar 26, 2024
sitemap improvements: gz support + application/xml + extraHops fix #511
- follow up to
#496
- support parsing sitemap URLs that end in .gz with gzip decompression
- support both `application/xml` and `text/xml` as valid sitemap
content-types (add tests for both; see the content-type sketch below)
- ignore extraHops for sitemap-found URLs by setting them past the extraHops
limit (otherwise, all sitemap URLs would be treated as links from the seed
page)

fixes redirected seed (from #476) being counted against the page limit: #509
- subtract extraSeeds when computing the limit
- don't include redirect seeds in the seen list when serializing
- tests: adjust saved-state-test to also check total pages when the crawl is
done

fixes #508
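
For the content-type change, a sketch of the kind of check implied above (the function name is hypothetical; parameters such as `;charset=UTF-8` are stripped before comparing):

```ts
// Accept both application/xml and text/xml as valid sitemap content-types.
function isSitemapContentType(contentType: string | null): boolean {
  if (!contentType) {
    return false;
  }
  const mime = contentType.split(";")[0].trim().toLowerCase();
  return mime === "application/xml" || mime === "text/xml";
}

// e.g. isSitemapContentType(resp.headers.get("content-type"))
```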