Implement more efficient sitemap parsing. #496

Closed
ikreymer opened this issue Mar 17, 2024 · 0 comments · Fixed by #497

@ikreymer (Member)

The current sitemap parser reads the entire XML document into memory before parsing it, and it doesn't handle sitemap indexes / nested sitemaps.
There is an alternative, https://www.npmjs.com/package/sitemap-stream-parser, which uses SAX-based parsing and is much more efficient, but that parser has not been updated in a while. We should implement our own SAX-based parser.
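
For illustration, a minimal sketch of what SAX-based streaming parsing can look like with the `sax` package (the function name `parseSitemapUrls` and the callback shape are hypothetical, and this only extracts `<loc>` values, without any of the index or filtering logic):

```ts
import sax from "sax";
import { Readable } from "node:stream";

// Stream a sitemap and invoke onUrl for each <loc> entry, without
// buffering the full XML document in memory.
async function parseSitemapUrls(url: string, onUrl: (loc: string) => void) {
  const resp = await fetch(url);
  if (!resp.ok || !resp.body) {
    throw new Error(`failed to fetch sitemap: ${resp.status}`);
  }

  const saxStream = sax.createStream(false, { lowercase: true, trim: true });

  let inLoc = false;
  let loc = "";

  saxStream.on("opentag", (node) => {
    if (node.name === "loc") {
      inLoc = true;
      loc = "";
    }
  });

  saxStream.on("text", (text) => {
    if (inLoc) {
      loc += text;
    }
  });

  saxStream.on("closetag", (name) => {
    if (name === "loc") {
      inLoc = false;
      onUrl(loc.trim());
    }
  });

  await new Promise<void>((resolve, reject) => {
    saxStream.on("end", () => resolve());
    saxStream.on("error", reject);
    Readable.fromWeb(resp.body as any).pipe(saxStream);
  });
}
```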

@ikreymer ikreymer self-assigned this Mar 17, 2024
ikreymer added a commit that referenced this issue Mar 19, 2024
Adds a new SAX-based sitemap parser, inspired by:
https://www.npmjs.com/package/sitemap-stream-parser

Supports:
- recursively parsing sitemap indexes, using p-queue to process N at a
time (currently 5); see the sketch after this commit message
- `fromDate` and `toDate` date filters, to only include URLs between the given
dates, also applied when filtering nested sitemap lists
- async parsing: continue parsing in the background after the first 100 URLs
- timeout for the initial fetch / first 100 URLs set to 30 seconds to avoid
slowing down the crawl
- save/load state integration: mark in Redis whether sitemaps have already been
parsed and serialize that to the saved state, to avoid reparsing. (Will
reparse if parsing did not fully finish)
- aware of `pageLimit`: don't add URLs past the page limit, and interrupt
further parsing when at the limit
- robots.txt `sitemap:` parsing, checking URL extension and MIME type
- automatic detection of sitemaps for a seed URL if no sitemap URL is provided: first check robots.txt,
then /sitemap.xml
- tests: test for full sitemap autodetect, sitemap with limit, and sitemap from a specific URL

Fixes #496 

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
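
As a rough sketch of the recursive index handling described above (not the actual crawler code), p-queue can drive the recursion with a concurrency of 5; `ParseOne` stands in for whatever parses a single sitemap, e.g. the SAX sketch earlier, and reports nested `<sitemapindex>` entries:

```ts
import PQueue from "p-queue";

// Hypothetical signature: parse one sitemap, calling onNested for every
// nested sitemap found in a <sitemapindex>.
type ParseOne = (url: string, onNested: (nestedUrl: string) => void) => Promise<void>;

async function parseSitemapTree(rootUrl: string, parseOne: ParseOne): Promise<void> {
  const queue = new PQueue({ concurrency: 5 }); // process up to 5 sitemaps at a time
  const seen = new Set<string>();

  const enqueue = (url: string) => {
    if (seen.has(url)) {
      return; // guard against loops between sitemap indexes
    }
    seen.add(url);
    queue.add(() => parseOne(url, enqueue));
  };

  enqueue(rootUrl);
  await queue.onIdle(); // resolves once every queued sitemap has been processed
}
```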
ikreymer added a commit that referenced this issue Mar 26, 2024
- support parsing sitemap URLs that end in .gz with gzip decompression (see the sketch below)
- ignore extraHops for sitemap-found URLs by setting them past the extraHops limit
(otherwise, all sitemap URLs would be treated as links from the seed page)
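
A minimal sketch of the .gz handling, assuming a simple check on the URL extension and Node's built-in zlib (names are illustrative; the real code may also consider Content-Encoding or the content type):

```ts
import { Readable } from "node:stream";
import { createGunzip } from "node:zlib";

// Open a sitemap as a readable XML stream, gunzipping when the URL ends in .gz
async function openSitemapStream(url: string): Promise<Readable> {
  const resp = await fetch(url);
  if (!resp.ok || !resp.body) {
    throw new Error(`failed to fetch sitemap ${url}: ${resp.status}`);
  }
  const body = Readable.fromWeb(resp.body as any);
  return url.endsWith(".gz") ? body.pipe(createGunzip()) : body;
}
```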
ikreymer added a commit that referenced this issue Mar 26, 2024
sitemap fixes, follow up to #496
- support parsing sitemap URLs that end in .gz with gzip decompression
- support both `application/xml` and `text/xml` as valid sitemap
content-types (add tests for both)
- ignore extraHops for sitemap-found URLs by setting them past the extraHops
limit (otherwise, all sitemap URLs would be treated as links from the seed
page)
ikreymer added a commit that referenced this issue Mar 26, 2024
sitemap improvements: gz support + application/xml + extraHops fix #511
- follow up to
#496
- support parsing sitemap URLs that end in .gz with gzip decompression
- support both `application/xml` and `text/xml` as valid sitemap
content-types (add tests for both; see the content-type sketch below)
- ignore extraHops for sitemap-found URLs by setting them past the extraHops
limit (otherwise, all sitemap URLs would be treated as links from the seed
page)

fixes redirected seed (from #476) being counted against the page limit: #509
- subtract extraSeeds when computing the limit
- don't include redirect seeds in the seen list when serializing
- tests: adjust saved-state-test to also check total pages when the crawl is
done

fixes #508
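
For the content-type change, a sketch of the kind of check implied above (the function name is hypothetical; parameters such as `;charset=UTF-8` are stripped before comparing):

```ts
// Accept both application/xml and text/xml as valid sitemap content-types.
function isSitemapContentType(contentType: string | null): boolean {
  if (!contentType) {
    return false;
  }
  const mime = contentType.split(";")[0].trim().toLowerCase();
  return mime === "application/xml" || mime === "text/xml";
}

// e.g. isSitemapContentType(resp.headers.get("content-type"))
```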