SAX-based sitemap parser #497
Conversation
better error handling
continue fetching sitemaps async, include nested sitemaps, include queued count in logging; store whether sitemap parsing was finished in redis, include in save/load, don't reparse if fully parsed
…mit is hit. when at limit, don't report any errors, close the xml stream and pass the end event to the root
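A minimal sketch of the "sitemap fully parsed" flag described in the commits above, assuming ioredis for crawl state; the key name and helper functions are illustrative assumptions, not the PR's actual implementation:

```ts
import Redis from "ioredis";

// Hypothetical key name -- the PR may store this differently.
const SITEMAP_DONE_KEY = "sitemapDone";

// Only reparse the sitemap if a previous run didn't finish it.
async function shouldParseSitemap(redis: Redis): Promise<boolean> {
  return (await redis.get(SITEMAP_DONE_KEY)) !== "1";
}

// Called once the sitemap and all nested sitemaps are fully parsed,
// so the flag round-trips through save/load of crawl state.
async function markSitemapDone(redis: Redis): Promise<void> {
  await redis.set(SITEMAP_DONE_KEY, "1");
}
```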
Proper review coming, but in the meantime, this might also be a nice time to add a section on sitemap parsing to the crawler docs (happy to push that myself).
Nicely done! Left a few nitpicky comments, but I've tested this now on a good number of sitemaps, including some sitemap indices/nested sitemaps, and it's working great!
I pushed a commit with a docs update. Once we get some tests here, I think it's good to go :)
if url ends in /robots.txt, parse as text
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
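For reference, pulling sitemap URLs out of a robots.txt body is just a line scan for `Sitemap:` directives; a self-contained sketch, not the PR's exact code:

```ts
// Extract sitemap URLs from a robots.txt body: one "Sitemap: <url>"
// directive per line, matched case-insensitively.
function sitemapsFromRobots(robotsText: string): string[] {
  const sitemaps: string[] = [];
  for (const line of robotsText.split(/\r?\n/)) {
    const match = line.match(/^\s*sitemap:\s*(\S+)/i);
    if (match) {
      sitemaps.push(match[1]);
    }
  }
  return sitemaps;
}
```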
if just --sitemap/--useSitemap given, then:
- first try parsing <seed>/robots.txt
- then try parsing <seed>/sitemap.xml

if a sitemap url is specified, then:
- fetch and detect content type, and parse as either xml or robots.txt based on extension and content-type
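A sketch of that discovery order; `parseRobots` and `parseXmlSitemap` are hypothetical placeholders, and a real implementation would reuse the fetched response rather than refetching:

```ts
// Hypothetical helpers -- illustrative names, not from the PR.
declare function parseRobots(url: string): Promise<boolean>;
declare function parseXmlSitemap(url: string): Promise<void>;

async function resolveSitemap(seedUrl: string, sitemapUrl?: string) {
  if (!sitemapUrl) {
    // Only --sitemap/--useSitemap given: try the well-known locations.
    if (await parseRobots(new URL("/robots.txt", seedUrl).href)) {
      return;
    }
    return parseXmlSitemap(new URL("/sitemap.xml", seedUrl).href);
  }

  // Specific sitemap URL given: choose the parser based on the
  // URL extension and the response content-type.
  const resp = await fetch(sitemapUrl);
  const contentType = resp.headers.get("content-type") || "";
  if (sitemapUrl.endsWith("/robots.txt") || contentType.includes("text/plain")) {
    await parseRobots(sitemapUrl);
  } else {
    await parseXmlSitemap(sitemapUrl);
  }
}
```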
… and specific URL
Adds a new SAX-based sitemap parser, inspired by: https://www.npmjs.com/package/sitemap-stream-parser
Supports:
- `from` and `to` filter dates, to only include URLs between the given dates
- `pageLimit`: don't add URLs past the page limit, and interrupt further parsing when at the limit

Fixes #496
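A condensed sketch of how such a SAX-based parser can apply the `from`/`to` and `pageLimit` options while streaming, using the `sax` package; the option names and callback shape are assumptions, and sitemap-index (nested `<sitemap>`) handling is omitted:

```ts
import sax from "sax";

interface SitemapOpts {
  from?: Date;        // only include URLs with <lastmod> >= from
  to?: Date;          // only include URLs with <lastmod> <= to
  pageLimit?: number; // stop adding URLs once this many are emitted
}

function parseSitemap(
  source: NodeJS.ReadableStream,
  opts: SitemapOpts,
  onUrl: (url: string) => void,
) {
  // loose mode + lowercase so tag names arrive as "url", "loc", "lastmod"
  const parser = sax.createStream(false, { lowercase: true });
  let currentTag = "";
  let loc = "";
  let lastmod = "";
  let count = 0;

  parser.on("opentag", (node) => {
    currentTag = node.name;
    if (node.name === "url") {
      loc = "";
      lastmod = "";
    }
  });

  parser.on("text", (text) => {
    if (currentTag === "loc") loc += text.trim();
    else if (currentTag === "lastmod") lastmod += text.trim();
  });

  parser.on("closetag", (name) => {
    currentTag = "";
    if (name !== "url" || !loc) return;

    // from/to filter against this entry's <lastmod>, when present
    if (lastmod) {
      const date = new Date(lastmod);
      if (opts.from && date < opts.from) return;
      if (opts.to && date > opts.to) return;
    }

    if (opts.pageLimit && count >= opts.pageLimit) {
      // at the limit: stop feeding the parser and suppress errors from
      // the truncated stream (per the commits above, the PR also passes
      // an end event to the root so nested sitemaps stop cleanly)
      source.unpipe(parser);
      parser.removeAllListeners("error");
      parser.on("error", () => {});
      return;
    }

    count++;
    onUrl(loc);
  });

  parser.on("error", (err) => {
    console.warn("sitemap parse error:", err.message);
  });

  source.pipe(parser);
}
```

Streaming via SAX events keeps memory flat even for very large sitemaps, and unpiping at the limit means the remainder of the file is never downloaded or parsed.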
TODO: Still need tests