SAX-based sitemap parser #497

Merged
merged 18 commits into from Mar 19, 2024

Commits on Mar 16, 2024

  1. sax based sitemap parser

    ikreymer committed Mar 16, 2024
    5075cbe

Commits on Mar 17, 2024

  1. daff9ef
  2. support sitemap index parsing, nested sitemap parsing

    better error handling
    ikreymer committed Mar 17, 2024
    9553d42
  3. remove old sitemapper

    ikreymer committed Mar 17, 2024
    aa02067
  4. refactor to use single queue for nested sitemaps

    continue fetching sitemaps async
    include nested sitemaps queued count in logging
    store if sitemap parsing was finished in redis, include in save/load, don't reparse if fully parsed
    ikreymer committed Mar 17, 2024
    f98f338
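
The "single queue for nested sitemaps" refactor above can be pictured roughly as follows. This is a minimal sketch, not the code from this PR: `crawlSitemaps`, `ParsedSitemap`, and `ParseFn` are hypothetical names, the parser is injected as a callback rather than being a real SAX stream, and the queue is drained sequentially for simplicity even though the commit mentions fetching asynchronously.

```typescript
// Hypothetical result of parsing one sitemap document: page URLs plus any
// nested sitemap URLs (as found in a sitemap index).
type ParsedSitemap = { pages: string[]; sitemaps: string[] };

// Hypothetical parser callback; a real one would fetch and stream-parse XML.
type ParseFn = (url: string) => Promise<ParsedSitemap>;

async function crawlSitemaps(
  rootUrl: string,
  parse: ParseFn,
  onPage: (url: string) => void,
): Promise<{ sitemapsQueued: number; errors: string[] }> {
  // One shared queue holds the root sitemap and every nested sitemap.
  const queue: string[] = [rootUrl];
  // Tracks sitemaps already queued so an index that links back to itself
  // cannot loop forever.
  const seen = new Set<string>([rootUrl]);
  const errors: string[] = [];
  let sitemapsQueued = 1;

  while (queue.length > 0) {
    const url = queue.shift() as string;
    try {
      const { pages, sitemaps } = await parse(url);
      for (const page of pages) {
        onPage(page);
      }
      for (const nested of sitemaps) {
        if (!seen.has(nested)) {
          seen.add(nested);
          queue.push(nested);
          sitemapsQueued++; // the count a log line could report
        }
      }
    } catch {
      // A failing nested sitemap is recorded, and the rest of the queue
      // is still processed.
      errors.push(url);
    }
  }
  return { sitemapsQueued, errors };
}
```

Keeping one queue rather than recursing per sitemap index makes it straightforward to report how many sitemaps have been queued and to stop everything from one place.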

Commits on Mar 18, 2024

  1. support passing in pageLimit, interrupting additional parsing when limit is hit.
    
    when at limit, don't report any errors, close xml stream and pass end event to the root
    ikreymer committed Mar 18, 2024
    af6e65d
  2. logging tweaks

    ikreymer committed Mar 18, 2024
    3b61f7b
  3. 8470687
  4. Improve user guide docs

    tw4l committed Mar 18, 2024
    e2024f5
  5. c39f4bd
  6. support parsing sitemap from robots.txt - if sitemap url ends in /robots.txt, parse as text

    ikreymer committed Mar 18, 2024
    7e9fe50
  7. Update src/util/sitemapper.ts

    Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
    ikreymer and tw4l committed Mar 18, 2024
    17d4a65
  8. Update src/util/sitemapper.ts

    Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
    ikreymer and tw4l committed Mar 18, 2024
    2fe9e19
  9. refactor sitemap detection:

    if just --sitemap/--useSitemap given, then:
    - first try parsing <seed>/robots.txt
    - then try parsing <seed>/sitemap.xml
    if sitemap url specified, then:
    - fetch and detect content type, and parse as either xml or robots.txt based on extension and content-type
    ikreymer committed Mar 18, 2024
    621f620
  10. a21979d
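
The robots.txt support and the detection order described in the Mar 18 commits could look roughly like the sketch below. All function names are hypothetical and not taken from `src/util/sitemapper.ts`. Two assumptions go beyond the commit messages: `<seed>` is resolved to the seed's origin, and a `text/plain` content type is treated as robots.txt while everything else is parsed as XML.

```typescript
// Extract sitemap URLs from robots.txt text. Sitemaps are listed on lines
// of the form "Sitemap: <url>"; the directive name is matched
// case-insensitively here.
function parseRobotsSitemaps(text: string): string[] {
  const urls: string[] = [];
  for (const line of text.split(/\r?\n/)) {
    const match = /^\s*sitemap:\s*(\S+)/i.exec(line);
    if (match) {
      urls.push(match[1]);
    }
  }
  return urls;
}

// Locations to try, in order, when only --sitemap/--useSitemap is given
// without an explicit URL. Assumption: both live at the seed's origin.
function defaultSitemapCandidates(seedUrl: string): string[] {
  const origin = new URL(seedUrl).origin;
  return [`${origin}/robots.txt`, `${origin}/sitemap.xml`];
}

type SitemapFormat = "robots" | "xml";

// Decide how to parse an explicitly specified sitemap URL from its path
// and the Content-Type of the response. The exact rules are an assumption.
function detectSitemapFormat(
  url: string,
  contentType: string | null,
): SitemapFormat {
  const path = new URL(url).pathname.toLowerCase();
  const type = (contentType ?? "").toLowerCase();
  if (path.endsWith("/robots.txt") || type.startsWith("text/plain")) {
    return "robots";
  }
  return "xml";
}
```

With these helpers, a caller would try each candidate in turn and feed any URLs found in robots.txt into the same sitemap queue as XML sitemaps.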

Commits on Mar 19, 2024

  1. 199fd4e
  2. cb18d64
  3. store sitemapDoneKey

    ikreymer committed Mar 19, 2024
    2497189
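
The "store if sitemap parsing was finished" behaviour, which the last commit ties to a `sitemapDoneKey`, might be sketched as below. This is an illustration under stated assumptions: `KeyValueStore`, `MemoryStore`, `parseSitemapOnce`, and the key string are all hypothetical, and an in-memory map stands in for Redis so the example is self-contained.

```typescript
// Minimal key-value interface; a Redis client wrapper could provide the
// same shape.
interface KeyValueStore {
  get(key: string): Promise<string | null>;
  set(key: string, value: string): Promise<void>;
}

// In-memory stand-in for Redis, for illustration only.
class MemoryStore implements KeyValueStore {
  private data = new Map<string, string>();

  async get(key: string): Promise<string | null> {
    return this.data.get(key) ?? null;
  }

  async set(key: string, value: string): Promise<void> {
    this.data.set(key, value);
  }
}

// Hypothetical key name; the actual key is not shown in the commit list.
const SITEMAP_DONE_KEY = "example:sitemapDone";

// Runs parseAll unless a previous run already finished.
// Returns true if parsing ran, false if it was skipped.
async function parseSitemapOnce(
  store: KeyValueStore,
  parseAll: () => Promise<void>,
): Promise<boolean> {
  if ((await store.get(SITEMAP_DONE_KEY)) === "1") {
    return false; // already fully parsed, do not reparse
  }
  await parseAll();
  // Only reached when parseAll completed without throwing, so an
  // interrupted parse is retried on the next run.
  await store.set(SITEMAP_DONE_KEY, "1");
  return true;
}
```

Because the flag lives in the same store as the rest of the crawl state, it would be saved and restored along with it, which matches the "include in save/load" note in the earlier commit.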