QA Crawl Support (Beta) #469

Merged
merged 50 commits into from Mar 23, 2024

Commits on Feb 20, 2024

  1. convert driver to a class that supports crawlPage, setupPage and teardownPage, instead of a single crawlPage function.

    setupPage / teardownPage are called when a page is created / destroyed
    ikreymer committed Feb 20, 2024
    e1e7743
  2. 2c0617c
  3. 6827788
  4. a00176b
  5. replace driver with ReplayCrawler subclass

    keep track of page resources
    ikreymer committed Feb 20, 2024
    7cc741a
  6. load WACZ page list directly (via wabac.js ZipRangeReader)

    crawler: add overridable _addInitialSeeds() function
    crawler: store archivesDir
    reload RWP frame if not loaded in SW after 10 secs
    support max replay pages via --limit
    store 'pageinfo' records in info.warc.gz
    ikreymer committed Feb 20, 2024
    540efeb
  7. types: fix types for WARCResourceWriter / textextract / screenshots

    make skipping first N text docs configurable, set to 2 for replaycrawler, 0 by default
    tests: fix tests due to missing text
    ikreymer committed Feb 20, 2024
    db491fc
  8. resources pageinfo, include redirects

    reload timeout: track per page
    ikreymer committed Feb 20, 2024
    a8869f7
  9. cefdf52
  10. add qa option to parseArgs, requires --replaySource but not --seeds

    add 'qa' entrypoint to crawler which enables qa mode
    ikreymer committed Feb 20, 2024
    7787d8a
  11. diff work: add screenshot, text, and resource comparisons!

    (not yet storing)
    ikreymer committed Feb 20, 2024
    d833e2a
  12. 7b8ab4b
  13. typo fixes

    ikreymer committed Feb 20, 2024
    222ef1d
  14. experiment with reloading page after initial load (disabled), add deepLink to allow reloading

    resources: filter out POST requests
    loading: add check for WACZ loading if resources is not available
    ikreymer committed Feb 20, 2024
    1791f16
  15. e15d25d
  16. rename --replaySource -> --qaSource

    add --qaDebugImageDiff to enable per-page crawl.png / replay.png / diff.png output
    support qaSource from file system (via blob), as well as URL
    ikreymer committed Feb 20, 2024
    59382a3
  17. aca1a64
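
The first Feb 20 commit above replaces a single crawlPage function with a driver class that also has setupPage and teardownPage hooks. The lifecycle can be sketched roughly as below. `PageHandle`, `BaseDriver`, `LoggingDriver` and `runWorker` are hypothetical names chosen for illustration; they are not taken from the PR, and the real crawler uses its own page and worker types.

```typescript
// Hypothetical page handle; a real crawler would use its browser library's page type.
interface PageHandle {
  url: string;
}

// Sketch of a driver class with per-page lifecycle hooks (default hooks do nothing).
class BaseDriver {
  // Called once when a page is created.
  async setupPage(_page: PageHandle): Promise<void> {}

  // Called for each URL loaded in that page.
  async crawlPage(_page: PageHandle): Promise<void> {}

  // Called once when the page is destroyed.
  async teardownPage(_page: PageHandle): Promise<void> {}
}

// Example subclass that records the order in which the hooks run.
class LoggingDriver extends BaseDriver {
  readonly calls: string[] = [];

  override async setupPage(_page: PageHandle): Promise<void> {
    this.calls.push("setup");
  }

  override async crawlPage(page: PageHandle): Promise<void> {
    this.calls.push(`crawl:${page.url}`);
  }

  override async teardownPage(_page: PageHandle): Promise<void> {
    this.calls.push("teardown");
  }
}

// Hypothetical worker loop: one page is reused for several URLs and is
// always torn down, even if crawlPage throws.
async function runWorker(driver: BaseDriver, urls: string[]): Promise<void> {
  const page: PageHandle = { url: "" };
  await driver.setupPage(page);
  try {
    for (const url of urls) {
      page.url = url;
      await driver.crawlPage(page);
    }
  } finally {
    await driver.teardownPage(page);
  }
}
```

Splitting setup and teardown out of crawlPage lets a subclass (such as the ReplayCrawler mentioned in commit 5) prepare per-page state once rather than on every navigation.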

Commits on Feb 21, 2024

  1. replayserver: support serving sw.js directly, make RWP version configurable, using CDN version
    
    replayserver: if local file path specified, support serving local file under /source.{wacz,json}, support range requests
    ikreymer committed Feb 21, 2024
    bad67a0
  2. replay: install RWP files directly into image on build, instead of loading from cdn during crawl time

    ikreymer committed Feb 21, 2024
    3617bb6
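
The first Feb 21 commit above has the replay server serve a local file under /source.{wacz,json} with support for range requests. As an illustration of what handling a single-range `Range: bytes=...` header involves under standard HTTP semantics, here is a minimal sketch. `ByteRange`, `parseRange` and `contentRange` are hypothetical names, and the PR's actual implementation may differ.

```typescript
interface ByteRange {
  start: number; // first byte offset, inclusive
  end: number; // last byte offset, inclusive
}

// Parse a single-range "bytes=start-end" header against a file of `size` bytes.
// Returns null when the header is absent, malformed, or cannot be satisfied.
function parseRange(header: string | undefined, size: number): ByteRange | null {
  if (!header) return null;
  const m = /^bytes=(\d*)-(\d*)$/.exec(header.trim());
  if (!m) return null;
  const startStr = m[1] ?? "";
  const endStr = m[2] ?? "";
  if (startStr === "" && endStr === "") return null;

  let start: number;
  let end: number;
  if (startStr === "") {
    // Suffix form "bytes=-N": the last N bytes of the file.
    const suffix = Number(endStr);
    if (suffix === 0) return null;
    start = Math.max(size - suffix, 0);
    end = size - 1;
  } else {
    start = Number(startStr);
    // Open-ended form "bytes=N-": from offset N to the end of the file.
    end = endStr === "" ? size - 1 : Math.min(Number(endStr), size - 1);
  }
  if (start > end || start >= size) return null;
  return { start, end };
}

// Value of the Content-Range header for a 206 Partial Content response.
function contentRange(range: ByteRange, size: number): string {
  return `bytes ${range.start}-${range.end}/${size}`;
}
```

Range support matters here because a WACZ is a ZIP file, and a reader can fetch only the byte spans it needs instead of downloading the whole archive.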

Commits on Feb 29, 2024

  1. fb9de39
  2. fixes for 1.0.0-beta.5 merge

    ikreymer committed Feb 29, 2024
    0e0d74e

Commits on Mar 5, 2024

  1. c987424

Commits on Mar 7, 2024

  1. 2d85f2d

Commits on Mar 8, 2024

  1. misc qa work:

    - ensure original pageid is used for qa'd pages
    - use standard ':qa' key to write qa comparison data with --qaWriteToRedis
    - print crawl stats in qa
    - include title + favicons in qa
    ikreymer committed Mar 8, 2024
    c4231e5
  2. 5c42549
  3. qa: consolidate comparison data into pages data added to redis

    - add pageEntryForRedis() overridable in replaycrawler to add 'comparison' data
    - add separate type for ComparisonData
    - add comparison data for processPageInfo, if pagestate is available
    - additional type fixes
    - remove --qaWriteToRedis, now included with page data
    ikreymer committed Mar 8, 2024
    4f4f7a1
  4. tests: add qa comparison test:

    - run crawl with 3 pages, text/screenshots enabled
    - run qa crawl using resulting WACZ
    - enable writing pages to redis
    - verify comparison data is included in page data added to redis ':pages' key while crawl is running
    ikreymer committed Mar 8, 2024
    5a1b2a9
  5. 0a1018a
  6. 0abfaac
  7. 3a9ffd8
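
Commit 3 of the Mar 8 group adds an overridable pageEntryForRedis() so that the replay crawler can attach 'comparison' data to each page entry, keyed by the original page id. A minimal sketch of that override pattern follows. The field names inside `ComparisonData`, the `PageEntry` shape, the method signature, and the `BaseCrawlerSketch` / `ReplayCrawlerSketch` classes are all assumptions made for illustration, not the PR's actual definitions.

```typescript
// Hypothetical shape for per-page QA results; field names are assumptions.
type ComparisonData = {
  screenshotMatch?: number; // similarity score for crawl vs. replay screenshots
  textMatch?: number; // similarity score for extracted text
};

// Hypothetical page entry as it would be serialized into the page list.
type PageEntry = {
  id: string;
  url: string;
  title?: string;
  comparison?: ComparisonData;
};

class BaseCrawlerSketch {
  // Default: the entry is written unchanged.
  pageEntryForRedis(entry: PageEntry): PageEntry {
    return entry;
  }
}

class ReplayCrawlerSketch extends BaseCrawlerSketch {
  // Comparison results collected during QA, keyed by the original page id.
  private readonly comparisons = new Map<string, ComparisonData>();

  setComparison(pageId: string, data: ComparisonData): void {
    this.comparisons.set(pageId, data);
  }

  // Override: attach comparison data when it exists for this page.
  override pageEntryForRedis(entry: PageEntry): PageEntry {
    const comparison = this.comparisons.get(entry.id);
    return comparison ? { ...entry, comparison } : entry;
  }
}
```

Folding the comparison into the page entry means a consumer reads one record per page, which matches the commit's removal of the separate --qaWriteToRedis flag.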

Commits on Mar 11, 2024

  1. support loading multi-wacz .json files locally

    support parsing out the query string when detecting file type
    ikreymer committed Mar 11, 2024
    d7d6558
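
The Mar 11 commit above mentions parsing out the query string when detecting the file type of the QA source. One way to do that, shown here only as a sketch, is to drop everything from the first `?` or `#` before checking the extension, which works for both URLs and local paths. `SourceType` and `detectSourceType` are hypothetical names and not the PR's code.

```typescript
type SourceType = "wacz" | "json" | "unknown";

// Decide the source type from its extension, ignoring any query string or fragment.
function detectSourceType(source: string): SourceType {
  // Keep only the part before the first "?" or "#".
  const path = (source.split(/[?#]/, 1)[0] ?? "").toLowerCase();
  if (path.endsWith(".wacz")) return "wacz";
  if (path.endsWith(".json")) return "json";
  return "unknown";
}
```

Without this step, a signed URL such as `https://example.com/crawl.wacz?sig=abc` would not appear to end in `.wacz`, and a query value like `?name=x.wacz` could be mistaken for the file extension.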

Commits on Mar 12, 2024

  1. qa crawl init: support loading pages from json file if 'pages' key is specified, otherwise load from 'resources'

    ikreymer committed Mar 12, 2024
    aa4ecd5
  2. disable CORS for replaycrawler (for now) to allow loading any existing WACZ from 'localhost' for replay QA

    ikreymer committed Mar 12, 2024
    8d0f411
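
The first Mar 12 commit above describes a fallback when initializing a QA crawl from a JSON file: use the 'pages' key if it is present, otherwise load from 'resources'. That decision can be sketched as follows. Only the key names 'pages' and 'resources' come from the commit message; the `PageRef` and `ResourceRef` shapes (including the `path` field) and the `choosePageSource` function are assumptions for illustration.

```typescript
// Hypothetical shapes; everything except the 'pages' and 'resources' keys is assumed.
type PageRef = { id: string; url: string };
type ResourceRef = { path: string };
type QaSourceJson = { pages?: PageRef[]; resources?: ResourceRef[] };

type PageSource =
  | { kind: "pages"; pages: PageRef[] }
  | { kind: "resources"; waczPaths: string[] };

// Prefer an explicit page list; otherwise fall back to the listed WACZ files,
// whose own page lists would then be read one by one.
function choosePageSource(json: QaSourceJson): PageSource {
  if (json.pages !== undefined) {
    return { kind: "pages", pages: json.pages };
  }
  const waczPaths = (json.resources ?? []).map((r) => r.path);
  return { kind: "resources", waczPaths };
}
```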

Commits on Mar 13, 2024

  1. cleanup

    ikreymer committed Mar 13, 2024
    ceffad9

Commits on Mar 16, 2024

  1. Merge branch 'main' into qa-crawl-work

    bump to 1.1.0-beta.1
    ikreymer committed Mar 16, 2024
    251e1b3

Commits on Mar 19, 2024

  1. e4d8388
  2. re-add parseArgs import

    ikreymer committed Mar 19, 2024
    cb435f6

Commits on Mar 20, 2024

  1. 52f80d0
  2. more cleanup

    ikreymer committed Mar 20, 2024
    aee5af5
  3. b18148b

Commits on Mar 21, 2024

  1. ce2ffca
  2. f6a7dab

Commits on Mar 22, 2024

  1. tests: fix non-root user tests

    - disable redis retryStrategy, remove disconnect for redis to avoid unclosed handles
    - Dockerfile: fix permissions on downloaded files
    - add qa_compare test to non-root test as well
    - update jest to latest
    ikreymer committed Mar 22, 2024
    387e269
  2. tweak test ci steps

    ikreymer committed Mar 22, 2024
    cc5e130
  3. 4979d86
  4. lint fix

    ikreymer committed Mar 22, 2024
    c8dc60d
  5. type fix

    ikreymer committed Mar 22, 2024
    3c4f552
  6. ae9fdbe

Commits on Mar 23, 2024

  1. cdab557
  2. bump version to 1.1.0-beta.2

    ikreymer committed Mar 23, 2024
    a4ef485