
QA Crawl Support (Beta) #469

Merged
Merged 50 commits into main on Mar 23, 2024

Conversation

ikreymer
Member

@ikreymer ikreymer commented Feb 20, 2024

Initial support for QA crawling: crawling over an existing replay to generate QA/comparison data.
Can be deployed with the webrecorder/browsertrix-crawler `qa` entrypoint.

Requires --qaSource, pointing to the WACZ or multi-WACZ JSON that will be QA'd.

Also supports --qaRedisKey, a Redis key to which QA comparison data will be pushed, if specified.
Supports --qaDebugImageDiff for outputting crawl / replay / diff images.

The data pushed to Redis is `{"url": "<page url>", "comparison": {...}}`, where `comparison` is:

  comparison: {
    screenshotMatch?: number;
    textMatch?: number;
    resourceCounts: {
      crawlGood?: number;
      crawlBad?: number;
      replayGood?: number;
      replayBad?: number;
    };
  };
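The comparison payload above can be modeled directly in TypeScript. The following is a minimal sketch based on the schema shown; the type names and the `needsReview` helper (with its 0.9 threshold) are illustrative, not part of the crawler itself:

```typescript
// Sketch of the QA comparison record described above.
// Type names and the needsReview() helper are hypothetical.
interface ResourceCounts {
  crawlGood?: number;
  crawlBad?: number;
  replayGood?: number;
  replayBad?: number;
}

interface PageComparison {
  screenshotMatch?: number;
  textMatch?: number;
  resourceCounts: ResourceCounts;
}

interface QAPageEntry {
  url: string;
  comparison: PageComparison;
}

// Flag pages whose screenshot or text similarity falls below a threshold.
function needsReview(entry: QAPageEntry, threshold = 0.9): boolean {
  const { screenshotMatch, textMatch } = entry.comparison;
  return (
    (screenshotMatch !== undefined && screenshotMatch < threshold) ||
    (textMatch !== undefined && textMatch < threshold)
  );
}

const sample: QAPageEntry = JSON.parse(
  '{"url": "https://example.com/", "comparison": {"screenshotMatch": 0.75, "textMatch": 0.98, "resourceCounts": {"crawlGood": 10, "replayGood": 9, "replayBad": 1}}}',
);
```

A consumer popping these records from the Redis key could parse each entry this way and surface low-scoring pages for manual review.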

…downPage, instead of a single crawlPage function.

setupPage / teardownPage are called when a page is created / destroyed
keep track of page resources
crawler: add overridable _addInitialSeeds() function
crawler: store archivesDir
reload RWP frame if not loaded in SW after 10 secs
support max replay pages via --limit
store 'pageinfo' records in info.warc.gz
make skipping first N text docs configurable, set to 2 for replaycrawler, 0 by default
tests: fix tests due to missing text
reload timeout: track per page
add 'qa' entrypoint to crawler which enables qa mode
…pLink to allow reloading

resources: filter out POST requests
loading: add check for WACZ loading if resources is not available
add --qaDebugImageDiff to enable per-page crawl.png / replay.png / diff.png output
support qaSource from file system (via blob), as well as URL
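The screenshotMatch score reported in the comparison data is a similarity ratio between the crawl and replay screenshots. As an illustrative sketch only (not the crawler's actual algorithm), a pixel-level score over raw RGBA buffers could look like:

```typescript
// Hypothetical screenshot match score: the fraction of RGBA pixels whose
// channels all fall within a tolerance. Illustrative, not the real algorithm.
function screenshotMatchScore(
  crawl: Uint8Array,
  replay: Uint8Array,
  tolerance = 8,
): number {
  if (crawl.length !== replay.length || crawl.length % 4 !== 0) {
    throw new Error("buffers must be equal-length RGBA data");
  }
  const totalPixels = crawl.length / 4;
  let matching = 0;
  for (let i = 0; i < crawl.length; i += 4) {
    let same = true;
    for (let c = 0; c < 4; c++) {
      if (Math.abs(crawl[i + c] - replay[i + c]) > tolerance) {
        same = false;
        break;
      }
    }
    if (same) matching++;
  }
  return matching / totalPixels;
}
```

A score of 1.0 would mean the crawl and replay screenshots are (within tolerance) identical; with --qaDebugImageDiff enabled, the per-page crawl.png / replay.png / diff.png output makes mismatches inspectable directly.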
@ikreymer ikreymer requested a review from tw4l February 20, 2024 17:40
…urable, using CDN version

replayserver: if local file path specified, support serving local file under /source.{wacz,json}, support range requests
Contributor

@tw4l tw4l left a comment


This is looking good! Tested with local WACZs as well as serving files via an http server.

I've left some comments and suggestions (mostly adding some logging and removing commented-out bits that are no longer necessary).

We could use some tests - possibly an integration test to start that runs a crawl, then QAs the crawl, and checks the outputs?

(Review threads on src/replaycrawler.ts and src/util/state.ts, all resolved.)
@tw4l
Contributor

tw4l commented Feb 21, 2024

Could we also add the page id to the data pushed to Redis, just to help with matching in Browsertrix?

@ikreymer ikreymer changed the base branch from dev-1.0.0 to main March 7, 2024 22:22
- ensure original pageid is used for qa'd pages
- use the standard ':qa' key to write QA comparison data with --qaWriteToRedis
- print crawl stats in qa
- include title + favicons in qa
- add pageEntryForRedis() overridable in replaycrawler to add 'comparison' data
- add separate type for ComparisonData
- add comparison data for processPageInfo, if pagestate is available
- additional type fixes
- remove --qaWriteToRedis, now included with page data
@ikreymer
Member Author

> Could we also add the page id to the data pushed to Redis, just to help with matching in Browsertrix?

The QA data is now merged with the page data, so it should already be in one place.

@ikreymer ikreymer marked this pull request as ready for review March 20, 2024 19:06
ikreymer added a commit to webrecorder/browsertrix that referenced this pull request Mar 21, 2024
Supports running QA Runs via the QA API!

Builds on top of the `issue-1498-crawl-qa-backend-support` branch, fixes
#1498

Also requires the latest Browsertrix Crawler 1.1.0+ (from
webrecorder/browsertrix-crawler#469 branch)

Notable changes:
- QARun objects contain info about QA runs, which are crawls
performed on data loaded from existing crawls.

- Various crawl db operations can be performed on either the crawl or
`qa.` object, and core crawl fields have been moved to CoreCrawlable.

- While running, `QARun` data is stored in a single `qa` object, while
finished QA runs are added to the `qaFinished` dictionary on the Crawl. The
QA list API returns data from the finished list, sorted by most recent
first.

- Includes additional type fixes / type safety, especially around
BaseCrawl / Crawl / UploadedCrawl functionality, also creating specific
get_upload(), get_basecrawl(), get_crawl() getters for internal use and
get_crawl_out() for the API.

- Support filtering and sorting pages via `qaFilterBy` (screenshotMatch, textMatch) 
along with `gt`, `lt`, `gte`, `lte` params to return pages based on QA results.
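The `qaFilterBy` behavior described above amounts to a range filter over a numeric QA field. A minimal sketch of the filtering logic (names here are illustrative; in Browsertrix these are applied as API query parameters, not client-side code) could look like:

```typescript
// Hypothetical sketch of qaFilterBy-style range filtering over QA results.
type QAField = "screenshotMatch" | "textMatch";

interface QAPage {
  url: string;
  screenshotMatch?: number;
  textMatch?: number;
}

interface RangeFilter {
  gt?: number;
  lt?: number;
  gte?: number;
  lte?: number;
}

// Keep only pages whose chosen QA field satisfies every supplied bound.
// Pages missing the field are excluded.
function filterPages(pages: QAPage[], field: QAField, f: RangeFilter): QAPage[] {
  return pages.filter((p) => {
    const v = p[field];
    if (v === undefined) return false;
    if (f.gt !== undefined && !(v > f.gt)) return false;
    if (f.lt !== undefined && !(v < f.lt)) return false;
    if (f.gte !== undefined && !(v >= f.gte)) return false;
    if (f.lte !== undefined && !(v <= f.lte)) return false;
    return true;
  });
}
```

For example, `lt: 0.5` on `screenshotMatch` would select only pages whose replay screenshots diverged badly from the crawl, which is the typical QA triage query.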

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
@ikreymer ikreymer changed the title QA Crawl Support QA Crawl Support (Beta) Mar 23, 2024
@ikreymer ikreymer merged commit bb9c824 into main Mar 23, 2024
3 of 4 checks passed
@ikreymer ikreymer deleted the qa-crawl-work branch March 23, 2024 00:32