
QA Crawl Support (Beta) #469

Merged
Merged 50 commits into main on Mar 23, 2024

Conversation

ikreymer
Member

@ikreymer ikreymer commented Feb 20, 2024

Initial support for QA crawling: crawling over an existing replay to generate QA/comparison data.
Can be deployed with the webrecorder/browsertrix-crawler `qa` entrypoint.

Requires --qaSource, pointing to the WACZ or multi-WACZ JSON that will be QA'd.

Also supports --qaRedisKey, a Redis key to which QA comparison data will be pushed, if specified.
Supports --qaDebugImageDiff for outputting crawl / replay / diff images.

The data pushed to Redis is `{"url": "<page url>", "comparison": {...}}`, where `comparison` is:

  comparison: {
    screenshotMatch?: number;
    textMatch?: number;
    resourceCounts: {
      crawlGood?: number;
      crawlBad?: number;
      replayGood?: number;
      replayBad?: number;
    };
  };
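The comparison payload above can be modeled directly in TypeScript. The following is a minimal sketch based on the schema shown; the type names and the `needsReview` helper (with its 0.9 threshold) are illustrative, not part of the crawler itself:

```typescript
// Sketch of the QA comparison record described above.
// Type names and the needsReview() helper are hypothetical.
interface ResourceCounts {
  crawlGood?: number;
  crawlBad?: number;
  replayGood?: number;
  replayBad?: number;
}

interface PageComparison {
  screenshotMatch?: number;
  textMatch?: number;
  resourceCounts: ResourceCounts;
}

interface QAPageEntry {
  url: string;
  comparison: PageComparison;
}

// Flag pages whose screenshot or text similarity falls below a threshold.
function needsReview(entry: QAPageEntry, threshold = 0.9): boolean {
  const { screenshotMatch, textMatch } = entry.comparison;
  return (
    (screenshotMatch !== undefined && screenshotMatch < threshold) ||
    (textMatch !== undefined && textMatch < threshold)
  );
}

const sample: QAPageEntry = JSON.parse(
  '{"url": "https://example.com/", "comparison": {"screenshotMatch": 0.75, "textMatch": 0.98, "resourceCounts": {"crawlGood": 10, "replayGood": 9, "replayBad": 1}}}',
);
```

A consumer popping these records from the Redis key could parse each entry this way and surface low-scoring pages for manual review.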

…downPage, instead of a single crawlPage function.

setupPage / teardownPage are called when a page is created / destroyed
keep track of page resources
crawler: add overridable _addInitialSeeds() function
crawler: store archivesDir
reload RWP frame if not loaded in SW after 10 secs
support max replay pages via --limit
store 'pageinfo' records in info.warc.gz
make skipping first N text docs configurable, set to 2 for replaycrawler, 0 by default
tests: fix tests due to missing text
reload timeout: track per page
add 'qa' entrypoint to crawler which enables qa mode
…pLink to allow reloading

resources: filter out POST requests
loading: add check for WACZ loading if resources is not available
add --qaDebugImageDiff to enable per-page crawl.png / replay.png / diff.png output
support qaSource from file system (via blob), as well as URL
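The screenshotMatch score reported in the comparison data is a similarity ratio between the crawl and replay screenshots. As an illustrative sketch only (not the crawler's actual algorithm), a pixel-level score over raw RGBA buffers could look like:

```typescript
// Hypothetical screenshot match score: the fraction of RGBA pixels whose
// channels all fall within a tolerance. Illustrative, not the real algorithm.
function screenshotMatchScore(
  crawl: Uint8Array,
  replay: Uint8Array,
  tolerance = 8,
): number {
  if (crawl.length !== replay.length || crawl.length % 4 !== 0) {
    throw new Error("buffers must be equal-length RGBA data");
  }
  const totalPixels = crawl.length / 4;
  let matching = 0;
  for (let i = 0; i < crawl.length; i += 4) {
    let same = true;
    for (let c = 0; c < 4; c++) {
      if (Math.abs(crawl[i + c] - replay[i + c]) > tolerance) {
        same = false;
        break;
      }
    }
    if (same) matching++;
  }
  return matching / totalPixels;
}
```

A score of 1.0 would mean the crawl and replay screenshots are (within tolerance) identical; with --qaDebugImageDiff enabled, the per-page crawl.png / replay.png / diff.png output makes mismatches inspectable directly.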
@ikreymer ikreymer requested a review from tw4l February 20, 2024 17:40
…urable, using CDN version

replayserver: if local file path specified, support serving local file under /source.{wacz,json}, support range requests
Contributor

@tw4l tw4l left a comment


This is looking good! Tested with local WACZs as well as serving files via an http server.

I've left some comments and suggestions (mostly adding some logging and removing commented-out bits that are no longer necessary).

We could use some tests - possibly an integration test to start that runs a crawl, then QAs the crawl, and checks the outputs?

(Review threads on src/replaycrawler.ts and src/util/state.ts, all resolved.)
@tw4l
Contributor

tw4l commented Feb 21, 2024

Could we also add the page id to the data pushed to Redis, just to help with matching in Browsertrix?

@ikreymer ikreymer changed the base branch from dev-1.0.0 to main March 7, 2024 22:22
- ensure original pageid is used for qa'd pages
- use the standard ':qa' key to write QA comparison data with --qaWriteToRedis
- print crawl stats in qa
- include title + favicons in qa
- add pageEntryForRedis() overridable in replaycrawler to add 'comparison' data
- add separate type for ComparisonData
- add comparison data for processPageInfo, if pagestate is available
- additional type fixes
- remove --qaWriteToRedis, now included with page data
@ikreymer
Member Author

> Could we also add the page id to the data pushed to Redis, just to help with matching in Browsertrix?

The QA data is now merged with the page data, so it should already be in one place.

@ikreymer ikreymer marked this pull request as ready for review March 20, 2024 19:06
ikreymer added a commit to webrecorder/browsertrix that referenced this pull request Mar 21, 2024
Supports running QA Runs via the QA API!

Builds on top of the `issue-1498-crawl-qa-backend-support` branch, fixes
#1498

Also requires the latest Browsertrix Crawler 1.1.0+ (from
webrecorder/browsertrix-crawler#469 branch)

Notable changes:
- QARun objects contain info about QA runs, which are crawls
performed on data loaded from existing crawls.

- Various crawl db operations can be performed on either the crawl or
`qa.` object, and core crawl fields have been moved to CoreCrawlable.

- While running, `QARun` data is stored in a single `qa` object, while
finished QA runs are added to the `qaFinished` dictionary on the Crawl. The
QA list API returns data from the finished list, sorted by most recent
first.

- Includes additional type fixes / type safety, especially around
BaseCrawl / Crawl / UploadedCrawl functionality, also creating specific
get_upload(), get_basecrawl(), get_crawl() getters for internal use and
get_crawl_out() for the API.

- Support filtering and sorting pages via `qaFilterBy` (screenshotMatch, textMatch) 
along with `gt`, `lt`, `gte`, `lte` params to return pages based on QA results.
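The `qaFilterBy` behavior described above amounts to a range filter over a numeric QA field. A minimal sketch of the filtering logic (names here are illustrative; in Browsertrix these are applied as API query parameters, not client-side code) could look like:

```typescript
// Hypothetical sketch of qaFilterBy-style range filtering over QA results.
type QAField = "screenshotMatch" | "textMatch";

interface QAPage {
  url: string;
  screenshotMatch?: number;
  textMatch?: number;
}

interface RangeFilter {
  gt?: number;
  lt?: number;
  gte?: number;
  lte?: number;
}

// Keep only pages whose chosen QA field satisfies every supplied bound.
// Pages missing the field are excluded.
function filterPages(pages: QAPage[], field: QAField, f: RangeFilter): QAPage[] {
  return pages.filter((p) => {
    const v = p[field];
    if (v === undefined) return false;
    if (f.gt !== undefined && !(v > f.gt)) return false;
    if (f.lt !== undefined && !(v < f.lt)) return false;
    if (f.gte !== undefined && !(v >= f.gte)) return false;
    if (f.lte !== undefined && !(v <= f.lte)) return false;
    return true;
  });
}
```

For example, `lt: 0.5` on `screenshotMatch` would select only pages whose replay screenshots diverged badly from the crawl, which is the typical QA triage query.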

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
@ikreymer ikreymer changed the title QA Crawl Support QA Crawl Support (Beta) Mar 23, 2024
@ikreymer ikreymer merged commit bb9c824 into main Mar 23, 2024
3 of 4 checks passed
@ikreymer ikreymer deleted the qa-crawl-work branch March 23, 2024 00:32