QA Crawl Support (Beta) #469
Conversation
…downPage, instead of a single crawlPage function. setupPage / teardownPage are called when a page is created / destroyed
keep track of page resources
crawler: add overridable _addInitialSeeds() function
crawler: store archivesDir
reload RWP frame if not loaded in SW after 10 secs
support max replay pages via --limit
store 'pageinfo' records in info.warc.gz
make skipping first N text docs configurable, set to 2 for replaycrawler, 0 by default
tests: fix tests due to missing text
reload timeout: track per page
add 'qa' entrypoint to crawler which enables qa mode
(not yet storing)
…pLink to allow reloading resources: filter out POST requests
loading: add check for WACZ loading if resources is not available
add --qaDebugImageDiff to enable per-page crawl.png / replay.png / diff.png output
support qaSource from file system (via blob), as well as URL
…urable, using CDN version
replayserver: if local file path specified, support serving local file under /source.{wacz,json}, support range requests
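The replayserver change above adds range-request support when serving a local WACZ. As an illustration of what handling a single-range `Range` header involves (a sketch only — the crawler's actual TypeScript replayserver implementation may differ), consider:

```python
def parse_range(header: str, file_size: int):
    """Parse a single-range 'Range' header like 'bytes=0-1023'.

    Returns an inclusive (start, end) byte span, or None if the header
    is absent or malformed (in which case the whole file is served).
    """
    if not header or not header.startswith("bytes="):
        return None
    spec = header[len("bytes="):]
    if "," in spec:  # multi-range requests not handled in this sketch
        return None
    start_s, _, end_s = spec.partition("-")
    if start_s == "":  # suffix range: last N bytes of the file
        length = int(end_s)
        return max(file_size - length, 0), file_size - 1
    start = int(start_s)
    end = int(end_s) if end_s else file_size - 1
    return start, min(end, file_size - 1)

print(parse_range("bytes=0-1023", 4096))  # (0, 1023)
print(parse_range("bytes=-500", 4096))    # (3596, 4095)
```

The server then responds with `206 Partial Content` and a matching `Content-Range` header for the returned span.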
This is looking good! Tested with local WACZs as well as serving files via an http server.
I've left some comments and suggestions (suggestions are mostly adding some logging and removing some commented out bits that are no longer necessary).
We could use some tests - possibly an integration test to start that runs a crawl, then QAs the crawl, and checks the outputs?
…ading from cdn during crawl time
Could we also add the page id to the data pushed to Redis, just to help with matching in Browsertrix?
- ensure original pageid is used for qa'd pages
- use standard ':qa' key to write qa comparison data to with --qaWriteToRedis
- print crawl stats in qa
- include title + favicons in qa
- add pageEntryForRedis() overridable in replaycrawler to add 'comparison' data
- add separate type for ComparisonData
- add comparison data for processPageInfo, if pagestate is available
- additional type fixes
- remove --qaWriteToRedis, now included with page data
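Since the comparison data is now included with the page data pushed to Redis, a consumer only needs to decode one JSON entry per page. A minimal sketch of such a consumer — the exact comparison field names (e.g. `screenshotMatch`) are an assumption here, not confirmed by this PR:

```python
import json

def decode_qa_entry(raw: str):
    """Decode one QA entry pushed to the Redis key given via --qaRedisKey.

    Each entry is a JSON object with at least the page 'url' and a
    'comparison' object; other fields (e.g. a page id) may also be
    present depending on crawler version.
    """
    entry = json.loads(raw)
    return entry["url"], entry.get("comparison", {})

# illustrative payload shaped like the format described in the PR;
# the comparison keys below are hypothetical
raw = '{"url": "https://example.com/", "comparison": {"screenshotMatch": 0.98}}'
url, comparison = decode_qa_entry(raw)
print(url, comparison)
```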
support parsing out the query string when detecting file type
… specified, otherwise load from 'resources'
…g WACZ from 'localhost' for replay QA
bump to 1.1.0-beta.1
The QA data is now merged with the page data, so it should already be in one place.
Supports running QA Runs via the QA API! Builds on top of the `issue-1498-crawl-qa-backend-support` branch, fixes #1498

Also requires the latest Browsertrix Crawler 1.1.0+ (from webrecorder/browsertrix-crawler#469 branch)

Notable changes:
- QARun objects contain info about QA runs, which are crawls performed on data loaded from existing crawls.
- Various crawl db operations can be performed on either the crawl or `qa.` object, and core crawl fields have been moved to CoreCrawlable.
- While running, `QARun` data is stored in a single `qa` object, while finished QA runs are added to the `qaFinished` dictionary on the Crawl. The QA list API returns data from the finished list, sorted by most recent first.
- Includes additional type fixes / type safety, especially around BaseCrawl / Crawl / UploadedCrawl functionality, also creating specific get_upload(), get_basecrawl(), get_crawl() getters for internal use and get_crawl_out() for the API.
- Support filtering and sorting pages via `qaFilterBy` (screenshotMatch, textMatch) along with `gt`, `lt`, `gte`, `lte` params to return pages based on QA results.

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
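The page-filtering support described above combines a `qaFilterBy` field with `gt` / `lt` / `gte` / `lte` threshold params. As a sketch of composing such a query string for the pages API (parameter names from the PR description; the specific endpoint and value formats are assumptions):

```python
from urllib.parse import urlencode

def qa_pages_query(filter_by: str, **bounds) -> str:
    """Build a query string for filtering QA'd pages by result.

    filter_by is e.g. 'screenshotMatch' or 'textMatch'; bounds are any
    of the gt / lt / gte / lte threshold params described in the PR.
    """
    allowed = {"gt", "lt", "gte", "lte"}
    params = {"qaFilterBy": filter_by}
    for key, val in bounds.items():
        if key not in allowed:
            raise ValueError(f"unsupported bound param: {key}")
        params[key] = val
    return urlencode(params)

print(qa_pages_query("screenshotMatch", gte=0.9))
# qaFilterBy=screenshotMatch&gte=0.9
```

For example, the query above would return pages whose screenshot comparison score is at least 0.9.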
- disable redis retryStrategy
- remove disconnect for redis to avoid unclosed handles
- Dockerfile: fix permissions on downloaded files
- add qa_compare test to non-root test as well
- update jest to latest
Initial support for QA crawling, crawling over an existing replay to generate QA/comparison data.
Can be deployed with the `webrecorder/browsertrix-crawler` image using the `qa` entrypoint. Requires `--qaSource`, pointing to a WACZ or multi-WACZ JSON that will be QA'd.

Also supports `--qaRedisKey`, where QA comparison data will be pushed, if specified.

Supports `--qaDebugImageDiff` for outputting crawl / replay / diff images.

The data pushed to redis is `{"url": "<page url>", "comparison": <...>}`, where comparison is: