QA Runs Initial Backend Implementation #1586
Conversation
and treat as 'crawl_or_qa_run_id' and 'qa_source_crawl_id' in crawl ops
- rename original get_crawl -> get_crawl_out as it's used for api response
- basecrawls has get_base_crawl, get_crawl_raw and get_crawl_out
- crawls has get_crawl
- uploads has get_upload
qa runs work!
- active qa data added to 'qa'
- finished qa later added to 'qa_finished' when crawl finished
- active qa cleared on operator finalize
- default text extract to 'to-warc' for qa support
fix qa replay: additional type fixes; look up Org if missing, needed for resolving presigned URLs
- ensure .from_dict() has proper generic type
- use typed Crawl objects for better type safety
- add tags to BaseCrawl
- fix delete_crawls() typing
- test: update for page model change
…d to finished, automatically set ttl for qa crawljobs to 0 to finalize more quickly; add shared 'qaCrawlExecSeconds' to crawl to track shared qa exec time usage
Some initial comments.
Initial testing shows that some pages aren't getting updated, possibly if they redirect to another page.
- /qa/<qa_run_id>/pages
- /activeQA
- split Page model into core Page, Page with all qa, and Page with single QA data
- support filtering with qa_run_id, lte / gte, qa_range_field query args
- fix typo in qa stats, use float instead of int!
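The qa_range_field / gte / lte filtering described above could be translated into a MongoDB query roughly like the sketch below. The field layout (`qa.<run_id>.<field>`) and function name are assumptions for illustration, not the actual Browsertrix schema.

```python
def build_qa_range_query(qa_run_id, field=None, gte=None, lte=None):
    """Sketch: build a Mongo-style filter for pages that have QA data
    from a given run, optionally bounded on one numeric QA field."""
    # only match pages that have results for this QA run
    query = {f"qa.{qa_run_id}": {"$exists": True}}
    if field:
        bounds = {}
        if gte is not None:
            bounds["$gte"] = gte
        if lte is not None:
            bounds["$lte"] = lte
        if bounds:
            query[f"qa.{qa_run_id}.{field}"] = bounds
    return query
```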
- fix qa run stats update while running
- fix qa run pages, add filterBy, gte, gt, lte and lt query args
- include active QA in QA list, update model
… match pages.jsonl data model
use 1.1.0-beta.0 crawler for QA support
Per discussion, additional APIs have been added / modified: .../activeQA - returns only the active QA
…rom both crawl and pages
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
add resolve_internal_access_path() to storages which prepends the frontend_origin in get_internal_crawl_out()
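The commit above describes prepending the frontend origin to access paths for internal crawl output. A minimal sketch of what such a helper might look like, assuming it takes the path and origin directly (the real signature in storages is not shown here):

```python
def resolve_internal_access_path(path: str, frontend_origin: str) -> str:
    """Sketch: prepend the internal frontend origin to a relative
    access path so it can be resolved inside the cluster."""
    # already-absolute URLs (e.g. presigned URLs) pass through unchanged
    if path.startswith("http://") or path.startswith("https://"):
        return path
    return frontend_origin.rstrip("/") + "/" + path.lstrip("/")
```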
Looks good! Been testing on dev and locally, and went through the code again. Made a suggestion for an extra test to add but otherwise I think we're good to merge to main!
The other thing we discussed but decided to do in a follow-up is to add a check before starting QA runs that the crawl was not created with a 0.x.x version of the crawler, and to return an appropriate status code / not start the QA run if so. We can check Crawl.image for this.
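The follow-up check on Crawl.image suggested above could be sketched as a simple major-version test on the image tag. This is an illustrative helper, not code from the PR; real tags like `1.1.0-beta.0` and registry quirks would need more careful handling.

```python
def is_qa_supported(image: str) -> bool:
    """Sketch: return True if the crawler image tag appears to be 1.x
    or later, False for 0.x.x or untagged/unparseable images."""
    if ":" not in image:
        return False  # no tag, can't verify version
    tag = image.rsplit(":", 1)[-1]
    major = tag.lstrip("v").split(".", 1)[0]
    return major.isdigit() and int(major) >= 1
```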
… as well by calling inc_crawl_complete_stats for both regular crawls and qa runs
Supports running QA Runs via the QA API:
Builds on top of the issue-1498-crawl-qa-backend-support branch, fixes #1498. Also requires the latest Browsertrix Crawler from the webrecorder/browsertrix-crawler#469 branch.
Notable changes:
- Various crawl db operations can be performed on either the Crawl or QARun object, and core crawl fields have been moved to CoreCrawlable.
- While running, QARun data is stored in a single `qa` object, while finished QA runs are added to the `qaFinished` dictionary on the Crawl. The QA list API returns data from the finished list, sorted by most recent first.
- Includes additional type fixes / type safety, especially around BaseCrawl / Crawl / UploadedCrawl functionality, also creating specific get_upload(), get_basecrawl(), get_crawl() getters for internal use and get_crawl_out() for API responses.
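The `qa` / `qaFinished` lifecycle described above can be sketched with plain dicts: on finish, the active run moves into the finished dictionary keyed by run id, and the active slot is cleared. Field names and the `finish_qa_run` helper are assumptions based on the PR text, not the actual implementation.

```python
def finish_qa_run(crawl: dict) -> dict:
    """Sketch: move the active 'qa' entry into 'qaFinished',
    keyed by QA run id, and clear the active slot."""
    qa = crawl.pop("qa", None)
    if qa:
        crawl.setdefault("qaFinished", {})[qa["id"]] = {**qa, "state": "complete"}
    crawl["qa"] = None  # no active QA run after finalize
    return crawl
```

The QA list API would then read from `qaFinished` (plus any active run), sorted by most recent first, as the description states.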
Still need: