
QA Runs Initial Backend Implementation #1586

Merged
merged 52 commits into from Mar 21, 2024

Conversation

@ikreymer (Member) commented Mar 11, 2024

Supports running QA Runs via the QA API:

[Screenshot: a QA run started via the QA API]

Builds on top of the issue-1498-crawl-qa-backend-support branch, fixes #1498

Also requires the latest Browsertrix Crawler from the webrecorder/browsertrix-crawler#469 branch.

Notable changes:

  • Various crawl db operations can be performed on either the crawl or qa object, and core crawl fields have been moved to CoreCrawlable.

  • While running, QA run data is stored in a single qa object, while finished QA runs are added to the qaFinished dictionary on the Crawl. The QA list API returns data from the finished list, sorted most recent first.

  • Includes additional type fixes / type safety, especially around BaseCrawl / Crawl / UploadedCrawl functionality, adding specific get_upload(), get_basecrawl(), and get_crawl() getters for internal use and get_crawl_out() for the API.
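The qa / qaFinished layout described above can be sketched roughly as follows. This is a hypothetical illustration using dataclasses; the actual models in btrixcloud use different base classes and carry many more fields:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class CoreCrawlable:
    """Core crawl fields shared by regular crawls and QA runs (sketch)."""
    id: str
    state: str = "running"

@dataclass
class QARun(CoreCrawlable):
    """A single QA run over an existing crawl (sketch)."""
    started: Optional[str] = None
    finished: Optional[str] = None

@dataclass
class Crawl(CoreCrawlable):
    # active QA run, if any, stored in a single object
    qa: Optional[QARun] = None
    # finished QA runs, keyed by QA run id
    qaFinished: Dict[str, QARun] = field(default_factory=dict)

crawl = Crawl(id="crawl-1")
crawl.qa = QARun(id="qa-1", started="2024-03-11T00:00:00Z")

# when the QA run finishes, it moves into qaFinished and `qa` is cleared
crawl.qa.finished = "2024-03-11T01:00:00Z"
crawl.qaFinished[crawl.qa.id] = crawl.qa
crawl.qa = None
```

A QA list endpoint would then read from qaFinished (sorted most recent first), so completed runs survive independently of whichever run is currently active.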

Still need:

  • Tests!
  • Determining how active QA data is returned (separate API or part of finished list, or just part of crawl).
  • Filter page data by a single QA run

tw4l and others added 23 commits March 5, 2024 11:16
and treat as 'crawl_or_qa_run_id' and 'qa_source_crawl_id' in crawl ops
- rename original get_crawl -> get_crawl_out as it's used for api response
- basecrawls has get_base_crawl and get_crawl_raw and get_crawl_out
- crawls has get_crawl
- uploads has get_uploads
- active qa data added to 'qa'
- finished qa later added to 'qa_finished' when crawl finished
- active qa cleared on operator finalize
- default text extract to 'to-warc' for qa support
fix qa replay
additional type fixes, look up Org if missing, needed for resolving presigned URLs
- ensure .from_dict() has proper generic type
- use type Crawl objects for better type safety
- add tags to BaseCrawl
- fix delete_crawls() typing

test: update for page model change
…d to finished automatically

set ttl for qa crawljobs to 0 to finalize more quickly
add shared 'qaCrawlExecSeconds' to crawl to track shared qa exec time usage
@tw4l (Contributor) left a comment

Some initial comments.

Initial testing shows that some pages aren't getting updated, possibly if they redirect to another page.

Review threads (resolved): backend/btrixcloud/basecrawls.py, backend/btrixcloud/crawls.py, backend/btrixcloud/db.py, backend/btrixcloud/main.py, backend/btrixcloud/operator/models.py
ikreymer and others added 5 commits March 12, 2024 08:04
- /qa/<qa_run_id>/pages
- /activeQA

- split Page model into core Page, Page with all qa, and Page with single QA data
- support filtering with qa_run_id, lte / gte, qa_range_field query args

- fix typo in qa stats, use float instead of int!
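The qa_run_id / gte / lte filtering mentioned in these commits can be sketched as a query builder. This is a hypothetical helper, not the backend's actual implementation; the real query construction and field names may differ:

```python
from typing import Optional

def build_qa_page_query(
    qa_run_id: str,
    filter_by: str = "screenshotMatch",
    gte: Optional[float] = None,
    lte: Optional[float] = None,
) -> dict:
    """Build a MongoDB-style query matching pages from one QA run,
    optionally restricted to a score range on one QA field (sketch)."""
    # only pages that have data for this QA run
    query: dict = {f"qa.{qa_run_id}": {"$exists": True}}

    range_cond: dict = {}
    if gte is not None:
        range_cond["$gte"] = gte  # QA scores are floats, not ints
    if lte is not None:
        range_cond["$lte"] = lte
    if range_cond:
        query[f"qa.{qa_run_id}.{filter_by}"] = range_cond
    return query

# e.g. pages whose text match score for run "qa-1" is between 0.5 and 1.0
q = build_qa_page_query("qa-1", filter_by="textMatch", gte=0.5, lte=1.0)
```

Splitting the Page model (core Page, Page with all QA data, Page with a single run's QA data) then lets each endpoint return only the slice such a query selects.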
ikreymer and others added 7 commits March 12, 2024 12:49
- fix qa run stats update while running
- fix qa run pages, add filterBy, gte, gt, lte and lt query args
- include active QA in QA list, update model
use 1.1.0-beta.0 crawler for QA support
@ikreymer (Member, Author) commented:

Per discussion, additional APIs have been added / modified:

.../activeQA - returns only the active QA
.../qa - list includes activeQA data first.
.../qa/{qa_run_id}/pages - filters pages by qa_run_id, also supports filterBy=[screenshotMatch | textMatch] query arg, and gt=, gte=, lt=, lte= query args for filtering pages by specific ranges.
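A call to the pages endpoint with these query args might be assembled like this. The base path and ids below are placeholders, not the documented API route:

```python
from urllib.parse import urlencode

# hypothetical path template; the real route prefix may differ
base = "/api/orgs/{oid}/crawls/{crawl_id}/qa/{qa_run_id}/pages"
path = base.format(oid="org-1", crawl_id="crawl-1", qa_run_id="qa-1")

# pages with a screenshot match score between 0.9 and 1.0
params = urlencode({"filterBy": "screenshotMatch", "gte": 0.9, "lte": 1.0})
url = f"{path}?{params}"
```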

ikreymer and others added 2 commits March 13, 2024 01:42
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
@ikreymer ikreymer changed the title QA Runs Initial Implementation QA Runs Initial Backend Implementation Mar 13, 2024
@ikreymer ikreymer marked this pull request as ready for review March 13, 2024 15:17
@tw4l (Contributor) left a comment

Looks good! Been testing on dev and locally, and went through the code again. Made a suggestion for an extra test to add but otherwise I think we're good to merge to main!

The other thing we discussed but decided to do in a follow-up is to add a check before starting QA runs that the crawl was not created with a 0.x.x version of the crawler, and to return an appropriate status code/not start the QA run if so. We can check Crawl.image for this.
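That follow-up check could look something like the sketch below: parse the crawler version out of the image tag and refuse to start a QA run for pre-1.0 crawls. This is a hypothetical helper, assuming image tags like "webrecorder/browsertrix-crawler:1.1.0-beta.0"; the eventual implementation may parse Crawl.image differently:

```python
import re

def supports_qa(image: str) -> bool:
    """Return True if the crawler image tag is 1.x or later (sketch)."""
    match = re.search(r":(\d+)\.", image or "")
    if not match:
        return False  # unknown or untagged image: be conservative
    return int(match.group(1)) >= 1

ok = supports_qa("webrecorder/browsertrix-crawler:1.1.0-beta.0")
old = supports_qa("webrecorder/browsertrix-crawler:0.12.4")
```

The API handler would then return an error status (e.g. 400) instead of launching the QA run when the check fails.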

… as well

by calling inc_crawl_complete_stats for both regular crawls and qa runs
@ikreymer ikreymer merged commit 4f676e4 into main Mar 21, 2024
4 checks passed
@ikreymer ikreymer deleted the qa-run branch March 21, 2024 05:42
Successfully merging this pull request may close these issues.

QA Backend: Add support for Crawl QA jobs