
QA Runs Initial Backend Implementation #1586

Merged
merged 52 commits into from Mar 21, 2024

Conversation

@ikreymer (Member) commented Mar 11, 2024

Supports running QA Runs via the QA API:

[Screenshot: a QA run started via the QA API]

Builds on top of the issue-1498-crawl-qa-backend-support branch, fixes #1498

Also requires the latest Browsertrix Crawler from the webrecorder/browsertrix-crawler#469 branch.

Notable changes:

  • Various crawl db operations can be performed on either the crawl or qa object, and core crawl fields have been moved to CoreCrawlable.

  • While running, QA run data is stored in a single qa object, while finished QA runs are added to the qaFinished dictionary on the Crawl. The QA list API returns data from the finished list, sorted most recent first.

  • Includes additional type fixes / type safety, especially around BaseCrawl / Crawl / UploadedCrawl functionality, adding specific get_upload(), get_basecrawl(), and get_crawl() getters for internal use and get_crawl_out() for the API.
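The qa / qaFinished layout described above can be sketched roughly as follows. This is a hypothetical illustration using dataclasses; the actual models in btrixcloud use different base classes and carry many more fields:

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class CoreCrawlable:
    """Core crawl fields shared by regular crawls and QA runs (sketch)."""
    id: str
    state: str = "running"

@dataclass
class QARun(CoreCrawlable):
    """A single QA run over an existing crawl (sketch)."""
    started: Optional[str] = None
    finished: Optional[str] = None

@dataclass
class Crawl(CoreCrawlable):
    # active QA run, if any, stored in a single object
    qa: Optional[QARun] = None
    # finished QA runs, keyed by QA run id
    qaFinished: Dict[str, QARun] = field(default_factory=dict)

crawl = Crawl(id="crawl-1")
crawl.qa = QARun(id="qa-1", started="2024-03-11T00:00:00Z")

# when the QA run finishes, it moves into qaFinished and `qa` is cleared
crawl.qa.finished = "2024-03-11T01:00:00Z"
crawl.qaFinished[crawl.qa.id] = crawl.qa
crawl.qa = None
```

A QA list endpoint would then read from qaFinished (sorted most recent first), so completed runs survive independently of whichever run is currently active.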

Still need:

  • Tests!
  • Determining how active QA data is returned (separate API or part of finished list, or just part of crawl).
  • Filter page data by a single QA run

tw4l and others added 23 commits March 5, 2024 11:16
and treat as 'crawl_or_qa_run_id' and 'qa_source_crawl_id' in crawl ops
- rename original get_crawl -> get_crawl_out as it's used for api response
- basecrawls has get_base_crawl and get_crawl_raw and get_crawl_out
- crawls has get_crawl
- uploads has get_uploads
- active qa data added to 'qa'
- finished qa later added to 'qa_finished' when crawl finished
- active qa cleared on operator finalize
- default text extract to 'to-warc' for qa support
fix qa replay
additional type fixes, look up Org if missing, needed for resolving presigned URLs
- ensure .from_dict() has proper generic type
- use type Crawl objects for better type safety
- add tags to BaseCrawl
- fix delete_crawls() typing

test: update for page model change
…d to finished automatically

set ttl for qa crawljobs to 0 to finalize more quickly
add shared 'qaCrawlExecSeconds' to crawl to track shared qa exec time usage
@tw4l (Contributor) left a comment

Some initial comments.

Initial testing shows that some pages aren't getting updated, possibly if they redirect to another page.

Review threads (resolved): backend/btrixcloud/basecrawls.py, backend/btrixcloud/crawls.py, backend/btrixcloud/db.py, backend/btrixcloud/main.py, backend/btrixcloud/operator/models.py
ikreymer and others added 5 commits March 12, 2024 08:04
- /qa/<qa_run_id>/pages
- /activeQA

- split Page model into core Page, Page with all qa, and Page with single QA data
- support filtering with qa_run_id, lte / gte, qa_range_field query args

- fix typo in qa stats, use float instead of int!
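The qa_run_id / gte / lte filtering mentioned in these commits can be sketched as a query builder. This is a hypothetical helper, not the backend's actual implementation; the real query construction and field names may differ:

```python
from typing import Optional

def build_qa_page_query(
    qa_run_id: str,
    filter_by: str = "screenshotMatch",
    gte: Optional[float] = None,
    lte: Optional[float] = None,
) -> dict:
    """Build a MongoDB-style query matching pages from one QA run,
    optionally restricted to a score range on one QA field (sketch)."""
    # only pages that have data for this QA run
    query: dict = {f"qa.{qa_run_id}": {"$exists": True}}

    range_cond: dict = {}
    if gte is not None:
        range_cond["$gte"] = gte  # QA scores are floats, not ints
    if lte is not None:
        range_cond["$lte"] = lte
    if range_cond:
        query[f"qa.{qa_run_id}.{filter_by}"] = range_cond
    return query

# e.g. pages whose text match score for run "qa-1" is between 0.5 and 1.0
q = build_qa_page_query("qa-1", filter_by="textMatch", gte=0.5, lte=1.0)
```

Splitting the Page model (core Page, Page with all QA data, Page with a single run's QA data) then lets each endpoint return only the slice such a query selects.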
ikreymer and others added 7 commits March 12, 2024 12:49
- fix qa run stats update while running
- fix qa run pages, add filterBy, gte, gt, lte and lt query args
- include active QA in QA list, update model
use 1.1.0-beta.0 crawler for QA support
@ikreymer (Member, Author) commented:

Per discussion, additional APIs have been added / modified:

.../activeQA - returns only the active QA
.../qa - list includes activeQA data first.
.../qa/{qa_run_id}/pages - filters pages by qa_run_id, also supports filterBy=[screenshotMatch | textMatch] query arg, and gt=, gte=, lt=, lte= query args for filtering pages by specific ranges.
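A call to the pages endpoint with these query args might be assembled like this. The base path and ids below are placeholders, not the documented API route:

```python
from urllib.parse import urlencode

# hypothetical path template; the real route prefix may differ
base = "/api/orgs/{oid}/crawls/{crawl_id}/qa/{qa_run_id}/pages"
path = base.format(oid="org-1", crawl_id="crawl-1", qa_run_id="qa-1")

# pages with a screenshot match score between 0.9 and 1.0
params = urlencode({"filterBy": "screenshotMatch", "gte": 0.9, "lte": 1.0})
url = f"{path}?{params}"
```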

ikreymer and others added 2 commits March 13, 2024 01:42
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
@ikreymer ikreymer changed the title QA Runs Initial Implementation QA Runs Initial Backend Implementation Mar 13, 2024
@ikreymer ikreymer marked this pull request as ready for review March 13, 2024 15:17
@tw4l (Contributor) left a comment

Looks good! Been testing on dev and locally, and went through the code again. Made a suggestion for an extra test to add but otherwise I think we're good to merge to main!

The other thing we discussed but decided to do in a follow-up is to add a check before starting QA runs that the crawl was not created with a 0.x.x version of the crawler, and to return an appropriate status code/not start the QA run if so. We can check Crawl.image for this.
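That follow-up check could look something like the sketch below: parse the crawler version out of the image tag and refuse to start a QA run for pre-1.0 crawls. This is a hypothetical helper, assuming image tags like "webrecorder/browsertrix-crawler:1.1.0-beta.0"; the eventual implementation may parse Crawl.image differently:

```python
import re

def supports_qa(image: str) -> bool:
    """Return True if the crawler image tag is 1.x or later (sketch)."""
    match = re.search(r":(\d+)\.", image or "")
    if not match:
        return False  # unknown or untagged image: be conservative
    return int(match.group(1)) >= 1

ok = supports_qa("webrecorder/browsertrix-crawler:1.1.0-beta.0")
old = supports_qa("webrecorder/browsertrix-crawler:0.12.4")
```

The API handler would then return an error status (e.g. 400) instead of launching the QA run when the check fails.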

… as well

by calling inc_crawl_complete_stats for both regular crawls and qa runs
@ikreymer ikreymer merged commit 4f676e4 into main Mar 21, 2024
4 checks passed
@ikreymer ikreymer deleted the qa-run branch March 21, 2024 05:42
Successfully merging this pull request may close these issues.

QA Backend: Add support for Crawl QA jobs