
Add endpoints to read pages from older crawl WACZs into database #1562

Merged

merged 15 commits into main from crawl-pages-oom on Mar 19, 2024

Conversation


@tw4l tw4l commented Feb 28, 2024

Fixes #1597

New endpoints (replacing old migration) to re-add crawl pages to db from WACZs.

After a few implementation attempts, we settled on using remotezip to handle parsing of the ZIP files and to stream the pages files line by line. I've also modified the sync log streaming to use remotezip, which allows us to remove our own zip module and let remotezip handle the complexity of parsing ZIP files.
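
As a rough sketch of the line-by-line streaming described above (the `pages/pages.jsonl` path and the helper name are illustrative assumptions, not the PR's actual code): `remotezip.RemoteZip` subclasses `zipfile.ZipFile`, so a reader written against the `ZipFile` interface works for both local and remote archives.

```python
import json
from typing import Iterator

def iter_pages(zip_file, filename: str = "pages/pages.jsonl") -> Iterator[dict]:
    """Yield one parsed page record per JSONL line.

    zip_file can be a zipfile.ZipFile or a remotezip.RemoteZip, since
    RemoteZip shares the ZipFile interface; only the requested member
    is fetched, never the whole archive.
    """
    with zip_file.open(filename) as fh:
        for line in fh:  # iterates lazily, one line at a time
            line = line.strip()
            if not line:
                continue
            yield json.loads(line)
```

With remotezip this would be driven as `with RemoteZip(presigned_url) as rz: for page in iter_pages(rz): ...`, keeping memory flat regardless of WACZ size.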

Database inserts for pages from WACZs are batched 100 at a time to speed up the endpoint, and the task is kicked off with asyncio.create_task so that the endpoint returns a response without blocking on the inserts.
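
The batching pattern might look like the following (a minimal sketch; `insert_pages_batched` and its parameters are hypothetical names, not the PR's actual code):

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, List

BATCH_SIZE = 100  # batch size described in the PR

async def insert_pages_batched(
    pages: AsyncIterator[dict],
    insert_many: Callable[[List[dict]], Awaitable[None]],
) -> int:
    """Buffer page dicts and flush them to the database 100 at a time."""
    batch: List[dict] = []
    total = 0
    async for page in pages:
        batch.append(page)
        if len(batch) >= BATCH_SIZE:
            await insert_many(batch)
            total += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        await insert_many(batch)
        total += len(batch)
    return total
```

The endpoint handler would then call `asyncio.create_task(insert_pages_batched(...))` and return immediately, letting the inserts run in the background.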

StorageOps now contains a method for streaming the bytes of any file in a remote WACZ, requiring only the presigned URL for the WACZ and the name of the file to stream.
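
A hedged sketch of what such a byte-streaming helper could look like (the function name and chunk size are assumptions for illustration; the actual StorageOps method may differ):

```python
from typing import Iterator

CHUNK_SIZE = 256 * 1024  # illustrative chunk size

def stream_file_from_wacz(zip_file, filename: str) -> Iterator[bytes]:
    """Yield decompressed chunks of a single file inside a WACZ.

    zip_file can be a remotezip.RemoteZip opened from the presigned URL
    (or any zipfile.ZipFile-compatible object); only the ranges needed
    for the requested member are read.
    """
    with zip_file.open(filename) as fh:
        while True:
            chunk = fh.read(CHUNK_SIZE)
            if not chunk:
                break
            yield chunk
```

Returning an iterator of bytes keeps the response streamable end to end, so a large log or pages file never has to be buffered in memory.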

@tw4l tw4l requested a review from ikreymer February 28, 2024 22:43
@tw4l tw4l marked this pull request as draft February 29, 2024 19:28
@tw4l tw4l force-pushed the crawl-pages-oom branch 2 times, most recently from 7ea1272 to 89d34db Compare March 5, 2024 21:38
@tw4l tw4l changed the title Await adding pages from crawls one at a time Add migration to read pages from older crawl WACZs into database Mar 5, 2024
@tw4l tw4l marked this pull request as ready for review March 5, 2024 22:40

tw4l commented Mar 5, 2024

@ikreymer This is ready for re-review, now with remotezip. I also switched log streaming to use remotezip, so I was able to remove our whole local zip module. I've done some local testing with the following results:

  1. Crawl log streaming from WACZ files is working once crawl finishes (nightly test also passes)
  2. Migration is working in local testing
  3. Memory usage doesn't seem to increase during either 1 or 2; I think we have the streaming working properly now

@tw4l tw4l changed the title Add migration to read pages from older crawl WACZs into database Add endpoints to read pages from older crawl WACZs into database Mar 7, 2024
@tw4l tw4l force-pushed the crawl-pages-oom branch 2 times, most recently from 298b8ab to ac81831 Compare March 14, 2024 16:44
@ikreymer ikreymer merged commit 21ae383 into main Mar 19, 2024
4 checks passed
@ikreymer ikreymer deleted the crawl-pages-oom branch March 19, 2024 21:14
Successfully merging this pull request may close these issues: Add method of populating pages for older crawls