
Dev 1.0.0 -> Main #482

Merged

merged 37 commits into main from dev-1.0.0 on Mar 5, 2024

Conversation

@ikreymer (Member) commented Mar 5, 2024

Placeholder for fast-forwarding dev-1.0.0 branch to main, in preparation for 1.0.0 release!

ikreymer and others added 30 commits November 7, 2023 21:38
Major refactoring of Browsertrix Crawler to natively capture network traffic to WARC files
via the Chrome DevTools Protocol (CDP). Allows for more flexibility and accuracy when dealing
with HTTP/2.x sites and avoids a MITM proxy. Addresses #343

Changes include:
- Recorder class for capturing CDP network traffic for each page.
- Handling requests from service workers via matching active frames, skipping unrelated requests outside the page (from background pages, etc.)
- WARC writing support via the TS-based warcio.js library.
- Generates a single WARC file per worker (still need to add size rollover).
- Request interception via Fetch.requestPaused (see the sketch after this list)
- Rule-based response rewriting support (via wabac.js), using Fetch.getResponseBody() / Fetch.fulfillRequest()
- Streaming responses via three methods: inline response fetch via Fetch.takeResponseBodyAsStream, async loading via the browser network stack with Network.loadNetworkResource(), and node-based async fetch via fetch()
- Direct async fetch() capture of non-HTML URLs
- Awaiting all requests to finish before moving on to the next page, up to the page timeout.
- Experimental: generate CDXJ on-the-fly as the WARC is being written (not yet in use).
- Removed pywb, using cdxj-indexer for the --generateCDX option.
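
A minimal sketch of the Fetch.requestPaused interception flow described above, assuming a puppeteer CDPSession; the names and error handling are illustrative, not the crawler's actual Recorder class:

```ts
import { CDPSession, Protocol } from "puppeteer-core";

// Pause each request at the Response stage so the body can be captured
// and (optionally) rewritten before the browser sees it.
async function startCapture(cdp: CDPSession) {
  cdp.on(
    "Fetch.requestPaused",
    async (params: Protocol.Fetch.RequestPausedEvent) => {
      const { requestId } = params;
      try {
        // body is available because we paused at the Response stage
        const { body, base64Encoded } = await cdp.send(
          "Fetch.getResponseBody",
          { requestId },
        );
        const payload = Buffer.from(body, base64Encoded ? "base64" : "utf-8");
        // ... apply rewriting rules and write WARC records here; a
        // rewritten body would be returned via Fetch.fulfillRequest ...
        void payload;
      } finally {
        // let the browser proceed with the original response
        await cdp.send("Fetch.continueRequest", { requestId });
      }
    },
  );

  await cdp.send("Fetch.enable", {
    patterns: [{ urlPattern: "*", requestStage: "Response" }],
  });
}
```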
Follows #424. Converts the upcoming 1.0.0 branch based on native browser-based traffic capture and recording to TypeScript. Fixes #426

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Co-authored-by: emma <hi@emma.cafe>
This adds prettier to the repo, and sets up the pre-commit hook to
auto-format as well as lint.
Also updates ignore files to exclude crawls, test-crawls, scratch, dist as needed.
- actually update lint/prettier/git ignore files with scratch, crawls, test-crawls, behaviors, as needed
Previously, responses >2MB were streamed to disk and an empty response returned to the browser,
to avoid holding large responses in memory.
This limit was too small, as some HTML pages may be >2MB, resulting in no content loaded.

This PR sets different limits for:
- HTML, as well as other JS necessary for the page to load: 25MB
- All other content: 5MB

Also includes some more type fixing
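
A hedged sketch of the per-type limits described above; the constants match the sizes given, but the helper and resource-type checks are illustrative:

```ts
const MAX_BROWSER_DEFAULT_FETCH_SIZE = 5_000_000; // 5MB for most content
const MAX_BROWSER_TEXT_FETCH_SIZE = 25_000_000; // 25MB for HTML/JS

// HTML documents and scripts must load fully for the page to work,
// so they get the larger limit before being streamed to disk.
function fetchSizeLimit(resourceType: string): number {
  if (resourceType === "document" || resourceType === "script") {
    return MAX_BROWSER_TEXT_FETCH_SIZE;
  }
  return MAX_BROWSER_DEFAULT_FETCH_SIZE;
}
```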
When calling directFetchCapture, and aborting the response via an
exception, throw `new Error("response-filtered-out");`
so that it can be ignored. This exception is only used for direct
capture, and should not be logged as an error - rethrow and
handle in the calling function to indicate the direct fetch was skipped
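
A minimal sketch of this pattern (the function name comes from the description above; the body is illustrative):

```ts
async function directFetchCapture(url: string): Promise<boolean> {
  const resp = await fetch(url);
  const mime = resp.headers.get("content-type") || "";
  if (mime.startsWith("text/html")) {
    // abort: HTML should be loaded via the browser, not direct-fetched
    throw new Error("response-filtered-out");
  }
  // ... stream resp.body into a WARC record (omitted) ...
  return true;
}

async function tryDirectFetch(url: string): Promise<boolean> {
  try {
    return await directFetchCapture(url);
  } catch (e) {
    if (e instanceof Error && e.message === "response-filtered-out") {
      // not an error: direct fetch is simply skipped for this URL
      return false;
    }
    throw e;
  }
}
```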
Due to an optimization, the numPending() call assumed that queueSize() would
be called to update the cached queue size. However, in the current worker
code, this is not the case. Remove caching of the queue size and just check
the queue size in numPending(), to ensure the pending list is always processed.
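
A hedged sketch of the fix; the key names and data layout are illustrative, not the crawler's actual Redis schema:

```ts
import Redis from "ioredis";

class CrawlState {
  constructor(
    private redis: Redis,
    private prefix: string,
  ) {}

  async queueSize(): Promise<number> {
    return this.redis.zcard(`${this.prefix}:q`);
  }

  async numPending(): Promise<number> {
    const pending = await this.redis.hlen(`${this.prefix}:p`);
    // Re-check the queue size here rather than relying on a value cached
    // by an earlier queueSize() call that may never have happened, so the
    // pending list is always processed.
    if (pending > 0 && (await this.queueSize()) === 0) {
      // ... requeue or expire stale pending entries (omitted) ...
    }
    return pending;
  }
}
```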
- add QueueEntry type for the JSON object stored in Redis
- and PageCallbacks for the callback type
- use Crawler type
- add LogContext type and enumerate all log contexts
- also add LOG_CONTEXT_TYPES array to validate --context arg
- rename errJSON -> formatErr, convert unknown (likely Error) to dict
- make logger info/error/debug accept unknown as well, to avoid explicit 'any' typing in all catch handlers
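
A minimal sketch of formatErr and the unknown-typed logging described above (the real logger has more context handling):

```ts
// convert an unknown value (usually an Error from a catch block)
// into a plain dict suitable for structured logging
export function formatErr(e: unknown): Record<string, string> {
  if (e instanceof Error) {
    return { type: "exception", message: e.message, stack: e.stack || "" };
  }
  return { message: String(e) };
}

// accepting unknown avoids explicit 'any' typing in catch handlers
export function logError(msg: string, details: unknown = {}): void {
  const data = details instanceof Error ? formatErr(details) : details;
  console.error(JSON.stringify({ logLevel: "error", msg, details: data }));
}
```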
…ted response support. (#440)

Support for rollover size and custom WARC prefix templates:
- reenable --rolloverSize (defaulting to 1GB) for when a new WARC is created
- support custom WARC prefix via --warcPrefix, prepended to the new WARC filename, tested via basic_crawl.test.js
- filename template for new files is: `${prefix}-${crawlId}-$ts-${this.workerid}.warc${this.gzip ? ".gz" : ""}`, with `$ts` replaced at new-file creation time with the current timestamp (see the sketch after this item)
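
A hedged sketch of the filename template in use; timestampNow() is an illustrative stand-in for the crawler's timestamp helper:

```ts
// compact UTC timestamp, e.g. "20240305123456"
function timestampNow(): string {
  return new Date().toISOString().replace(/[^\d]/g, "").slice(0, 14);
}

function newWarcFilename(
  prefix: string,
  crawlId: string,
  workerid: number,
  gzip: boolean,
): string {
  const template = `${prefix}-${crawlId}-$ts-${workerid}.warc${gzip ? ".gz" : ""}`;
  // $ts is substituted at new-file creation time
  return template.replace("$ts", timestampNow());
}
```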

Improved support for long (non-terminating) responses, such as from
live-streaming:
- add a size to CDP takeStream to ensure data is streamed in fixed
chunks, defaulting to 64k
- change shutdown order: first close browser, then finish writing all
WARCs to ensure any truncated responses can be captured.
- ensure WARC is not rewritten after it is done, skip writing records if
stream already flushed
- add timeout to final fetch tasks to avoid hanging forever on finish
- fix adding the `WARC-Truncated` header, which needs to be set after the
stream is finished to determine if it has been truncated
- move temp download `tmp-dl` dir to main temp folder, outside of
collection (no need to be there).
- on first page, attempt to evaluate the behavior class to ensure it
compiles
- if it fails to compile, log the exception as fatal and exit
- update behavior gathering code to keep track of behavior filename
- tests: add test for invalid behavior which causes crawl to exit with
fatal exit code (17)
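
A minimal sketch of the first-page check described above, assuming a puppeteer Page and a behavior script loaded as a string:

```ts
import { Page } from "puppeteer-core";

const FATAL_EXIT_CODE = 17; // exit code asserted by the test

async function checkBehaviorCompiles(page: Page, behaviorScript: string) {
  try {
    // evaluating the script in the page surfaces any compile error
    await page.evaluate(behaviorScript);
  } catch (e) {
    console.error("behavior script failed to compile: fatal", e);
    process.exit(FATAL_EXIT_CODE);
  }
}
```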
update yarn.lock
Generate records for each page, containing a list of resources and their
status codes, to aid in future diffing/comparison.

Generates a `urn:pageinfo:<page url>` record for each page
- Adds POST / non-GET request canonicalization from warcio to handle
non-GET requests
- Adds `writeSingleRecord` to WARCWriter

Fixes #457
…st pairs are not written to WARC (#460)

Allows for skipping network traffic that doesn't need to be stored, as
it is unnecessary or will result in incorrect replay (eg. a 304 instead of
a 200).
Fixes #462 

Add --writePagesToRedis arg, for use in conjunction with QA features in Browsertrix Cloud, to add
pages to the database for each crawl.
Ensure timestamp (as ISO date) is added to pages when they are serialized (both to pages.jsonl and redis)
Also include timestamp (as ISO date) in `pageinfo:` records

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
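
A hedged sketch of the serialization described above (the PageEntry shape and Redis key are illustrative):

```ts
import Redis from "ioredis";

interface PageEntry {
  id: string;
  url: string;
  title?: string;
  ts?: string; // ISO date, set at serialization time
}

async function writePage(redis: Redis, crawlId: string, page: PageEntry) {
  page.ts = new Date().toISOString();
  const line = JSON.stringify(page);
  // ... also appended to pages.jsonl (omitted) ...
  await redis.rpush(`${crawlId}:pages`, line);
}
```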
Ensure cached resources (that are not written to WARC) are still
included in the `urn:pageinfo:...` records. This will make it easier to
track which resources are actually *loaded* from a given page.

Tests: add test to ensure pageinfo records for webrecorder.net and webrecorder.net/about
include cached resources
- Update to Brave browser (1.62.165)
- Update page resource test to reflect latest Brave behavior
- recorder: don't attempt to record responses with mime type
`text/event-stream` (they will not terminate), as sketched after this list.
- resources: don't track non-http/https resources.
- resources: store page timestamp on first resource URL match, in case
multiple responses for the same page are encountered.
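
A minimal sketch of the two skip rules just described (the helper name is illustrative):

```ts
function shouldSkipRecording(url: string, mimeType: string): boolean {
  // event streams never terminate, so recording them would hang
  if (mimeType === "text/event-stream") {
    return true;
  }
  // only http/https resources are tracked
  return !url.startsWith("http://") && !url.startsWith("https://");
}
```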
The `:pageinfo:<url>` record now includes the mime type + resource type
(from Chrome) along with status code for each resource, for better
filtering / comparison.
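
A hedged sketch of the resulting record shape; the actual field names in the crawler may differ:

```ts
// per-resource entry in a urn:pageinfo:<page url> record
interface PageInfoResource {
  status: number; // HTTP status code
  mime: string; // mime type reported by Chrome
  type: string; // Chrome resource type, e.g. "document", "script"
}

interface PageInfoRecord {
  url: string; // the page URL
  ts: string; // ISO date
  urls: Record<string, PageInfoResource>; // resources loaded by the page
}
```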
In addition to `--warcPrefix` flag, also support WARC_PREFIX env var,
which takes precedence.
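
A one-line sketch of the precedence (the helper name is illustrative):

```ts
function getWarcPrefix(cliPrefix?: string): string {
  // WARC_PREFIX env var takes precedence over the --warcPrefix flag
  return process.env.WARC_PREFIX || cliPrefix || "";
}
```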
Bump to 1.0.0-beta.4
… field (#471)

Add JS errors from console to pageinfo records in additional `counts: {jsErrors: number}` field.
Ensure the warcwriter file is initialized on first use, instead of throwing an error
- it was being initialized from writeRecordPair() but not writeSingleRecord()
Ensure the env var / cli <warc prefix>-<crawlId> is also applied to
`screenshots.warc.gz` and `text.warc.gz`
- if a seed page redirects (page response != seed url), then add the
final url as a new seed with same scope
- add newScopeSeed() to ScopedSeed to duplicate seed with different URL,
store original includes / excludes
- also add check for 'chrome-error://' URLs for the page, and ensure
page is marked as failed if page.url() starts with chrome-error://
- fixes #475
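
A hedged sketch of newScopeSeed() and the chrome-error:// check; the ScopedSeed shape here is illustrative:

```ts
class ScopedSeed {
  constructor(
    public url: string,
    public include: RegExp[] = [],
    public exclude: RegExp[] = [],
  ) {}

  // duplicate this seed with a different URL, keeping original scope rules
  newScopeSeed(url: string): ScopedSeed {
    return new ScopedSeed(url, this.include, this.exclude);
  }
}

// if the seed page redirected, add the final URL as a new seed w/ same scope
function seedAfterRedirect(seed: ScopedSeed, finalUrl: string): ScopedSeed | null {
  return finalUrl !== seed.url ? seed.newScopeSeed(finalUrl) : null;
}

// pages that land on a chrome-error:// URL are marked as failed
function isPageFailed(pageUrl: string): boolean {
  return pageUrl.startsWith("chrome-error://");
}
```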
ikreymer and others added 7 commits February 28, 2024 22:56
don't treat non-200 pages as errors; still extract text, take
screenshots, and run behaviors
only consider actual page load errors, eg. a chrome-error:// page URL, as
errors
Requires webrecorder/browsertrix-behaviors#69 / browsertrix-behaviors
0.5.3, which will add support for behaviors to add links.

Simplify adding links by adding them directly, instead of
batching into groups of 500 links. Errors are already logged if queueing a new
URL fails.
Add fail-on-status-code option, --failOnInvalidStatus, to treat non-200
responses as failures. Can be especially useful when combined with
--failOnFailedSeed or --failOnFailedLimit

requeue: ensure requeued URLs are requeued with the same depth/priority, not 0
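
A minimal sketch of the requeue fix, assuming a Redis sorted-set queue scored by depth (the schema is illustrative):

```ts
import Redis from "ioredis";

interface QueueEntry {
  url: string;
  depth: number;
  extraHops?: number;
}

async function requeue(redis: Redis, qkey: string, entry: QueueEntry) {
  // score by the entry's original depth so priority is preserved, not 0
  await redis.zadd(qkey, entry.depth, JSON.stringify(entry));
}
```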
#481)

Add resourceType value from
https://chromedevtools.github.io/devtools-protocol/tot/Network/#type-ResourceType
as a `WARC-Resource-Type` header, lowercased to match the puppeteer/playwright convention
fixes #451
follow-up to #481: check reqresp.resourceType against the lowercase value;
set the message based on the resourceType value
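
A minimal sketch of setting the header as described (the headers map is illustrative):

```ts
function addResourceTypeHeader(
  warcHeaders: Map<string, string>,
  resourceType?: string,
) {
  if (resourceType) {
    // lowercased to match the puppeteer/playwright convention
    warcHeaders.set("WARC-Resource-Type", resourceType.toLowerCase());
  }
}
```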
@ikreymer ikreymer merged commit 65133c9 into main Mar 5, 2024
4 checks passed