Releases: webrecorder/browsertrix-crawler
Browsertrix Crawler 1.0.0 Beta 3
What's Changed
- Add arg to write pages to Redis by @tw4l in #464
- Page Resources: Include Cached Resources by @ikreymer in #465
Full Changelog: v1.0.0-beta.2...v1.0.0-beta.3
Browsertrix Crawler 1.0.0 Beta 2
What's Changed
- Bump puppeteer-core to ^20.8.2 to patch vulnerability by @tw4l in #459
- Generate urn:pageinfo: records by @ikreymer in #458
- skipping resources: ensure HEAD, OPTIONS, 204, 206, and 304 response/request pairs are not written to WARC by @ikreymer in #460
Full Changelog: 1.0.0-beta.1...v1.0.0-beta.2
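The skip rule from #460 can be pictured as a simple predicate over the request method and response status. The following is only a minimal sketch of that idea, not the crawler's actual code: the names `SKIPPED_METHODS`, `SKIPPED_STATUSES`, and `shouldSkipRecord` are hypothetical and were chosen for illustration.

```typescript
// Hypothetical illustration of the rule described in #460; the real
// crawler code may be structured differently.
const SKIPPED_METHODS: string[] = ["HEAD", "OPTIONS"];
const SKIPPED_STATUSES: number[] = [204, 206, 304];

// Returns true when a request/response pair should NOT be written to the WARC.
function shouldSkipRecord(method: string, status: number): boolean {
  const skipMethod = SKIPPED_METHODS.indexOf(method.toUpperCase()) !== -1;
  const skipStatus = SKIPPED_STATUSES.indexOf(status) !== -1;
  return skipMethod || skipStatus;
}
```

Under these assumptions, `shouldSkipRecord("GET", 200)` is false (the pair is written), while `shouldSkipRecord("GET", 304)` and `shouldSkipRecord("head", 200)` are both true (the pair is skipped).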
Browsertrix Crawler 0.12.4
What's Changed
Full Changelog: v0.12.3...v0.12.4
Browsertrix Crawler 1.0.0 Beta 1
What's Changed
- logging: don't log filtered out direct fetch attempt as error by @ikreymer in #432
- Fix potential for pending list never being processed by @ikreymer in #433
- more specific types additions by @ikreymer in #434
- Backport pending list never being reprocessed by @ikreymer in #438
- Add types + validation for log context options by @ikreymer in #435
- Bump sharp from 0.32.1 to 0.32.6 by @dependabot in #443
- add timeout to final awaitPendingClear() by @ikreymer in #442
- WARC filename prefix + rollover size + improved 'livestream' / truncated response support. by @ikreymer in #440
- detect invalid custom behaviors on load: by @ikreymer in #450
- Merge 0.12.3 into 1.0.0 by @ikreymer in #455
New Contributors
- @dependabot made their first contribution in #443
Full Changelog: 1.0.0-beta.0...1.0.0-beta.1
Browsertrix Crawler 0.12.3
Bug Fix Release: Ensure crawl doesn't get stuck indefinitely on pending requests at the end of the crawl.
What's Changed
- Bump sharp from 0.32.1 to 0.32.6 by @dependabot in #443
- add timeout to final awaitPendingClear() by @ikreymer in #442
Full Changelog: v0.12.2...v0.12.3
Browsertrix Crawler 0.12.2
What's Changed
Full Changelog: v0.12.1...v0.12.2
Browsertrix Crawler 1.0.0 Beta 0
Major Changes
- New recording/capture mechanism using browser CDP network traffic, instead of proxy
- TypeScript conversion
What's Changed
- Use new browser-based archiving mechanism instead of pywb proxy by @ikreymer in #424
- TypeScript Conversion by @ikreymer in #425
- Add Prettier to the repo, and format all the files! by @emma-sg in #428
- follow-up to #428: update ignore files by @ikreymer in #431
- Raise size limit for large HTML pages by @ikreymer in #430
Full Changelog: v0.12.1...1.0.0-beta.0
Browsertrix Crawler 0.12.1
Fixes
- Optimize exclusion removal, follow-up to #408
- Fix regression with --text false being rejected, while in use with Browsertrix Cloud (see: webrecorder/browsertrix#1334)
What's Changed
- Exclusion Filtering Optimizations: check exclusion before loading new page + additional improvements by @ikreymer in #423
Full Changelog: v0.12.0...v0.12.1
Browsertrix Crawler 0.12.0
Major Changes
- Use the same version of Brave for the base image, instead of slightly different Chrome (amd64) and Chromium (arm64)
- Support for faster cancelation of crawl via Redis key + signal
- Include CRC32 in storage webhook for nested WACZ support
- Dynamic exclusion addition/queue filter/removal via Redis message queue
- Text extraction stored in WARC records (both initial and final page after behaviors) with new --text options
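For readers unfamiliar with the CRC32 value mentioned above, here is a minimal sketch of the standard table-driven CRC-32 (the variant used by ZIP). It is not taken from the crawler, which may well compute the checksum with a library while uploading; the names `makeCrcTable`, `crc32`, and `asciiBytes` are hypothetical.

```typescript
// Build the 256-entry lookup table for the reflected polynomial 0xEDB88320.
function makeCrcTable(): Uint32Array {
  const table = new Uint32Array(256);
  for (let n = 0; n < 256; n++) {
    let c = n;
    for (let k = 0; k < 8; k++) {
      c = c & 1 ? 0xedb88320 ^ (c >>> 1) : c >>> 1;
    }
    table[n] = c >>> 0;
  }
  return table;
}

const CRC_TABLE = makeCrcTable();

// Compute CRC-32 over a sequence of byte values (0-255).
function crc32(bytes: ArrayLike<number>): number {
  let crc = 0xffffffff;
  for (let i = 0; i < bytes.length; i++) {
    crc = CRC_TABLE[(crc ^ bytes[i]) & 0xff] ^ (crc >>> 8);
  }
  // Final XOR, forced back to an unsigned 32-bit value.
  return (crc ^ 0xffffffff) >>> 0;
}

// Helper for the example only: turn an ASCII string into byte values.
function asciiBytes(text: string): number[] {
  const out: number[] = [];
  for (let i = 0; i < text.length; i++) {
    out.push(text.charCodeAt(i) & 0xff);
  }
  return out;
}
```

As a sanity check, `crc32(asciiBytes("123456789"))` yields `0xcbf43926`, the standard check value for this CRC variant.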
What's Changed
- Switch to Brave Base Image by @ikreymer in #400
- Store crawler start and end times in Redis lists by @tw4l in #397
- additional failure logic: by @ikreymer in #402
- tests: disable ad-block tests: seeing inconsistent ci behavior by @ikreymer in #407
- Fast cancelation + remove time counter by @ikreymer in #406
- disable component updates by setting --component-updater to invalid URL by @ikreymer in #413
- storage: also compute crc32 as part of storage webhook when uploading… by @ikreymer in #414
- Support adding/removing exclusions without restarting the crawler by @ikreymer in #408
- load saved state fixes + redis tests by @ikreymer in #415
- Return User-Agent on all code path to set headers appropriately by @benoit74 in #420
- improved text extraction: (addresses #403) by @ikreymer in #404
- More flexible multi value arg parsing + README update for 0.12.0 by @ikreymer in #422
Full Changelog: v0.11.2...v0.12.0
Browsertrix Crawler 0.12.0 Beta 2
What's Changed
- disable component updates by setting --component-updater to invalid URL by @ikreymer in #413
- storage: also compute crc32 as part of storage webhook when uploading… by @ikreymer in #414
- Support adding/removing exclusions without restarting the crawler by @ikreymer in #408
- load saved state fixes + redis tests by @ikreymer in #415
- Return User-Agent on all code path to set headers appropriately by @benoit74 in #420
Full Changelog: v0.12.0-beta.1...v0.12.0-beta.2