Releases: webrecorder/browsertrix-crawler

Browsertrix Crawler 1.0.0 Beta 3

16 Feb 23:20
Pre-release

What's Changed

  • Add arg to write pages to Redis by @tw4l in #464
  • Page Resources: Include Cached Resources by @ikreymer in #465

Full Changelog: v1.0.0-beta.2...v1.0.0-beta.3

Browsertrix Crawler 1.0.0 Beta 2

17 Jan 22:48
Pre-release

What's Changed

  • Bump puppeteer-core to ^20.8.2 to patch vulnerability by @tw4l in #459
  • Generate urn:pageinfo: records by @ikreymer in #458
  • skipping resources: ensure HEAD, OPTIONS, 204, 206, and 304 response/request pairs are not written to WARC by @ikreymer in #460
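The skip rule from #460 can be illustrated with a small predicate. This is only a sketch under assumptions: the function name `shouldSkipWarcRecord` and the two sets are hypothetical and not taken from the crawler's source, and the real implementation may check additional conditions.

```typescript
// Hypothetical helper for illustration; not the crawler's actual code.
// Request methods whose request/response pairs are not written to the WARC.
const SKIPPED_METHODS = new Set<string>(["HEAD", "OPTIONS"]);

// Response status codes whose request/response pairs are not written to the WARC.
const SKIPPED_STATUSES = new Set<number>([204, 206, 304]);

// Returns true when the pair should be left out of the WARC,
// based only on the request method and the response status.
function shouldSkipWarcRecord(method: string, status: number): boolean {
  return (
    SKIPPED_METHODS.has(method.toUpperCase()) || SKIPPED_STATUSES.has(status)
  );
}
```

Under these assumptions, a plain `GET` with status 200 is kept, while a `HEAD` request or a 304 response is skipped.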

Full Changelog: 1.0.0-beta.1...v1.0.0-beta.2

Browsertrix Crawler 0.12.4

17 Jan 22:13
cd3a1b0

What's Changed

  • Bump puppeteer-core to ^20.8.2 to patch vulnerability by @tw4l in #459

Full Changelog: v0.12.3...v0.12.4

Browsertrix Crawler 1.0.0 Beta 1

03 Jan 09:01
Pre-release

What's Changed

  • logging: don't log filtered out direct fetch attempt as error by @ikreymer in #432
  • Fix potential for pending list never being processed by @ikreymer in #433
  • more specific types additions by @ikreymer in #434
  • Backport pending list never being reprocessed by @ikreymer in #438
  • Add types + validation for log context options by @ikreymer in #435
  • Bump sharp from 0.32.1 to 0.32.6 by @dependabot in #443
  • add timeout to final awaitPendingClear() by @ikreymer in #442
  • WARC filename prefix + rollover size + improved 'livestream' / truncated response support. by @ikreymer in #440
  • detect invalid custom behaviors on load: by @ikreymer in #450
  • Merge 0.12.3 into 1.0.0 by @ikreymer in #455

Full Changelog: 1.0.0-beta.0...1.0.0-beta.1

Browsertrix Crawler 0.12.3

17 Nov 07:27
c3b98e5

Bug Fix Release: Ensure the crawl doesn't get stuck indefinitely on pending requests at the end of the crawl.

Full Changelog: v0.12.2...v0.12.3

Browsertrix Crawler 0.12.2

15 Nov 02:19
9ba0b9e

What's Changed

  • Fix for pending list never being reprocessed in some situations by @ikreymer in #438

Full Changelog: v0.12.1...v0.12.2

Browsertrix Crawler 1.0.0 Beta 0

10 Nov 07:55
ab0f66a
Pre-release

Major Changes

  • New recording/capture mechanism using browser CDP network traffic, instead of proxy
  • TypeScript conversion

Full Changelog: v0.12.1...1.0.0-beta.0

Browsertrix Crawler 0.12.1

03 Nov 22:18
dd7b926

Fixes

  • Optimize exclusion removal, follow-up to #408
  • Fix regression with --text false being rejected, while in use with Browsertrix Cloud (see: webrecorder/browsertrix#1334)

What's Changed

  • Exclusion Filtering Optimizations: check exclusion before loading new page + additional improvements by @ikreymer in #423

Full Changelog: v0.12.0...v0.12.1

Browsertrix Crawler 0.12.0

02 Nov 18:55
15661eb

Major Changes

  • Use the same version of Brave for the base image, instead of slightly different Chrome (amd64) and Chromium (arm64) builds
  • Support for faster cancelation of crawl via Redis key + signal
  • Include CRC32 in storage webhook for nested WACZ support
  • Dynamic exclusion addition/queue filter/removal via Redis message queue
  • Text extraction stored in WARC records (both initial and final page after behaviors) with new --text options

What's Changed

  • Switch to Brave Base Image by @ikreymer in #400
  • Store crawler start and end times in Redis lists by @tw4l in #397
  • additional failure logic: by @ikreymer in #402
  • tests: disable ad-block tests: seeing inconsistent ci behavior by @ikreymer in #407
  • Fast cancelation + remove time counter by @ikreymer in #406
  • disable component updates by setting --component-updater to invalid URL by @ikreymer in #413
  • storage: also compute crc32 as part of storage webhook when uploading… by @ikreymer in #414
  • Support adding/removing exclusions without restarting the crawler by @ikreymer in #408
  • load saved state fixes + redis tests by @ikreymer in #415
  • Return User-Agent on all code path to set headers appropriately by @benoit74 in #420
  • improved text extraction: (addresses #403) by @ikreymer in #404
  • More flexible multi value arg parsing + README update for 0.12.0 by @ikreymer in #422

Full Changelog: v0.11.2...v0.12.0

Browsertrix Crawler 0.12.0 Beta 2

28 Oct 01:36
Pre-release

What's Changed

  • disable component updates by setting --component-updater to invalid URL by @ikreymer in #413
  • storage: also compute crc32 as part of storage webhook when uploading… by @ikreymer in #414
  • Support adding/removing exclusions without restarting the crawler by @ikreymer in #408
  • load saved state fixes + redis tests by @ikreymer in #415
  • Return User-Agent on all code path to set headers appropriately by @benoit74 in #420

Full Changelog: v0.12.0-beta.1...v0.12.0-beta.2