Releases: webrecorder/browsertrix-crawler

Browsertrix Crawler 0.10.0 Beta 4

22 May 23:26
f51154f
Pre-release

What's Changed

  • Improve thumbnails with sharp by @tw4l in #304
  • Chrome 112 + new headless mode + consistent viewport tweaks by @ikreymer in #316

Full Changelog: 0.10.0-beta.3...0.10.0-beta.4

Browsertrix Crawler 0.10.0 Beta 3

19 May 14:49
Pre-release

What's Changed

  • Disable Chrome optimization logic by @malemburg in #312
  • stopping: if crawl is marked as stopping, and no warcs found, mark st… by @ikreymer in #314

Full Changelog: 0.10.0-beta.2...0.10.0-beta.3

Browsertrix Crawler 0.10.0 Beta 2

07 May 21:14
Pre-release

What's Changed

  • Log fatal messages to redis errors by @tw4l in #305
  • Consolidate wacz error loglines by @tw4l in #306
  • state: adjust redis keys to be more consistent by @ikreymer in #309
  • pywb: don't convert bounded range requests to unbounded (pywb 2.7.4 dev)

Full Changelog: 0.10.0-beta.1...0.10.0-beta.2

Browsertrix Crawler 0.10.0 Beta 1

06 May 07:24
Pre-release

Full Changelog: 0.10.0-beta.0...0.10.0-beta.1

Browsertrix Crawler 0.10.0 Beta 0

27 Apr 00:41
d4bc9e8
Pre-release

Breaking Changes

  • Switch back to Puppeteer from Playwright due to memory issues (#298)
  • Internal: the Redis key {crawl_id}:d now stores the number of pages done instead of a list of pages done

What's Changed

  • Add option to log errors to redis by @tw4l in #279
  • Store done in redis as integer and only save full json in redis for failed pages by @tw4l in #284
  • worker: lower wait time, in case where no additional pages remain and… by @ikreymer in #289
  • Store archive dir size in Redis by @tw4l in #291
  • origin override: add --originOverride source=dest to allow routing wh… by @ikreymer in #281
  • Quick exit on redis connection error after interrupt by @ikreymer in #292
  • Fixes from 0.9.1 by @ikreymer in #297
  • Switch back to Puppeteer from Playwright by @ikreymer in #301
  • Catch 4xx and 5xx page.goto() responses to mark invalid URLs as failed by @tw4l in #300

Full Changelog: 0.9.0...0.10.0-beta.0

Browsertrix Crawler 0.9.1

24 Apr 16:59
a2a38ce

Bug fix release for screenshots and service workers.

What's Changed

  • Fix full page screenshot by @tw4l in #296
  • Fix Service Workers being blocked after the switch to Playwright (enabled by default, disabled when profiles are used, to match 0.8.x functionality)

Full Changelog: 0.9.0...0.9.1

Browsertrix Crawler 0.9.0

08 Apr 00:52

Major Changes

  • BREAKING: Switched from Puppeteer to Playwright. Custom drivers need to be migrated; see https://playwright.dev/docs/puppeteer or the https://github.com/checkly/puppeteer-to-playwright tool
  • Removed puppeteer cluster
  • Always using Redis-based crawl state
  • Use priority based crawl queue, with URLs of lower depth crawled first and extra hops always crawled last.
  • Store 'loadState' in each page, indicating level of loading, bail behaviors run if initial load fails
  • Improved timeouts for each page (page load time + behavior time + extra delay)
  • New options, including: --pageExtraDelay, --diskUtilization, --maxPageLimit, --title, --description, --logLevel, --context
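
The priority-based queue ordering described above (lower depth first, extra hops last) can be illustrated with a minimal sketch. This is not the crawler's actual implementation, which keeps its state in Redis; the class name CrawlQueue, the priority() helper, and the scoring rule are hypothetical and only show one way such an ordering could be expressed.

```python
import heapq
from itertools import count


def priority(depth: int, extra_hops: int, max_depth: int) -> int:
    """Hypothetical score: smaller values are crawled earlier."""
    if extra_hops == 0:
        # In-scope pages are ordered by their link depth.
        return depth
    # Extra-hop pages always score above every in-scope depth.
    return max_depth + extra_hops


class CrawlQueue:
    """In-memory stand-in for a priority-based crawl queue (assumption)."""

    def __init__(self, max_depth: int) -> None:
        self._heap = []
        self._seq = count()  # tie-breaker keeps insertion order stable
        self._max_depth = max_depth

    def push(self, url: str, depth: int, extra_hops: int = 0) -> None:
        score = priority(depth, extra_hops, self._max_depth)
        heapq.heappush(self._heap, (score, next(self._seq), url))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)


queue = CrawlQueue(max_depth=2)
queue.push("https://example.com/a/b", depth=2)
queue.push("https://other.example/", depth=3, extra_hops=1)
queue.push("https://example.com/", depth=0)
queue.push("https://example.com/a", depth=1)

# Drain the queue: seed first, then deeper pages, extra hop last.
order = [queue.pop() for _ in range(len(queue))]
```

In this sketch the extra-hop URL is popped last even though it was queued second, because its score is always larger than any in-scope depth.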

What's Changed

  • logging: serialize regex as string to avoid empty '{}' when logging s… by @ikreymer in #235
  • Remove puppeteer-cluster by @tw4l in #219
  • Fix size check by @ikreymer in #241
  • Add timedRun to prevent async operations from hanging by @tw4l in #243
  • Add total timeout + limit redis queue retries by @ikreymer in #248
  • Minor crawler fixes after puppeteer-cluster removal refactoring by @tw4l in #250
  • Dev 0.9.0 Beta 1 Work - Playwright Removal + Worker Refactor + Redis State by @ikreymer in #253
  • Logger cleanup by @ikreymer in #254
  • Catch loading issues by @ikreymer in #255
  • Add option for sleep interval after behaviors run by @tw4l in #257
  • worker index: set worker index automatically to work with k8s naming by @ikreymer in #266
  • Reset locked pending URLs when crawler restarts. by @ikreymer in #267
  • Ensure crawler can't run out of space with --diskUtilization param by @tw4l in #264
  • Add options to filter logs by --logLevel and --context by @tw4l in #271
  • Update README for 0.9.0 by @tw4l in #272
  • blockrules/logger: use global logger var by @ikreymer in #274
  • Add --maxPageLimit override by @ikreymer in #275
  • Add --title and --description CLI args to write metadata into datapackage.json by @tw4l in #276
  • Don't set viewport for full page screenshots by @tw4l in #221

Full Changelog: 0.8.1...0.9.0

Browsertrix Crawler 0.9.0 Beta 2

03 Apr 19:20
Pre-release

What's Changed

  • worker index: set worker index automatically to work with k8s naming by @ikreymer in #266
  • Reset locked pending URLs when crawler restarts. by @ikreymer in #267
  • Ensure crawler can't run out of space with --diskUtilization param by @tw4l in #264
  • Add options to filter logs by --logLevel and --context by @tw4l in #271
  • Update README for 0.9.0 by @tw4l in #272
  • blockrules/logger: use global logger var by @ikreymer in #274
  • Add --maxPageLimit override by @ikreymer in #275

Full Changelog: 0.9.0-beta.1...0.9.0-beta.2

Browsertrix Crawler 0.9.0 Beta 1

23 Mar 15:41
b0e93cb
Pre-release

What's Changed

  • logging: serialize regex as string to avoid empty '{}' when logging s… by @ikreymer in #235
  • Remove puppeteer-cluster by @tw4l in #219
  • Fix size check by @ikreymer in #241
  • Add timedRun to prevent async operations from hanging by @tw4l in #243
  • Add total timeout + limit redis queue retries by @ikreymer in #248
  • Minor crawler fixes after puppeteer-cluster removal refactoring by @tw4l in #250
  • Dev 0.9.0 Beta 1 Work - Playwright Removal + Worker Refactor + Redis State by @ikreymer in #253
  • Logger cleanup by @ikreymer in #254
  • Catch loading issues by @ikreymer in #255
  • Add option for sleep interval after behaviors run by @tw4l in #257

Full Changelog: 0.8.1...0.9.0-beta.1

Browsertrix Crawler 0.8.1

25 Feb 02:34

Full Changelog: 0.8.0...0.8.1