Releases: webrecorder/browsertrix-crawler

Browsertrix Crawler 0.12.0 Beta 1

09 Oct 21:05
Pre-release

What's Changed

  • Store crawler start and end times in Redis lists by @tw4l in #397
  • additional failure logic: by @ikreymer in #402
  • tests: disable ad-block tests: seeing inconsistent ci behavior by @ikreymer in #407
  • Fast cancelation + remove time counter by @ikreymer in #406

Full Changelog: v0.12.0-beta.0...v0.12.0-beta.1

Browsertrix Crawler 0.12.0 Beta 0

02 Oct 21:40
f453dbf
Pre-release

Switching to Brave from Chrome/Chromium!

Full Changelog: v0.11.2...v0.12.0-beta.0

Browsertrix Crawler 0.11.2

29 Sep 18:54

Full Changelog: v0.11.1...v0.11.2

Browsertrix Crawler 0.11.1

19 Sep 03:45
c6cbbc1

Bug Fix Release

This release should fix several issues where crawls got stuck and did not continue, or where the screencast stopped after a while. Fixes include:

  • Detecting 'page crash' events and logging them
  • Detecting 'browser crash' events and interrupting crawl (after saving state / ensuring data is written to WARCs)

Full Changelog: v0.11.0...v0.11.1

Browsertrix Crawler 0.11.0

15 Sep 18:28

New Features

  • Store favicon URLs as favIconUrl in pages.jsonl
  • Support for filtering sitemap entries by date (from a specified date)
  • Link extraction optimizations
  • Behaviors now run only after the page is fully loaded and link extraction has finished; previously, autoplay/autofetch would start right away.
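
As an illustration of the new favIconUrl field, the following is a minimal sketch of reading it back out of a pages.jsonl file. Only the favIconUrl name comes from the notes above; the header record, the other field names, and the sample values are assumptions made for the example, and favicons_by_page is a hypothetical helper, not part of the crawler.

```python
import json

# Sample pages.jsonl content. Only favIconUrl is taken from the release notes;
# the header record and the other fields are assumed for illustration.
SAMPLE = """\
{"format": "json-pages-1.0", "id": "pages", "title": "All Pages"}
{"id": "1", "url": "https://example.com/", "title": "Example", "favIconUrl": "https://example.com/favicon.ico"}
{"id": "2", "url": "https://example.com/about", "title": "About"}
"""


def favicons_by_page(jsonl_text: str) -> dict[str, str]:
    """Map page URL -> favicon URL for records that carry favIconUrl."""
    result = {}
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Records without a url (such as a header line) or without a
        # favicon are skipped rather than treated as errors.
        if "url" in record and "favIconUrl" in record:
            result[record["url"]] = record["favIconUrl"]
    return result


favicons = favicons_by_page(SAMPLE)
print(favicons)
```

Pages without a favicon are simply absent from the result, so a consumer can fall back to a default icon.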

Full Changelog: v0.10.4...v0.11.0

Browsertrix Crawler 0.10.4

23 Aug 00:22
cf404ef

Bug fix release

What's Changed

  • args parsing: fix parseRx() for inclusions/exclusions to deal with no… by @ikreymer in #353
  • mark for upload-and-delete when crawl is interrupted for any limit: by @ikreymer in #354
  • improve crawl stopped check with unified isCrawlRunning() check with … by @ikreymer in #356

Full Changelog: v0.10.3...v0.10.4

Browsertrix Crawler 0.10.3

08 Aug 17:24

What's Changed

  • Fix for sizeLimit: only delete local data if a WACZ has been uploaded by @ikreymer in #347
  • seed parsing: return null if invalid url encountered in parseUrl to a… by @ikreymer in #349

Full Changelog: 0.10.2...v0.10.3

Browsertrix Crawler 0.10.2

06 Jul 20:11
442f448

What's Changed

  • Fix disk utilization computation errors by @tw4l in #338
  • profiles: use newly provided puppeteer page.setBypassServiceWorker() … by @ikreymer in #340
  • Bump browsertrix-behaviors to ^0.5.1 by @tw4l in #341
  • Allow configuration of deduplication policy by @wvengen in #332
  • feat: Add custom behavior injection by @lambdahands in #285

Full Changelog: 0.10.1...0.10.2

Browsertrix Crawler 0.10.1

31 May 02:39
c7dc504

What's Changed

  • Ignore spaces in double quotes when splitting process.env.CRAWL_ARGS by @tw4l in #323
  • Origin Overrides: Ensure Host header also set by @ikreymer in #326
  • deps: update puppeteer-core to 20.4.0, fixes #324 by @ikreymer in #325
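
The CRAWL_ARGS fix above concerns splitting an argument string on spaces without breaking double-quoted values apart. A minimal sketch of that general technique follows; it is not the crawler's actual implementation, and split_args and the flag names in the sample string are hypothetical.

```python
import re

# One token is a run of non-space, non-quote characters and/or complete
# double-quoted segments, so spaces inside quotes do not end the token.
TOKEN_RX = re.compile(r'(?:[^\s"]|"[^"]*")+')


def split_args(raw: str) -> list[str]:
    """Split on whitespace, keeping spaces that sit inside double quotes."""
    # The quotes only delimit a value, so they are dropped from the result.
    return [token.replace('"', "") for token in TOKEN_RX.findall(raw)]


# Hypothetical flags, used only to show the splitting behavior.
args = split_args('--name "two words" --count 2')
print(args)
```

A plain split on whitespace would instead break the quoted value into two separate tokens, each still carrying a stray quote character.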

Full Changelog: 0.10.0...0.10.1

Browsertrix Crawler 0.10.0

23 May 19:48

Major Changes

  • Switch back to Puppeteer from Playwright due to memory issues (#298)
  • Internal: Redis key {crawl_id}:d is now a count of pages done instead of a list of pages done
  • Using Chrome 112 for crawling
  • Can combine predefined crawl scopes with additional include options
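
For external tools that read the {crawl_id}:d key directly, the change from a list to a number means the value has to be read differently. The sketch below shows one way a reader could tolerate both shapes. It assumes a client object exposing redis-py style type(), llen(), and get() methods; pages_done is a hypothetical helper, and the handling of a missing key is an assumption.

```python
def pages_done(client, crawl_id: str) -> int:
    """Return the number of pages done, for either shape of the key."""
    key = f"{crawl_id}:d"
    kind = client.type(key)
    # Clients may return bytes unless configured to decode responses.
    if isinstance(kind, bytes):
        kind = kind.decode()
    if kind == "list":
        # Older layout: one list entry per finished page.
        return client.llen(key)
    if kind == "string":
        # Newer layout: the key holds the count itself.
        return int(client.get(key))
    # Key missing or of an unexpected type: treat as zero (an assumption).
    return 0
```

With redis-py, the client argument would typically be a redis.Redis instance connected to the same Redis server the crawler uses.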

What's Changed

  • Add option to log errors to redis by @tw4l in #279
  • Store done in redis as integer and only save full json in redis for failed pages by @tw4l in #284
  • worker: lower wait time, in case where no additional pages remain and… by @ikreymer in #289
  • Store archive dir size in Redis by @tw4l in #291
  • origin override: add --originOverride source=dest to allow routing wh… by @ikreymer in #281
  • Quick exit on redis connection error after interrupt by @ikreymer in #292
  • Fixes from 0.9.1 by @ikreymer in #297
  • Switch back to Puppeteer from Playwright by @ikreymer in #301
  • Catch 4xx and 5xx page.goto() responses to mark invalid URLs as failed by @tw4l in #300
  • crawl stopping / additional states: by @ikreymer in #303
  • Log fatal messages to redis errors by @tw4l in #305
  • Consolidate wacz error loglines by @tw4l in #306
  • state: adjust redis keys to be more consistent by @ikreymer in #309
  • Disable Chrome optimization logic by @malemburg in #312
  • stopping: if crawl is marked as stopping, and no warcs found, mark st… by @ikreymer in #314
  • Improve thumbnails with sharp by @tw4l in #304
  • Chrome 112 + new headless mode + consistent viewport tweaks by @ikreymer in #316
  • allow adding --include with pre-existing --scopeType values (besides … by @ikreymer in #319

Full Changelog: 0.9.1...0.10.0