Releases · webrecorder/browsertrix-crawler

09 Oct 21:05

ikreymer

v0.12.0-beta.1

9ae297c

Browsertrix Crawler 0.12.0 Beta 1

Pre-release

What's Changed

Store crawler start and end times in Redis lists by @tw4l in #397
additional failure logic: by @ikreymer in #402
tests: disable ad-block tests: seeing inconsistent ci behavior by @ikreymer in #407
Fast cancelation + remove time counter by @ikreymer in #406

Full Changelog: v0.12.0-beta.0...v0.12.0-beta.1

Contributors

ikreymer and tw4l


02 Oct 21:40

ikreymer

v0.12.0-beta.0

f453dbf

Browsertrix Crawler 0.12.0 Beta 0

Pre-release

Switching to Brave from Chrome/Chromium!

What's Changed

Switch to Brave Base Image by @ikreymer in #400

Full Changelog: v0.11.2...v0.12.0-beta.0

Contributors

ikreymer


29 Sep 18:54

ikreymer

v0.11.2

4c7ebf1

Browsertrix Crawler 0.11.2

What's Changed

more logging improvements by @ikreymer in #389
additional fixes for worker getting stuck by @ikreymer in #396
Update README.md by @gitreich in #390
Set new logic for invalid seeds by @tw4l in #395

New Contributors

@gitreich made their first contribution in #390

Full Changelog: v0.11.1...v0.11.2

Contributors

ikreymer, tw4l, and gitreich


19 Sep 03:45

ikreymer

v0.11.1

c6cbbc1

Browsertrix Crawler 0.11.1

Bug Fix Release

This release should fix several issues where crawls got stuck and did not continue and/or the screencast stopped after a while. Fixes include:

Detecting 'page crash' events and logging them
Detecting 'browser crash' events and interrupting crawl (after saving state / ensuring data is written to WARCs)

What's Changed

favicon: use 127.0.0.1 instead of localhost by @ikreymer in #384
Error handling fixes to avoid crawler getting stuck. by @ikreymer in #385
Update CI Release Action by @ikreymer in #386

Full Changelog: v0.11.0...v0.11.1

Contributors

ikreymer


15 Sep 18:28

ikreymer

v0.11.0

debfe89

Browsertrix Crawler 0.11.0

New Features

Store favicon urls as favIconUrl in pages.jsonl
Support for filtering sitemap by date (from specified date)
Link extraction optimizations
Behaviors now run only after the page is fully loaded and link extraction has finished; previously, autoplay/autofetch would start right away.
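As an illustration of the new favIconUrl field, the sketch below collects favicon URLs from the contents of a pages.jsonl file. It assumes each line is a JSON object, that page entries carry a `url` field alongside the optional `favIconUrl`, and that lines without both fields (such as a header line) can be skipped. The function name `favicon_urls` and the sample data are hypothetical and not part of the crawler.

```python
import json


def favicon_urls(jsonl_text: str) -> dict[str, str]:
    """Map page URL -> favicon URL for every entry that has a favIconUrl."""
    result: dict[str, str] = {}
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        record = json.loads(line)
        url = record.get("url")
        icon = record.get("favIconUrl")
        # Assumption: entries lacking either field (e.g. a header line) are skipped.
        if url and icon:
            result[url] = icon
    return result


# Hypothetical sample input, for illustration only.
sample = "\n".join([
    '{"format": "json-pages-1.0"}',
    '{"url": "https://example.com/", "favIconUrl": "https://example.com/favicon.ico"}',
    '{"url": "https://example.com/about"}',
])
print(favicon_urls(sample))
```

Pages without a favicon are simply absent from the result, so callers can tell "no favicon recorded" apart from an empty string.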

What's Changed

link extraction optimization: for scopeType page, set depth == extraH… by @ikreymer in #364
improve exit features: individual instance exit + exit code for interrupt by @ikreymer in #366
feat: precommit by @Chickensoupwithrice in #363
Capture Favicon by @Chickensoupwithrice in #362
logging: resolve confusion with 'crawl done' not being written to log… by @ikreymer in #375
logging fixes: avoid duplicate logging for same error by @ikreymer in #377
Surface lastmod option for sitemap parser by @ghukill in #367
Add example of mounting custom behaviours by @Chickensoupwithrice in #369
various fixes regarding state restart: by @ikreymer in #370
status: fix typo setting status to log message by @ikreymer in #379
Add option to output stats file live, i.e. after each page crawled by @benoit74 in #374
behavior logging tweaks, add netIdle by @ikreymer in #381
Update tldextract cache for pywb during build by @vnznznz in #383
Enhance file stats test to detect file modification by @benoit74 in #382
optimize link extraction: (fixes #376) by @ikreymer in #380
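The sitemap date filter mentioned above (filtering from a specified date via lastmod) can be pictured roughly as follows. This is a minimal sketch of the general idea, not the crawler's actual parser: it assumes a standard sitemaps.org `urlset` document, treats date-only `lastmod` values as UTC midnight, and skips entries that have no `lastmod`. The names `parse_lastmod` and `urls_modified_since` are hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# Namespace used by standard sitemap files.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_lastmod(value: str) -> datetime:
    """Parse a lastmod value such as 2023-09-10 or 2023-09-10T08:00:00Z."""
    parsed = datetime.fromisoformat(value.strip().replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: values without a timezone are interpreted as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def urls_modified_since(sitemap_xml: str, from_date: datetime) -> list[str]:
    """Return the <loc> of every <url> whose <lastmod> is on or after from_date."""
    root = ET.fromstring(sitemap_xml)
    kept = []
    for entry in root.findall("sm:url", NS):
        loc = entry.findtext("sm:loc", namespaces=NS)
        lastmod = entry.findtext("sm:lastmod", namespaces=NS)
        # Assumption: entries without a lastmod are skipped.
        if loc and lastmod and parse_lastmod(lastmod) >= from_date:
            kept.append(loc.strip())
    return kept


# Hypothetical sample sitemap, for illustration only.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/old</loc><lastmod>2023-01-01</lastmod></url>
  <url><loc>https://example.com/new</loc><lastmod>2023-09-10T08:00:00Z</lastmod></url>
  <url><loc>https://example.com/undated</loc></url>
</urlset>"""
print(urls_modified_since(sample, datetime(2023, 9, 1, tzinfo=timezone.utc)))
```

Whether undated entries should be kept or dropped is a policy choice; the sketch drops them, which may differ from what the crawler does.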

New Contributors

@Chickensoupwithrice made their first contribution in #363
@ghukill made their first contribution in #367
@benoit74 made their first contribution in #374
@vnznznz made their first contribution in #383

Full Changelog: v0.10.4...v0.11.0

Contributors

ikreymer, ghukill, and 3 other contributors


23 Aug 00:22

ikreymer

v0.10.4

cf404ef

Browsertrix Crawler 0.10.4

Bug fix release

What's Changed

args parsing: fix parseRx() for inclusions/exclusions to deal with no… by @ikreymer in #353
mark for upload-and-delete when crawl is interrupted for any limit: by @ikreymer in #354
improve crawl stopped check with unified isCrawlRunning() check with … by @ikreymer in #356

Full Changelog: v0.10.3...v0.10.4

Contributors

ikreymer


08 Aug 17:24

ikreymer

v0.10.3

16751de

Browsertrix Crawler 0.10.3

What's Changed

Fix for sizeLimit: only delete local data if a WACZ has been uploaded by @ikreymer in #347
seed parsing: return null if invalid url encountered in parseUrl to a… by @ikreymer in #349

Full Changelog: 0.10.2...v0.10.3

Contributors

ikreymer


06 Jul 20:11

ikreymer

0.10.2

442f448

Browsertrix Crawler 0.10.2

What's Changed

Fix disk utilization computation errors by @tw4l in #338
profiles: use newly provided puppeteer page.setBypassServiceWorker() … by @ikreymer in #340
Bump browsertrix-behaviors to ^0.5.1 by @tw4l in #341
Allow configuration of deduplication policy by @wvengen in #332
feat: Add custom behavior injection by @lambdahands in #285

New Contributors

@wvengen made their first contribution in #332
@lambdahands made their first contribution in #285

Full Changelog: 0.10.1...0.10.2

Contributors

wvengen, ikreymer, and 2 other contributors


31 May 02:39

ikreymer

0.10.1

c7dc504

Browsertrix Crawler 0.10.1

What's Changed

Ignore spaces in double quotes when splitting process.env.CRAWL_ARGS by @tw4l in #323
Origin Overrides: Ensure Host header also set by @ikreymer in #326
deps: update puppeteer-core to 20.4.0, fixes #324 by @ikreymer in #325
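The CRAWL_ARGS change above concerns splitting an argument string on spaces while leaving spaces inside double quotes alone. The sketch below shows one way to do that with a regular expression. It is not the crawler's implementation (which is JavaScript); the function name `split_crawl_args` is hypothetical, and keeping the quote characters in the resulting tokens is an assumption made for simplicity.

```python
import re

# A token is a run of non-space, non-quote characters and/or complete "..." sections.
_TOKEN = re.compile(r'(?:[^\s"]+|"[^"]*")+')


def split_crawl_args(raw: str) -> list[str]:
    """Split on whitespace, but never inside a double-quoted section."""
    return _TOKEN.findall(raw)


# Hypothetical argument string, for illustration only.
print(split_crawl_args('--url https://example.com/ --name "my test crawl"'))
```

A plain `raw.split(" ")` would break the quoted value into three separate tokens, which is the kind of problem the fix addresses.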

Full Changelog: 0.10.0...0.10.1

Contributors

ikreymer and tw4l


23 May 19:48

ikreymer

0.10.0

db46cdf

Browsertrix Crawler 0.10.0

Major Changes

Switch back to Puppeteer from Playwright due to memory issues (#298)
Internal: the Redis key {crawl_id}:d now holds the number of pages done instead of a list of pages done
Using Chrome 112 for Crawling
Can combine predefined crawl scopes with additional include options
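For anyone reading crawl state directly, the Redis key change above means the done key is now read as a single integer rather than as a list. A minimal sketch, assuming a client object that exposes `get(key)` in the style of redis-py's `redis.Redis`; the key name is taken from the note above and may differ in other versions, and `pages_done`, `FakeRedis`, and the crawl id `my-crawl` are hypothetical.

```python
def pages_done(client, crawl_id: str) -> int:
    """Return the number of finished pages stored under '<crawl_id>:d'."""
    raw = client.get(f"{crawl_id}:d")
    # Assumption: a missing key means no page has been counted yet.
    return int(raw) if raw is not None else 0


class FakeRedis:
    """Tiny in-memory stand-in so the sketch runs without a Redis server."""

    def __init__(self, data: dict):
        self._data = data

    def get(self, key: str):
        return self._data.get(key)


# Real clients typically return bytes, which int() accepts directly.
client = FakeRedis({"my-crawl:d": b"42"})
print(pages_done(client, "my-crawl"))
```

Code written against the old list-based layout (for example, taking the list length) would need to switch to an integer read like this one.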

What's Changed

Add option to log errors to redis by @tw4l in #279
Store done in redis as integer and only save full json in redis for failed pages by @tw4l in #284
worker: lower wait time, in case where no additional pages remain and… by @ikreymer in #289
Store archive dir size in Redis by @tw4l in #291
origin override: add --originOverride source=dest to allow routing wh… by @ikreymer in #281
Quick exit on redis connection error after interrupt by @ikreymer in #292
Fixes from 0.9.1 by @ikreymer in #297
Switch back to Puppeteer from Playwright by @ikreymer in #301
Catch 4xx and 5xx page.goto() responses to mark invalid URLs as failed by @tw4l in #300
crawl stopping / additional states: by @ikreymer in #303
Log fatal messages to redis errors by @tw4l in #305
Consolidate wacz error loglines by @tw4l in #306
state: adjust redis keys to be more consistent by @ikreymer in #309
Disable Chrome optimization logic by @malemburg in #312
stopping: if crawl is marked as stopping, and no warcs found, mark st… by @ikreymer in #314
Improve thumbnails with sharp by @tw4l in #304
Chrome 112 + new headless mode + consistent viewport tweaks by @ikreymer in #316
allow adding --include with pre-existing --scopeType values (besides … by @ikreymer in #319

New Contributors

@malemburg made their first contribution in #312

Full Changelog: 0.9.1...0.10.0

Contributors

ikreymer, malemburg, and tw4l

