Releases: webrecorder/browsertrix-crawler
Browsertrix Crawler 0.12.0 Beta 1
What's Changed
- Store crawler start and end times in Redis lists by @tw4l in #397
- additional failure logic: by @ikreymer in #402
- tests: disable ad-block tests: seeing inconsistent ci behavior by @ikreymer in #407
- Fast cancelation + remove time counter by @ikreymer in #406
Full Changelog: v0.12.0-beta.0...v0.12.0-beta.1
Browsertrix Crawler 0.12.0 Beta 0
Browsertrix Crawler 0.11.2
Browsertrix Crawler 0.11.1
Bug Fix Release
Should fix several issues where crawls got stuck and did not continue, and/or the screencast stopped after a while, including:
- Detecting 'page crash' events and logging them
- Detecting 'browser crash' events and interrupting crawl (after saving state / ensuring data is written to WARCs)
What's Changed
- favicon: use 127.0.0.1 instead of localhost by @ikreymer in #384
- Error handling fixes to avoid crawler getting stuck. by @ikreymer in #385
- Update CI Release Action by @ikreymer in #386
Full Changelog: v0.11.0...v0.11.1
Browsertrix Crawler 0.11.0
New Features
- Store favicon URLs as favIconUrl in pages.jsonl
- Support for filtering sitemap by date (from specified date)
- Link extraction optimizations
- Behaviors now run only after the page is fully loaded and link extraction has finished; previously, autoplay/autofetch would start right away.
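As an illustration of the favIconUrl field mentioned above, here is a minimal sketch of reading it back out of a pages.jsonl file. Only the favIconUrl key and the pages.jsonl filename come from these notes; the other record keys (url, id, title), the header line, and the favicon_urls helper name are assumptions made for the example and may not match the crawler's actual output.

```python
import json


def favicon_urls(jsonl_text):
    """Map page URL -> favIconUrl for every record that carries both keys.

    Hypothetical helper: the "url" key and the presence of a header
    line are assumptions, not confirmed by the release notes.
    """
    result = {}
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Skip header lines and pages recorded without a favicon.
        if "url" in record and "favIconUrl" in record:
            result[record["url"]] = record["favIconUrl"]
    return result


# Made-up sample data standing in for a real pages.jsonl file.
sample = "\n".join([
    json.dumps({"format": "json-pages-1.0", "id": "pages", "title": "All Pages"}),
    json.dumps({"id": "1", "url": "https://example.com/",
                "title": "Example",
                "favIconUrl": "https://example.com/favicon.ico"}),
    json.dumps({"id": "2", "url": "https://example.com/about", "title": "About"}),
])

print(favicon_urls(sample))
```

Pages without a favicon are simply left out of the mapping, so callers can treat a missing key as "no favicon captured".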
What's Changed
- link extraction optimization: for scopeType page, set depth == extraH… by @ikreymer in #364
- improve exit features: individual instance exit + exit code for interrupt by @ikreymer in #366
- feat: precommit by @Chickensoupwithrice in #363
- Capture Favicon by @Chickensoupwithrice in #362
- logging: resolve confusion with 'crawl done' not being written to log… by @ikreymer in #375
- logging fixes: avoid duplicate logging for same error by @ikreymer in #377
- Surface lastmod option for sitemap parser by @ghukill in #367
- Add example of mounting custom behaviours by @Chickensoupwithrice in #369
- various fixes regarding state restart: by @ikreymer in #370
- status: fix typo setting status to log message by @ikreymer in #379
- Add option to output stats file live, i.e. after each page crawled by @benoit74 in #374
- behavior logging tweaks, add netIdle by @ikreymer in #381
- Update tldextract cache for pywb during build by @vnznznz in #383
- Enhance file stats test to detect file modification by @benoit74 in #382
- optimize link extraction: (fixes #376) by @ikreymer in #380
New Contributors
- @Chickensoupwithrice made their first contribution in #363
- @ghukill made their first contribution in #367
- @benoit74 made their first contribution in #374
- @vnznznz made their first contribution in #383
Full Changelog: v0.10.4...v0.11.0
Browsertrix Crawler 0.10.4
Bug fix release
What's Changed
- args parsing: fix parseRx() for inclusions/exclusions to deal with no… by @ikreymer in #353
- mark for upload-and-delete when crawl is interrupted for any limit: by @ikreymer in #354
- improve crawl stopped check with unified isCrawlRunning() check with … by @ikreymer in #356
Full Changelog: v0.10.3...v0.10.4
Browsertrix Crawler 0.10.3
What's Changed
- Fix for sizeLimit: only delete local data if a WACZ has been uploaded by @ikreymer in #347
- seed parsing: return null if invalid url encountered in parseUrl to a… by @ikreymer in #349
Full Changelog: 0.10.2...v0.10.3
Browsertrix Crawler 0.10.2
What's Changed
- Fix disk utilization computation errors by @tw4l in #338
- profiles: use newly provided puppeteer page.setBypassServiceWorker() … by @ikreymer in #340
- Bump browsertrix-behaviors to ^0.5.1 by @tw4l in #341
- Allow configuration of deduplication policy by @wvengen in #332
- feat: Add custom behavior injection by @lambdahands in #285
New Contributors
- @wvengen made their first contribution in #332
- @lambdahands made their first contribution in #285
Full Changelog: 0.10.1...0.10.2
Browsertrix Crawler 0.10.1
Browsertrix Crawler 0.10.0
Major Changes
- Switch back to Puppeteer from Playwright due to memory issues (#298)
- Internal: redis key {crawl_id}:d is now a count of pages done instead of a list of pages done
- Using Chrome 112 for Crawling
- Can combine predefined crawl scopes with additional include options
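For anything that reads the {crawl_id}:d key directly, the change from a list to a number means the value has to be interpreted differently. The sketch below is a hypothetical compatibility helper, assuming the value has already been fetched by a Redis client: older data arrives as a list of entries, newer data as a numeric string or bytes. The pages_done name and the input shapes are assumptions for illustration, not part of the crawler.

```python
def pages_done(value):
    """Return the number of pages done from an already-fetched Redis value.

    Hypothetical helper. Accepts:
      - None (key missing)               -> 0
      - list/tuple (old list-of-pages)   -> its length
      - int, str, or bytes (new counter) -> the parsed integer
    """
    if value is None:
        return 0
    if isinstance(value, (list, tuple)):
        # Old format: one list entry per finished page.
        return len(value)
    if isinstance(value, bytes):
        # Redis clients often return raw bytes unless decoding is enabled.
        value = value.decode("utf-8")
    return int(value)


print(pages_done(["page-a", "page-b", "page-c"]))  # old list form
print(pages_done(b"3"))                            # new counter form
```

Both calls report the same count, so a reader of the key can handle crawls written by either version.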
What's Changed
- Add option to log errors to redis by @tw4l in #279
- Store done in redis as integer and only save full json in redis for failed pages by @tw4l in #284
- worker: lower wait time, in case where no additional pages remain and… by @ikreymer in #289
- Store archive dir size in Redis by @tw4l in #291
- origin override: add --originOverride source=dest to allow routing wh… by @ikreymer in #281
- Quick exit on redis connection error after interrupt by @ikreymer in #292
- Fixes from 0.9.1 by @ikreymer in #297
- Switch back to Puppeteer from Playwright by @ikreymer in #301
- Catch 4xx and 5xx page.goto() responses to mark invalid URLs as failed by @tw4l in #300
- crawl stopping / additional states: by @ikreymer in #303
- Log fatal messages to redis errors by @tw4l in #305
- Consolidate wacz error loglines by @tw4l in #306
- state: adjust redis keys to be more consistent by @ikreymer in #309
- Disable Chrome optimization logic by @malemburg in #312
- stopping: if crawl is marked as stopping, and no warcs found, mark st… by @ikreymer in #314
- Improve thumbnails with sharp by @tw4l in #304
- Chrome 112 + new headless mode + consistent viewport tweaks by @ikreymer in #316
- allow adding --include with pre-existing --scopeType values (besides … by @ikreymer in #319
New Contributors
- @malemburg made their first contribution in #312
Full Changelog: 0.9.1...0.10.0