Releases: webrecorder/browsertrix-crawler

Browsertrix Crawler v1.1.1

02 May 16:01
22b2136

What's Changed

  • Avoid crashes when editing / creating profile and navigation is interrupted
  • profiles: ensure all page.goto() promises have at least catch block/a… by @ikreymer in #559
  • profiles: ensure initial page.load() is awaited by @ikreymer in #561

Full Changelog: v1.1.0...v1.1.1

Browsertrix Crawler v1.1.0

19 Apr 04:57
15d2b09

Major Features

Support for QA Crawling (https://crawler.docs.browsertrix.com/user-guide/qa/)

What's Changed

  • QA Crawl Support (Beta) by @ikreymer in #469
  • Use RFC2606 invalid domain names by @vnznznz in #514
  • Fixes from 1.0.3 release -> main by @ikreymer in #517
  • Unify WARC writing + CDXJ indexing into single class by @ikreymer in #507
  • upgrade puppeteer-core to 22.6.1 by @ikreymer in #516
  • avoid cloudflare detection of puppeteer when using browser profiles by @ikreymer in #518
  • add an extra --postLoadDelay param to specify how many seconds to wait after page-load by @ikreymer in #520
  • Gracefully handle non-absolute path for create-login-profile --filename by @tw4l in #521
  • Make /app world-readable to better support non-root usage by @vnznznz in #523
  • merge V1.0.4 change -> main by @ikreymer in #527
  • Revert "Make /app world-readable to better support non-root usage" by @ikreymer in #529
  • ensure all warcwriter write operations go through a queue. by @ikreymer in #528
  • qa/replay crawl loading improvements by @ikreymer in #526
  • Separate writing pages to pages.jsonl + extraPages.jsonl to use with new py-wacz by @ikreymer in #535
  • Adblock support by @ikreymer in #534
  • Remove no longer needed invalid Brave update URLs by @tw4l in #539
  • Better logging of all queue WARCWriter operations by @ikreymer in #536
  • qa: filter out non-html pages by @ikreymer in #541
  • Fix for --rolloverSize for individual WARCs in 1.x by @ikreymer in #542
  • Set mime type for html pages by @tw4l in #545
  • allow minio to connect to other regions by @mguella in #543
  • replay counts: don't filter out URLs with __wb_method to avoid dispar… by @ikreymer in #552
  • Add crawler QA docs by @tw4l in #551
  • Support site-specific wait via browsertrix-behaviors by @ikreymer in #555
  • warcinfo: fix version to 1.1 to avoid confusion (part of #553) by @ikreymer in #557
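
For context on the RFC 2606 item in the list above: that RFC reserves the top-level domains .test, .example, .invalid and .localhost, plus the second-level names example.com, example.net and example.org, so they can be used in tests and documentation without colliding with real sites. The following is a minimal Python sketch of a check for such names. The helper name is_rfc2606_reserved is hypothetical and is not part of Browsertrix Crawler; it only illustrates the reservation rules.

```python
# Hypothetical helper for illustration only; not part of Browsertrix Crawler.
# Names reserved by RFC 2606 for testing and documentation.
RESERVED_TLDS = {"test", "example", "invalid", "localhost"}
RESERVED_DOMAINS = {"example.com", "example.net", "example.org"}


def is_rfc2606_reserved(hostname: str) -> bool:
    """Return True if hostname falls under a name reserved by RFC 2606."""
    # Normalize: trim whitespace, drop a trailing root dot, ignore case.
    host = hostname.strip().rstrip(".").lower()
    if not host:
        return False
    labels = host.split(".")
    # Reserved top-level domain, e.g. "site.invalid".
    if labels[-1] in RESERVED_TLDS:
        return True
    # Reserved second-level domain, e.g. "www.example.com".
    return ".".join(labels[-2:]) in RESERVED_DOMAINS
```

A name such as does-not-exist.invalid is guaranteed never to resolve to a real site, which is why such names suit test fixtures.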

Full Changelog: v1.0.4...v1.1.0

Browsertrix Crawler 1.1.0 Beta 5

15 Apr 21:53
efebc33
Pre-release

What's Changed

  • Separate writing pages to pages.jsonl + extraPages.jsonl to use with new py-wacz by @ikreymer in #535
  • Adblock support by @ikreymer in #534
  • Remove no longer needed invalid Brave update URLs by @tw4l in #539
  • Better logging of all queue WARCWriter operations by @ikreymer in #536
  • qa: filter out non-html pages by @ikreymer in #541
  • Fix for --rolloverSize for individual WARCs in 1.x by @ikreymer in #542
  • Set mime type for html pages by @tw4l in #545

Full Changelog: v1.1.0-beta.4...v1.1.0-beta.5

v1.1.0-beta.4

05 Apr 01:14
c247189
Pre-release

What's Changed

  • Gracefully handle non-absolute path for create-login-profile --filename by @tw4l in #521
  • refactor handling of max size for html/js/css by @ikreymer in #525
  • merge V1.0.4 change -> main by @ikreymer in #527
  • ensure all warcwriter write operations go through a queue. by @ikreymer in #528
  • qa/replay crawl loading improvements by @ikreymer in #526

Full Changelog: v1.1.0-beta.3...v1.1.0-beta.4

Browsertrix Crawler v1.0.4

03 Apr 22:23
a3f93ca

What's Changed

  • refactor handling of max size for html/js/css by @ikreymer in #525
    Fix for #522, issues loading pages with large streaming js/css

Full Changelog: v1.0.3...v1.0.4

Browsertrix Crawler 1.1.0 Beta 3 (QA Support)

29 Mar 00:21

What's Changed

  • Use RFC2606 invalid domain names by @vnznznz in #514
  • Fixes from 1.0.3 release -> main by @ikreymer in #517
  • Unify WARC writing + CDXJ indexing into single class by @ikreymer in #507
  • upgrade puppeteer-core to 22.6.1 by @ikreymer in #516
  • avoid cloudflare detection of puppeteer when using browser profiles by @ikreymer in #518
  • add an extra --postLoadDelay param to specify how many seconds to wait after page-load by @ikreymer in #520
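
As an illustration of the --postLoadDelay item above, the sketch below assembles a crawler command line in Python. Only the --postLoadDelay flag and its meaning (seconds to wait after page load) come from these notes. The Docker image name, the crawl subcommand, the --url flag and the function name build_crawl_command are assumptions made for the example and should be checked against the crawler documentation.

```python
import shlex


def build_crawl_command(url: str, post_load_delay: int) -> list[str]:
    """Build an argument list for a crawl with a post-load delay.

    Assumed for illustration: the image name, the "crawl" subcommand
    and the --url flag. Only --postLoadDelay is taken from the notes.
    """
    if post_load_delay < 0:
        raise ValueError("post_load_delay must be zero or positive")
    return [
        "docker", "run",
        "webrecorder/browsertrix-crawler",  # assumed image name
        "crawl",
        "--url", url,
        "--postLoadDelay", str(post_load_delay),  # seconds to wait after page load
    ]


# Print the command as a single shell-quoted string for inspection.
print(shlex.join(build_crawl_command("https://example.com/", 5)))
```

Building the arguments as a list rather than a single string keeps quoting explicit, so the list can be passed directly to subprocess.run without shell interpolation.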

Full Changelog: v1.1.0-beta.2...v1.1.0-beta.3

Browsertrix Crawler 1.0.3

26 Mar 21:11

What's Changed

  • fixes redirected seed (from #475) being counted against page limit by @ikreymer in #509
  • sitemap improvements: gz support + application/xml + extraHops fix by @ikreymer in #511

Full Changelog: v1.0.2...v1.0.3

Browsertrix Crawler 1.1.0 Beta 2 (QA Crawl Support Beta)

23 Mar 05:11

What's Changed

  • Docs: Minor fixes to edit link & clarifications by @Shrinks99 in #501
  • Improved support for running as non-root by @ikreymer in #503
  • improvements to 'non-graceful' interrupt to ensure WARCs are still closed gracefully by @ikreymer in #504
  • service worker capture fix: disable by default for now by @ikreymer in #506
  • QA Crawl Support (Beta) by @ikreymer in #469

Full Changelog: v1.1.0-beta.1...v1.1.0-beta.2

Browsertrix Crawler 1.0.2

22 Mar 20:38
22a7351

What's Changed

  • service worker capture fix: disable service workers by default for now, add cli option by @ikreymer in #506

Full Changelog: v1.0.1...v1.0.2

Browsertrix Crawler 1.0.1

21 Mar 20:58
93c3894

What's Changed

  • Docs: Minor fixes to edit link & clarifications by @Shrinks99 in #501
  • Improved support for running as non-root by @ikreymer in #503
  • improvements to 'non-graceful' interrupt to ensure WARCs are still closed gracefully by @ikreymer in #504

Full Changelog: v1.0.0...v1.0.1