Releases: bitextor/warc2text

v1.2.0

02 Feb 14:41

What's Changed

  • Add --robotspass shunt for records related to robots.txt by @jelmervdl in #43
  • Add --jsonl option by @jelmervdl in #35
  • warc2html changes by @ZJaume in #50
  • ZSTD compression and compression level support by @ZJaume in #51
  • Move JSONL output to --stdout and allow file-based output with JSONL by @ZJaume in #52
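
For readers consuming the JSONL output downstream, the following is a minimal sketch of a reader. It assumes only that each non-empty line is one JSON object; it makes no assumption about which fields warc2text emits, and the helper name `iter_jsonl` is hypothetical, not part of warc2text.

```python
import json
import sys
from typing import Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line (hypothetical helper)."""
    for line in lines:
        line = line.strip()
        if not line:
            # Skip blank lines rather than failing on them.
            continue
        yield json.loads(line)


if __name__ == "__main__":
    # Read records piped in on standard input and print the keys of each one,
    # which is a quick way to inspect the fields actually present in the output.
    for record in iter_jsonl(sys.stdin):
        print(sorted(record.keys()))
```

Printing the keys first, rather than hard-coding field names, avoids depending on a record layout that these release notes do not describe.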

Full Changelog: v1.1.0...v1.2.0

v1.1.0: Merge pull request #36 from jelmervdl/fasttext-option

01 Aug 13:09
eac887e

Changes:

  • Add option to use a FastText model as a language identifier
  • Records identified by CLD2 as Unknown are now classified as unk instead of being dropped.

v1.0.0

01 Aug 13:08
673e371
Paragraph indexes now start at 1.