Bitextor 8.3: Snake Runner, the Sentence Retirer

@lpla released this 30 May 10:31

I've seen things you people wouldn't believe. Roy Batty, The Preverticant

What's Changed

  • Neural tools (Vecalign and Neural Document Aligner) integration by @cgr71ii in #235
  • CI and tests updates and fixes by @cgr71ii in #238
  • Range of paragraphs count using option paragraphIdentification by @lpla in #241
  • Document pair output file by @aarongaliano in #242
  • Update Bicleaner(-AI) submodules given new Bicleaner Hardrules by @lpla in #244
  • Remove Linguacrawl from Bitextor by @aarongaliano in #248
    • Linguacrawl output is still compatible with Bitextor through the WARC format, but crawling must now be managed manually
  • Metadata code refactoring by @cgr71ii in #245
  • Now you can use compatible documents (like PDFs, TXTs, HTMLs) in the Bitextor input without encapsulating them into WARC or Prevertical formats! Check the directories and directoriesFile documentation, by @aarongaliano in #247
  • PDFprocessing option (previously PDFextract). It is now a list that lets you choose among pdf2html, pdfextract and Apache Tika (a new PDF processor), by @aarongaliano in #247
  • Now you can use warc2html (e.g. to process PDFs) with warc2text, by @aarongaliano in #247
  • New Bitextor multilang option (if activated, warc2text will extract content in different languages from the same document), by @aarongaliano in #247
  • New Bitextor argument bicleanerExtraArgs to pass extra arguments to Bicleaner(-AI) by @lpla in #250
  • Add fastspell apt dependencies to Dockerfile by @aliciannz in #249
  • Updated base dependency to scikit-learn 1.1.3, including new models for the dictionary-based document aligner, by @aarongaliano in #243
  • New L2 normalization in TF-IDF translation-based document aligner by @lpla in #252
  • Updated Python requirements, submodules, and documentation.
  • Minor bug fixes and changes (including #253)

New Contributors

Full Changelog: v8.2...v8.3

Notes

The bitextor-v8.3.zip archive includes the submodules' code and binaries. If you build the project from a clone of the repository, first run git submodule update --init --recursive. That command does not work on the source code .tar.gz and .zip packages generated by GitHub, so we recommend using the bitextor-v8.3.zip archive or cloning the repository at the v8.3 tag.
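
For the clone-based route, a minimal sketch could look like the following. The repository URL is an assumption (these notes do not state it), and the function name fetch_bitextor and the target directory bitextor are hypothetical names chosen for illustration. Only the v8.3 tag and the submodule command come from the notes above.

```shell
# Assumed repository URL; it is not given in the release notes.
REPO_URL="https://github.com/bitextor/bitextor.git"
# Release tag named in the notes.
TAG="v8.3"

# Hypothetical helper: clone the tagged release, then fetch all submodules.
fetch_bitextor() {
    # --branch also accepts a tag name; the checkout ends up on a detached HEAD.
    git clone --branch "$TAG" "$REPO_URL" bitextor &&
    # Submodules are not part of a plain clone, so initialise them recursively.
    git -C bitextor submodule update --init --recursive
}
```

Calling fetch_bitextor from an empty working directory should leave a bitextor directory with the submodules checked out, ready for compilation.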

We will support the Bitextor 8.x branch until the next major version is released.