I've seen things you people wouldn't believe. Roy Batty, The Preverticant
What's Changed
- Neural tools (Vecalign and Neural Document Aligner) integration by @cgr71ii in #235
- CI and tests updates and fixes by @cgr71ii in #238
- Range of paragraphs count using option `paragraphIdentification` by @lpla in #241
- Document pair output file by @aarongaliano in #242
- Update Bicleaner(-AI) submodules given new Bicleaner Hardrules by @lpla in #244
- Remove Linguacrawl from Bitextor by @aarongaliano in #248
  - Linguacrawl is still compatible with Bitextor through the WARC format, but crawling must now be managed manually
- Metadata code refactorization by @cgr71ii in #245
- Now you can use compatible documents (like PDFs, TXTs, HTMLs) in the Bitextor input without encapsulating them into WARC or Prevertical formats! Check the `directories` and `directoriesFile` documentation, by @aarongaliano in #247
- `PDFprocessing` option (previously `PDFextract`). It is now a list that lets you choose whether to use pdf2html, pdfextract or Apache Tika (new PDF processor), by @aarongaliano in #247
- Now you can use warc2html (e.g. to process PDFs) with warc2text, by @aarongaliano in #247
- New Bitextor `multilang` option (if activated, warc2text will extract content in different languages from the same document), by @aarongaliano in #247
- New Bitextor argument `bicleanerExtraArgs` to pass extra arguments to Bicleaner(-AI) by @lpla in #250
- Add fastspell apt dependencies to Dockerfile by @aliciannz in #249
- Update base Scikit dependency to 1.1.3, including new models for the dict-based document aligner, by @aarongaliano in #243
- New L2 normalization in TF-IDF translation-based document aligner by @lpla in #252
- Updated Python requirements, submodules, and documentation.
- Minor bug fixes and changes (including #253)
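
As an illustration of the L2 normalization mentioned in the TF-IDF document aligner item above, here is a minimal sketch in plain Python. It is not Bitextor's code: the helper names `l2_normalize` and `dot` and the toy weights are assumptions made for this example. The point it shows is that once vectors are scaled to unit length, a plain dot product behaves as cosine similarity, so document length no longer dominates the score.

```python
import math


def l2_normalize(vec):
    """Scale a sparse vector (term -> weight) to unit Euclidean length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    if norm == 0.0:
        return dict(vec)  # all-zero vector: nothing to scale
    return {term: w / norm for term, w in vec.items()}


def dot(a, b):
    """Dot product of two sparse vectors, iterating over the smaller one."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(term, 0.0) for term, w in a.items())


# Hypothetical TF-IDF weights: doc_b has the same term profile as doc_a
# but twice the magnitude (for example, a longer document).
doc_a = {"bitext": 3.0, "crawl": 4.0}
doc_b = {"bitext": 6.0, "crawl": 8.0}

raw_score = dot(doc_a, doc_b)                            # grows with magnitude
score = dot(l2_normalize(doc_a), l2_normalize(doc_b))    # cosine similarity
```

Here `raw_score` depends on how large the weights are, while `score` only reflects the direction of the two vectors, which is the usual motivation for normalizing before comparing documents.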
New Contributors
- @aarongaliano made their first contribution in #242
- @aliciannz made their first contribution in #249
Full Changelog: v8.2...v8.3
Notes
The `bitextor-v8.3.zip` tarball does include submodules code and binaries. If you compile the project after cloning from the repository, you first need to run `git submodule update --init --recursive`. Also, you can't issue this command on the source code `.tar.gz` and `.zip` packages generated by GitHub, so we recommend the `bitextor-v8.3.zip` tarball or cloning the repo at the `v8.3` tag.
We will support the Bitextor `8.x` branch until the next major version is released.