Releases: bitextor/bitextor
Bitextor 8.3: Snake Runner, the Sentence Retirer
"I've seen things you people wouldn't believe.", Roy Batty, The Preverticant
What's Changed
- Neural tools (Vecalign and Neural Document Aligner) integration by @cgr71ii in #235
- CI and tests updates and fixes by @cgr71ii in #238
- Range of paragraphs count using option `paragraphIdentification` by @lpla in #241
- Document pair output file by @aarongaliano in #242
- Update Bicleaner(-AI) submodules given new Bicleaner Hardrules by @lpla in #244
- Remove Linguacrawl from Bitextor by @aarongaliano in #248
  - It is still compatible with Bitextor regarding the WARC format, but crawling management should be performed manually
- Metadata code refactorization by @cgr71ii in #245
- Now you can use compatible documents (like PDFs, TXTs, HTMLs) in the Bitextor input without encapsulating them into WARC or Prevertical formats! Check `directories` and `directoriesFile` documentation, by @aarongaliano in #247
- `PDFprocessing` option (previously `PDFextract`). Now it is a list that allows you to choose whether to use pdf2html, pdfextract or Apache Tika (new PDF processor), by @aarongaliano in #247
  - Now you can use warc2html (e.g. to process PDFs) with warc2text, by @aarongaliano in #247
- New Bitextor `multilang` option (if activated, warc2text will extract content in different languages from the same document), by @aarongaliano in #247
- New Bitextor argument `bicleanerExtraArgs` to pass extra arguments to Bicleaner(-AI) by @lpla in #250
- Add fastspell apt dependencies to Dockerfile by @aliciannz in #249
- Updated the scikit-learn base dependency to 1.1.3, including new models for the dict-based document aligner, by @aarongaliano in #243
- New L2 normalization in TF-IDF translation-based document aligner by @lpla in #252
- Updated Python requirements, submodules, and documentation.
- Minor bug fixes and changes (including #253)
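As a rough illustration of the L2 normalization mentioned above for the TF-IDF document aligner, here is a minimal Python sketch. It is not the Bitextor implementation: the function names and the `idf` table are assumptions made only for this example. It shows that once sparse TF-IDF vectors are scaled to unit length, a plain dot product gives their cosine similarity.

```python
import math
from collections import Counter


def tfidf_vector(tokens, idf):
    """Weight raw term counts by a precomputed IDF table (hypothetical input)."""
    counts = Counter(tokens)
    return {term: count * idf.get(term, 0.0) for term, count in counts.items()}


def l2_normalize(vector):
    """Scale a sparse vector so that its Euclidean (L2) norm is 1."""
    norm = math.sqrt(sum(weight * weight for weight in vector.values()))
    if norm == 0.0:
        # An all-zero vector cannot be scaled; return it unchanged.
        return dict(vector)
    return {term: weight / norm for term, weight in vector.items()}


def dot(a, b):
    """Dot product of two sparse vectors; for unit vectors this is the cosine."""
    return sum(weight * b.get(term, 0.0) for term, weight in a.items())
```

With normalized vectors, long and short documents become comparable because the score no longer grows with document length.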
New Contributors
- @aarongaliano made their first contribution in #242
- @aliciannz made their first contribution in #249
Full Changelog: v8.2...v8.3
Notes
`bitextor-v8.3.zip` tarball does include submodules code and binaries. If you start compiling the project after cloning from the repository, you first need to run `git submodule update --init --recursive`. Also, you can't issue this command on the source code `.tar.gz` and `.zip` packages generated by GitHub, so we recommend the `bitextor-v8.3.zip` tarball or cloning the repo `v8.3` tag.
We will support Bitextor `8.x` branch until the next major version is released.
Bitextor 8.2: Snow White and the Hunspell
"I told you to run.", The Huntsman
What's Changed
- Prevertical2text integration by @cgr71ii in #223
- Paragraph identification by @cgr71ii in #225
- Change default sentence splitter (now it is Loomchild's Segment) and Bicleaner AI integration by @cgr71ii in #226
- Use headers for descriptive column names in TSV input/output files by @cgr71ii in #227
- Add pip optional dependencies by @cgr71ii in #229
Full Changelog: v8.1.1...v8.2
Notes
`bitextor-v8.2.zip` tarball does include submodules code and binaries. If you start compiling the project after cloning from the repository, you first need to run `git submodule update --init --recursive`. Also, you can't issue this command on the source code `.tar.gz` and `.zip` packages generated by GitHub, so we recommend the `bitextor-v8.2.zip` tarball or cloning the repo `v8.2` tag.
We will support Bitextor `8.x` branch until the next major version is released.
v8.1.1
- Added support for Fedora installation. Check INSTALL.md for `dnf` commands.
- Fixed `tests/run-tests.sh` to run those tests either sequentially (low resource server, using bash variable `CI="true"`) or in parallel.
- Removed default file type filter in `wget` crawler, as it has issues with URLs without extension.
- Bicleaner model training and dictionary generation options reworked:
  - `bicleaner` will enable or disable Bicleaner, and `bicleanerModel` will contain the path to the model.
  - Bicleaner model training will need to be explicitly enabled with `bicleanerGenerateModel` instead of checking whether the model provided through the `bicleanerModel` config setting exists or not.
  - Dictionary generation will need to be set through `generateDic` instead of checking whether the dictionary exists or not.
- Updated Python requirements.
- Minor bug fixes.
Notes
`bitextor-v8.1.1.zip` tarball does include submodules code and binaries. If you start compiling the project after cloning from the repository, you first need to run `git submodule update --init --recursive`. Also, you can't issue this command on the source code `.tar.gz` and `.zip` packages generated by GitHub, so we recommend the `bitextor-v8.1.1.zip` tarball or cloning the repo `v8.1.1` tag.
We will support Bitextor `8.x` branch until the next major version is released.
The Lost Word: Jurassic Warc (PG-8.1 rating)
"Oh my God! A snake! Help me!", Dr. Robert Burke
v8.1 Changelog
- Major rework on paths and installation folders to allow Bitextor to be installed in a specific location
  - Check out installation instructions and details in INSTALL.md
- Replaced Tensorflow and Keras in the dictionary-based document aligner with scikit-learn
- General clean up of Python code
- Updated submodules and Python requirements versions
Notes
`bitextor-v8.1.zip` tarball does include submodules code and binaries. If you start compiling the project after cloning from the repository, you first need to run `git submodule update --init --recursive`. Also, you can't issue this command on the source code `.tar.gz` and `.zip` packages generated by GitHub, so we recommend the `bitextor-v8.1.zip` tarball or cloning the repo `v8.1` tag.
We will support Bitextor `8.x` branch until the next major version is released.
v8.0.1
- Deferred crawling standoff annotation reconstruction script has been rewritten for better performance
  - This one benefits from LRU dict as a limited-size hash memory-based cache
  - Uses native warcio and Moses sentence splitter (Python port)
- Fix `bitextor-buildTMX.py` dedup option
  - Dedup was keeping sentence strings from the best score from Bifixer, but the other columns from the last occurrence (url, deferred crawling standoff annotation, bicleaner score...)
- Bitextor now checks whether a provided host is valid
- Updated submodules
  - `warc2text` removed URLs lowercasing
- Added more tests to the CI, including Bitextor with deferred crawling standoff annotation and its reconstruction.
- Updated requirements and submodules to their latest stable version.
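The dedup fix described above comes down to keeping the whole row of the best-scoring occurrence, rather than mixing its sentence text with columns from a later duplicate. The following Python sketch illustrates that idea only; it is not the code in `bitextor-buildTMX.py`, and the field names `sentence_key` and `bifixer_score` are hypothetical.

```python
def dedup_keep_best(rows, key_field="sentence_key", score_field="bifixer_score"):
    """Keep one complete row per key: the one with the highest score.

    `rows` is an iterable of dicts. The field names are assumptions for
    this sketch and do not come from the Bitextor code base.
    """
    best = {}
    for row in rows:
        key = row[key_field]
        # Replace the stored row as a whole, so text, URL and scores
        # always come from the same occurrence.
        if key not in best or row[score_field] > best[key][score_field]:
            best[key] = row
    return list(best.values())
```

Because each stored value is a full row, the URL, standoff annotation and scores in the output always belong to the sentence text next to them.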
Notes
`bitextor-v8.0.1.zip` tarball does include submodules code and binaries. If you start compiling the project after cloning from the repository, you first need to run `git submodule update --init --recursive`. Also, you can't issue this command on the source code `.tar.gz` and `.zip` packages generated by GitHub, so we recommend the `bitextor-v8.0.1.zip` tarball or cloning the repo `v8.0.1` tag.
We will support Bitextor `8.x` branch until the next major version is released.
Kill Bill-ingual: Vol. 8
"We have unfinished business.", Beatrix
v8.0 Changelog
- Deep rewrite of Bitextor Snakefile for a vast performance improvement.
- Some config parameters and intermediate generated files also changed, so reusing old config files and transient or permanent folders from old runs would introduce issues.
- Snakemake project structure now matches the standard.
- Now we are listed on a comprehensive catalog of standards compliant, public, Snakemake workflows from the official Snakemake developers
- All features from previous Bitextor version work.
- Machine translation system training now should be performed manually.
- Added a new crawler: linguacrawl, specialized in full TLD crawling.
- Added a new method for deferred crawling only using Murmurhash hashes at the sentence alignment step.
  - A reconstructor is also provided: `deferred-annotation-reconstructor.sh`
- Added sharding, which groups domains into 1 GB shards for a more balanced job running, done via giashard (Golang Internet Archive SHARDing).
- A new WARC processor has been implemented in C++: warc2text
  - It is faster than the previous text extraction tool `giawarc` (now deprecated) and `warc2preprocess`.
  - Although it has the same features as giawarc, it still lacks features like PDF processing or boilerplate removal that are available in `warc2preprocess`.
- Multiple improvements to `bitextor-warc2htmlwarc.py` and `bitextor-warc2preprocess.py`:
  - Added `lxml` text extraction parsing library option, and `html5lib` as optional and additional parsing
    - `html5lib` is the cleanest supported parser but also the slowest
  - Deleted `alcazar` as all code and references from upstream vanished.
  - Fixed ‘simple’ text extraction parser for some table tags and new HTML5 tags.
  - `ftfy` is now disabled by default.
- New translation based document aligner written in C++ (`document-aligner` folder)
  - Faster and less memory requirements than the previous Python code.
- Moses tokenizers are now used by default through an efficient wrapper.
  - This will run by default if "wordTokenizers" is not defined in Bitextor configuration.
  - This is the recommended option if your language is supported by Moses.
- Moses sentence splitter original script has been replaced with a faster port by Mediacloud.
  - This will run by default if "sentenceSplitters" is not defined in Bitextor configuration.
  - This is the recommended option if your language is supported by the latest Moses release version of the sentence splitter script.
- Added support for Biroamer
- Deprecated autotools and replaced them with CMake.
- Refactored and updated requirements and submodules for lots of performance and security improvements.
- Now you can ignore the Python dependencies from modules you don't need to run by commenting those lines in requirements.txt before installing them.
- Updated Snakemake to v6.0.5:
- Refactored bleualign-cpp code to improve efficiency and memory requirements.
- pdf-extract now processes text with sentence-join (consult Bitextor documentation for instructions)
- Deleted old and deprecated files and folders, like `slurm`, `nmt` workflow for MarianNMT or `pdf-extract` (replaced by wrappers in WARC processors).
- General system stability improvements to enhance the user's experience.
- Conda release builds are up.
- Docker builds have the same automatic build system, adding nightlies from Github master branch pushes (`edge` tag in Dockerhub).
- Continuous integration has been activated through Github Actions.
- Discussions are now open in Github! Use them to chat about releases or topics that don't fit in issues section.
- Discord server is also up for a more live chat with other users and developers! Also there are some bots to keep you updated with some news about Bitextor development and related projects.
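To give a feel for the sharding step mentioned in this changelog (grouping domains into shards of about 1 GB), here is a small Python sketch of one possible greedy grouping. It does not describe how giashard actually works. The function name, the input shape (pairs of domain and size in bytes) and the choice of 1024³ bytes as the limit are all assumptions for this example.

```python
GIB = 1024 ** 3  # assumed shard limit for this sketch


def group_into_shards(domain_sizes, max_shard_bytes=GIB):
    """Greedily pack (domain, size_in_bytes) pairs into size-bounded shards.

    A new shard is started whenever adding the next domain would exceed
    the limit. A single domain larger than the limit gets its own shard.
    """
    shards = []
    current, current_bytes = [], 0
    for domain, size in domain_sizes:
        if current and current_bytes + size > max_shard_bytes:
            shards.append(current)
            current, current_bytes = [], 0
        current.append(domain)
        current_bytes += size
    if current:
        shards.append(current)
    return shards
```

Shards of similar size tend to produce jobs of similar duration, which is the balancing effect the changelog entry refers to.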
Notes
`bitextor-v8.0.zip` tarball does include submodules code and binaries. If you start compiling the project after cloning from the repository, you first need to run `git submodule update --init --recursive`. Also, you can't issue this command on the source code `.tar.gz` and `.zip` packages generated by GitHub, so we recommend the `bitextor-v8.0.zip` tarball or cloning the repo `v8.0` tag.
We will support Bitextor `8.x` branch until the next major version is released.
pre-8.0.0 Paracrawl release
v8.0.0-pre Changelog
- Deep rewrite of Bitextor Snakefile for a vast performance improvement.
  - Still missing dictionary-based document aligner and `hunalign` options and rules, will be integrated soon.
  - We recommend reviewing Bitextor README.md to check new option naming or formats.
  - Some intermediate files also changed, so reusing old runs would introduce issues.
- Added sharding mode, which groups domains into 1 GB shards for a more balanced job running.
  - It uses giashard tool.
- Added `lxml` text extraction parsing library option to `bitextor-warc2htmlwarc.py` and `html5lib` optional and additional parsing.
  - This is needed for proper deferred crawling in newest Bitextor code.
    - Deferred crawling is still only supported under `warc2preprocess` preprocessor.
  - `html5lib` is the cleanest supported parser (like a web browser) but also the slowest.
- Fixed `simple` text extraction parser in `bitextor-warc2preprocess.py` for some table tags and new HTML5 tags.
- `ftfy` is now disabled by default.
- Moses sentence splitter and tokenizer are now used by default through an efficient Python wrapper.
  - This will happen if `wordTokenizers` and `sentenceSplitters` are not defined.
  - This is the recommended option if your language is supported by these scripts.
- Updated README.md.
- Refactored and updated requirements and submodules for lots of performance improvements.
  - Now you can ignore the Python dependencies from modules you don't need to run by commenting those lines in `requirements.txt` before installing them.
  - Deferred crawling functions now can be easily imported.
  - Refactored `bleualign-cpp` code.
    - Faster and less memory requirements.
  - New translation based document aligner written in C++.
    - Faster and less memory requirements than the previous Python code.
  - New base64 scripts from `kpu/preprocess` and `cache` fixes.
  - Bifixer now filters sentence pairs if one side has more than 1024 characters.
- General system stability improvements to enhance the user's experience.
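The Bifixer length filter listed above can be pictured with a very small Python sketch. This is only an illustration of the rule as stated in the changelog (drop a pair when either side exceeds 1024 characters), not Bifixer's own code; the function and constant names are made up for the example.

```python
MAX_CHARS = 1024  # limit quoted in the changelog entry


def keep_pair(source, target, max_chars=MAX_CHARS):
    """Return True when neither side of the sentence pair exceeds the limit."""
    return len(source) <= max_chars and len(target) <= max_chars


def filter_pairs(pairs, max_chars=MAX_CHARS):
    """Drop every (source, target) pair where one side is too long."""
    return [(src, tgt) for src, tgt in pairs if keep_pair(src, tgt, max_chars)]
```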
Notes
Docker image will be updated once v8.0.0 gets released.
`bitextor-v8.0.0-pre.zip` tarball does include submodules code, but you still need to compile binaries like bleualign. If you start compiling the project after cloning from the git repository, you first need to run `git submodule update --init --recursive`. Also, you can't issue this command on the source code `.tar.gz` and `.zip` packages generated by GitHub, so we recommend the `bitextor-v8.0.0-pre.zip` tarball or cloning the repo `v8.0.0-pre` tag.
v7.3.2
- Fixed `warc2htmlwarc.py` optional non-compressed output.
- Fixed `bicleaner` and `bifixer` cached call from Bitextor, improving performance.
- Fixed paths in test files.
- Fixed `heritrix` waiting time while creating initial crawling files.
- Fixed some deprecation errors from exceptions and old options.
- Fixed TMX and TXT deduplicated output, which now writes the first-occurrence text of a deduplicated sentence.
- Fixed reproducibility issues using `bicleaner` cached call by creating a Bitextor optional parameter called `bicleanerCacheWithSents`.
- Updated submodules to fix some bugs.
  - Bifixer: fixed crash on empty segments.
  - Bicleaner: version 0.13, less aggressive hardrules for short sentences (3-word sentences).
- Fixed `cld3` input in `bitextor-warc2preprocess.py`, which was causing most documents to be detected as 'English'.
- Fixed extracted text from `<span>` by adding a space after their content, in the `warc2preprocess` text extractor `simple`.
- Updated some `requirements.txt` for security and dependency issues.
- Updated latest docker image and tagged as `v7.3.2`.
Notes
We started integrating Bitextor 8.0 development branches into `master` branch. If you don't need latest features but a more stable code, please use released versions/tags or the stable branch `7.x`.
`bitextor-v7.3.2.zip` tarball does include submodules code and binaries. If you start compiling the project after cloning from the repository, you first need to run `git submodule update --init --recursive`. Also, you can't issue this command on the source code `.tar.gz` and `.zip` packages generated by GitHub, so we recommend the `bitextor-v7.3.2.zip` tarball or cloning the repo `v7.3.2` tag.
We will support Bitextor `7.x` branch until Bitextor 8 is released.
v7.3.1
- Fixed example and test config files typos and new Bicleaner model filenames
- Fixed tilde paths (~ as `/home/user`) when used in config files
- Fixed warcio HTTPHeader modification without recalculating content length (reported upstream for more details)
- Fixed `bitextor-warc2htmlwarc.py` stdin and stdout run mode.
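For the tilde path fix listed above, the usual Python approach is `os.path.expanduser`. The sketch below shows that idea on a plain dictionary standing in for a loaded config. It is an assumption about the general technique, not the actual Bitextor code; the helper name and the list of path keys are hypothetical.

```python
import os


def expand_config_paths(config, path_keys):
    """Return a copy of `config` with a leading ~ expanded in the given keys.

    `config` is a plain dict standing in for a loaded config file, and
    `path_keys` names the entries that hold filesystem paths. Both are
    assumptions made for this sketch.
    """
    expanded = dict(config)
    for key in path_keys:
        value = expanded.get(key)
        if isinstance(value, str):
            # "~/data" becomes "<home directory>/data"; other strings are unchanged.
            expanded[key] = os.path.expanduser(value)
    return expanded
```

For example, calling `expand_config_paths({"dataDir": "~/data"}, ["dataDir"])` replaces the tilde with the current user's home directory.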
Notes
`bitextor-v7.3.zip` tarball does include submodules code and binaries. If you start compiling the project after cloning from the repository, you first need to run `git submodule update --init --recursive`. Also, you can't issue this command on the source code `.tar.gz` and `.zip` packages generated by GitHub, so we recommend the `bitextor-v7.3.zip` tarball or cloning the repo `v7.3` tag.
We will support Bitextor `7.x` branch until Bitextor 8 is released.
Morty Python: Crawling Corpus, S07E03
"Always look on the end (side) of life", PEP-373
v7.3 Changelog
- Added support for Heritrix crawler and installation instructions.
- Added `plainTextHashes` option (incremental recrawling using mmh3).
- WARC files read and write processes are standardized now (individually compressed records in gzip format).
- Integrated `cld3` in both `giawarc` and `warc2preprocess` WARC processors, with optional install and use instructions.
- Added several optional WARC preprocessing variables like `onlyPreprocessing`, `preprocessLangs` and `targetLangs` to allow processing more than two languages in the same run.
  - This changed the type of some basic variables, like `wordTokenizers` and `sentenceSplitters`, and added new ones like `reverseOutputPair`.
- Added morphological analysers option (like Apertium) to improve document and `hunalign` sentence alignment.
- Restructured the output and temporary files folders.
  - We added `dataDir` as a folder with the data produced during WARC preprocessing step.
  - Preprocessed documents are now stored in one file per language.
- Automated bicleaner model training if not provided by the user.
- Updated README.md.
- Updated Docker image and added dockerfile.
- Updated requirements and submodules for lots of performance improvements.
- Dropped support of EOL Python 2.
- General system stability improvements to enhance the user's experience.
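The phrase "individually compressed records in gzip format" in this changelog refers to a file made of several gzip members written back to back, one per record. The Python sketch below shows that layout with the standard library only. It uses plain byte strings as stand-in records rather than real WARC records, and the helper names are invented for the example.

```python
import gzip
import zlib


def write_records(records):
    """Compress each record as its own gzip member and concatenate them."""
    return b"".join(gzip.compress(record) for record in records)


def iter_members(blob):
    """Yield the decompressed content of each gzip member in `blob`."""
    while blob:
        # wbits=31 tells zlib to expect a gzip header and trailer.
        decompressor = zlib.decompressobj(wbits=31)
        data = decompressor.decompress(blob) + decompressor.flush()
        yield data
        # Bytes after the end of this member belong to the next one.
        blob = decompressor.unused_data
```

A reader that only understands ordinary gzip can still decompress the whole file in one pass, while a record-aware reader can handle one member at a time.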
Notes
`bitextor-v7.3.zip` tarball does include submodules code and binaries. If you start compiling the project after cloning from the repository, you first need to run `git submodule update --init --recursive`. Also, you can't issue this command on the source code `.tar.gz` and `.zip` packages generated by GitHub, so we recommend the `bitextor-v7.3.zip` tarball or cloning the repo `v7.3` tag.
We will support Bitextor `7.x` branch until Bitextor 8 is released.