Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

Unreleased

Added

support lingua based for language detection (#65)

Removed

Python 3.7 support

3.0.0 - 2023-10-11

Added

opusfilter-autogen script for automatic filter config generation
score_direction, accept_threshold, and reject_threshold properties for filters

Changed

refactor code and move auxiliary methods to opusfilter.util
update varikn installation instructions (installable from PyPI)
update github workflows and include Python 3.11 tests
update library version requirements to support Python 3.11
use xxhash instead of pyhash for hash functions
use opus-fast-mosestokenizer instead of fast-mosestokenizer
install eflomal from PyPI and use the new interface in WordAlignFilter

Removed

Python 3.6 support

Fixed

catch NotImplementedError from beautifulsoup 4.11.2
catch ParserRejectedMarkup from beautifulsoup 4.12.0

2.6.0 - 2022-11-30

Added

add slice missing from the enabled steps

Changed

improve documentation
import slow libraries only when needed
use chunks for the filter method of SentenceEmbeddingFilter
change RepetitionFilter to use single score for consistency with the threshold

Fixed

allow float thresholds for AverageWordLengthFilter
remove unnecessary code from RegExpSub
add setuptools version requirement

2.5.1 - 2022-09-28

Fixed

add missing document file

2.5.0 - 2022-09-28

Added

map_space_to option for Jieba and MeCab tokenizers to preserve existing space characters in input
parallel processing options for filter, score, and preprocess steps

Changed

re-organize documentation and support building it with sphinx

Fixed

catch TypeError exceptions from BeautifulSoup in HtmlTagFilter

2.4.0 - 2022-04-05

Added

an option to write filter scores to a file with opusfilter-test
new filters: AlphabetRatioFilter, RegExpFilter, SimilarityFilter, SentenceEmbeddingFilter
support for Japanese word segmentation using MeCab as a tokenizer
preprocessing methods for subword segmentation (BPESegmentation, MorfessorSegmentation)
subword segmentation support for the n-gram language models and language model filters

Changed

allow per-language parameters for LengthFilter, LengthRatioFilter, LongWordFilter, and AverageWordLengthFilter
fix documentation for train_aligment parameters

2.3.1 - 2022-01-28

Fixed

fix bug in classifier training without development set

2.3.0 - 2022-01-18

Added

new OpusFilterRuntimeError exception for having e.g. empty training data
option to save scores from the training data when creating word aligment priors
RepetitionFilter for filtering segments with repeated substrings
new preprocessor for sentence splitting monolingual data
method-specific options for LanguageIDFilter
chunksize option to the common section
LMClassifierFilter for classification based on n-gram language models

Changed

add workdir attribute to the FilterABC base class and change that the filters should use it for any file parameters
increase default chunksize in FilterPipeline from 10000 to 100000
refactor and clean up code

2.2.0 - 2021-11-23

Added

support for Chinese word segmentation using jieba as a tokenizer (#27)

2.1.2 - 2021-11-11

Fixed

fix wrong keyword argument name in opusfilter-duplicates

2.1.1 - 2021-10-19

Changed

move "How to contribute" to docs/CONTRIBUTING.md

Fixed

fix setuptools requirement (#21)
fix version requirement for pandas (>=1.0.0)

2.1.0 - 2021-08-31

Changed

replace PyYAML with ruamel.yaml

Added

support for variables in the YAML configuration (#13)
support to fasttext based for language detection (#20)
suppress_prompts parameter for opus_read (#19)
download and write steps
"How to contribute" section to README.md
changelog
bibliography and improved references

2.0.0 - 2021-06-01

Changed

extend to n-lingual parallel data instead of just bilingual data
switch tokenizer to fast-mosestokenizer

Added

new commands: opusfilter-diagram, opusfilter-duplicates, opusfilter-test
new filters: LongestCommonSubstringFilter, AverageWordLengthFilter
new steps: preprocess
set "latest" as the default corpus release for opus_read (#5)
overlap option for remove_duplicates
lower threshold option for CrossEntropyFilter
github CI workflow for flake8 and unittests

Fixed

behaviour of simple filters on empty segments

1.0.1 - 2020-05-25

Added

improved logging, documentation, and project files

Fixed

prevent UnboundLocalError for empty output after filter

1.0.0 - 2020-04-10

First tagged version.