Skip to content

Latest commit

 

History

History
161 lines (133 loc) · 5.53 KB

CHANGELOG.md

File metadata and controls

161 lines (133 loc) · 5.53 KB

Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

Unreleased:

Bicleaner AI 3.1.0:

Added:

  • bicleaner-ai-sample command to sample training data from multiple files/language pairs.

Bicleaner AI 3.0.1:

Changed:

  • Update hardrules to 2.10.4.

Bicleaner AI 3.0.0:

Added:

  • New multilingual full models: large-en-xx, es-xx, large-es-xx.
  • Options to overwrite source or target language in model config.
  • Generate training data with separated command.
  • More logging messages during noise generation.
  • Compress noise generation intermediate files.
  • Multilingual models training documentation.
  • Show a warning when no GPU/TPU has been detected.
  • Show a warning if some layers do not correctly.

Changed:

  • Huge improvements in accuracy multilingual full models.
  • Updated mutilingual full model en-xx.
  • Upload multilingual full models to HF.
  • Disable hardrules that need lang parameter when using multilingual model if language has not been overwritten.
  • Optional parallel_train, parallel_valid and noise generation in bicleaner_train.
  • Test classify full with HF model.
  • Re-organize documentation.
  • Updated Tensorflow.
  • Support for Python 3.11.
  • Updated Transformers to 4.36.

Fixed:

  • Always English being used in Tokenizer during noise generation.
  • Noise generation failing when freq_noise was disabled.
  • Accidentally deleting generated valid file at the end of training.

Bicleaner AI 2.3.2:

Fixed:

  • Fix compatibility with Hardrules metadata dict.

Bicleaner AI 2.3.1:

Added:

  • bicleaner-ai-download quiet mode.

Changed:

  • Update hardrules to 2.9.0, now hardrules accepts HF identifiers to load metadata.

Bicleaner AI 2.3.0:

Added

  • Full models download from HF now accept a local path to store the model instead of using HF cache.

Changed

  • Update HuggingFace Transformers and Hub.
  • Update documentation about downloading models and managing HF cache.

Fixed

  • Use a single worker to download from HF, fixes timeout errors during download.

Bicleaner AI 2.2.2:

Fixed

  • Fix '--rules_config' parameter.

Bicleaner AI 2.2.1:

Added

Changed

Fixed

  • Fix HF downloads in some slow docker instances with increased etag timeout.
  • Fix builds in PIP>=23 with new Hardrules and FastSpell.

Bicleaner AI 2.2.0:

Added

  • Support tokenizing by characters (useful for Chinese).
  • CLI option to configure minimum words/tokens to be omited/replaced.

Changed

  • Refactored Tokenizer class.
  • Update HF hub.
  • Removed external tokenizer option.
  • Improved begginers guide.

Fixed

  • Create only one Tokenizer object per process in noise function.

Bicleaner AI 2.1.0:

Added

Changed

  • Update Hardrules to 2.8.0
    • Better coverage of Icelandic langid
    • Updated KenLM installation instructions.

Fixed

  • KenLM installation.

Bicleaner AI 2.0:

Added

  • Upload full models to Hugging Face Hub.
  • Automatic download of full models.
  • Hide Tensorflow and Transformers logging messages in executable scripts.
  • Redirect Keras prediction progress bar to stderr.
  • Huge memory improvements during training.
  • Speed improvements using pading longest instead of max_length
  • Models are more insensitive to the presence of capital letter at the start of the sentence.
  • Improved performance on HBS Cyrillic transliterating in models which had poor training on cyrillic text.
  • Basic test suite.
  • Allow changing the base model for XLMR. Any XLMRoberta model can be used.

Changed

  • Migrate to pyproject.toml and src/ tree structure, comply with PEP517, PEP518 and PEP621.
  • Update to Hardrules 2.6
    • Rules can be parametrized with --rules_config config.yaml
    • Some rules have been refactored with better names.
    • --run_all_rules mode to run each rule instead of stoppping at first discard
    • Language identification with FastSpell
      • Better Serbo-Croatian and Slovene language detection.
    • Easier installation! Now KenLM comes pre-compiled.
  • Now BICLEANER_AI_THREADS environment variable controls the number of threads.
  • Update HF Transformers.
  • Update TensorFlow minimum version.
  • Removed glove-python dependency and use own custom compilation.
  • Improved download scripts, easier to install and use.
  • Set inter/intra_op parallelism to 0 by default.
  • Block size by default to 10k, a bit faster.
  • Faster noise generation for small datasets with lower block size.
  • Model argument can be provided with or without 'metadata.yaml'.
  • Add citation info to README.

Fixed

  • Avoid generating empty sentences in omit noise.
  • Restore capital letters at the beggining of the sentennce in frequency noise.
  • Retrocompatibility with older models.
  • Compatibility of glove with Python>=3.7.
  • Fix loading lite models in other Python versions than 3.8.
  • Fix unbound variable lm_stats.
  • Other minor fixes.

Bicleaner AI 1.0.1:

  • Update hardrules to 1.2: adds score only mode.

Bicleaner AI 1.0:

  • Bicleaner train changes:
    • Separate most of the training logic in the BaseModel class.
    • Re-factor synthetic noise build function.
    • Parallelize synthetic noise generation.
    • Add fuzzy matching noise and neighbour noise.
    • Add Decomposable Attention model.
    • Add Transkformer-like model.
    • Add XLMRoberta model.
  • Bicleaner classify changes:
    • Change old classifier by new neural models.
    • Move hardrules into a separate package.