Releases: OpenNMT/Tokenizer

Tokenizer 1.37.1

01 Mar 13:09

Fixes and improvements

  • Consider escaped characters as single characters in BPE
  • Ignore undefined scripts when resolving inherited or common scripts

Tokenizer 1.37.0

28 Feb 15:06

New features

  • Add tokenization option allow_isolated_marks to allow combining marks to appear isolated in the tokenization output in specific conditions

Fixes and improvements

  • Fix infinite loop when the text contains an invalid Unicode character
  • Fix segmentation fault when the BPELearner does not find any pairs of characters in the tokenized data
  • [Python] Update ICU to 72.1

Tokenizer 1.36.0

13 Jan 15:05

New features

  • [Python] Add argument vocabulary in the Tokenizer constructor to set the vocabulary with a list of tokens instead of using a file
  • [Python] Add function pyonmttok.is_valid_language to check if a language code is valid and can be passed to the Tokenizer constructor

Tokenizer 1.35.0

06 Dec 10:41

New features

  • [Python] Add pickling support to pyonmttok.Vocab

Fixes and improvements

  • Update pybind11 to 2.10.1
  • Update cibuildwheel to 2.11.2

Tokenizer 1.34.0

13 Sep 09:31

Changes

  • [Python] Wheels are now built under manylinux2014 and require pip >= 19.3 for installation

New features

  • [Python] Build wheels for Python 3.11

Fixes and improvements

  • Improve error handling when reading token frequencies in the vocabulary file
  • [Python] Fix possible crash when pyonmttok is imported before torch
  • [Python] Update ICU to 71.1
  • [C++] Fix static compilation with -DBUILD_SHARED_LIBS=OFF
  • [C++] Fix CMake warning when compiling the tests

Tokenizer 1.33.0

29 Aug 12:34

New features

  • [Python] Build ARM64 wheels for macOS

Fixes and improvements

  • [CLI] Fix error when the option --segment_alphabet is not set
  • Fix SentencePiece build warning when compiling with Clang

Tokenizer 1.32.0

25 Jul 09:56

New features

  • Add property pyonmttok.Vocab.counters to retrieve the number of occurrences of each token

Fixes and improvements

  • Update pybind11 to 2.10.0
  • Update cxxopts to 3.0.0

Tokenizer 1.31.0

07 Mar 10:10

New features

  • Add utilities to build and use vocabularies:
    • pyonmttok.Vocab
    • pyonmttok.build_vocab_from_tokens
    • pyonmttok.build_vocab_from_lines
  • Define the method Tokenizer.__call__ to simplify the tokenizer usage when additional features are unused:
    tokens = tokenizer(text)

Fixes and improvements

  • Update pybind11 to 2.9.1

Tokenizer 1.30.1

25 Jan 15:59

Fixes and improvements

  • Fix deprecated language codes in ICU being incorrectly considered invalid (e.g. "tl" for Tagalog)

Tokenizer 1.30.0

29 Nov 14:58

New features

  • [Python] Build wheels for AArch64 Linux

Fixes and improvements

  • [Python] Update ICU to 70.1