History

1.0.0

Extensive refactoring by @juanjoDiaz: - Series of modular classes - Different lemmatization strategies available - Customization of dictionary loading and handling (DictionaryFactory) - LanguageDetector class with extended options - See readme and [detailed documentation](https://adbar.github.io/simplemma/)

Breaking changes: - The extensive argument is now greedy - The langdetect submodule is now language_detector from simplemma.langdetect import ... → from simplemma.language_detector import ...

Fixes and improvements: - is_known() function now restored to its state in v0.9.0 (full dictionary) - More languages and better rules (with @juanjoDiaz) - Use binary strings in dictionaries to save memory - Dictionary sort before compression by @1over137

Documentation: - Classes and general doc pages by @juanjoDiaz - Section on classes in the readme by @osma

0.9.1

smaller language data footprint with smallest possible impact on performance, using a combination of rules, upper limit on word length, and better data cleaning (#31)
unsupervised approach to affixes activated by default for some languages
reviewed rules for English and German (less greedy)
added rules for Dutch, Finnish, Polish and Russian
improved Russian and Ukrainian language data (#3)
improved tokenizer

0.9.0

smaller data files (especially for fi, la, pl, pt, sk & tr, #19)
added support for Asturian (ast, #20)
bug fixes (#18, #26)

0.8.2

languages added: Albanian, Hindi, Icelandic, Malay, Middle English, Northern Sámi, Nynorsk, Serbo-Croatian, Swahili, Tagalog
fix for slow language detection introduced in 0.7.0

0.8.1

better rules for English and German
inconsistencies fixed for cy, de, en, ga, sv (#16)
docs: added language detection and citation info

0.8.0

code fully type checked, optional pre-compilation with mypyc
fixes: logging error (#11), input type (#12)
code style: black

0.7.0

breaking change: language data pre-loading now occurs internally, language codes are now directly provided in lemmatize() call, e.g. simplemma.lemmatize("test", lang="en")
faster lemmatization, result cache
sentence-aware text_lemmatizer()
optional iterators for tokenization and lemmatization

0.6.0

improved language models
improved tokenizer
maintenance and code efficiency
added basic language detection (undocumented)

0.5.0

faster, more efficient code
dropped support for Python 3.5

0.4.0

new languages: Armenian, Greek, Macedonian, Norwegian (Bokmål), and Polish
language data reviewed for: Dutch, Finnish, German, Hungarian, Latin, Russian, and Swedish
Urdu removed of language list due to issues with the data
add support for Python 3.10 and drop support for Python 3.4
improved decomposition and tokenization algorithms

0.3.0

improved models and disambiguation
improved tokenization
extended rules for German

0.2.2

Work on decomposition rules
Reviewed language data
Cleaner code

0.2.1

Better decomposition into subwords by greedy algorithm
First benchmarks and data-based corrections: German, French, English, Spanish

0.2.0

Languages added: Danish, Dutch, Finnish, Georgian, Indonesian, Latin, Latvian, Lithuanian, Luxembourgish, Turkish, Urdu
Improved word pair coverage
Tokenization functions added
Limit greediness and range of potential candidates

0.1.0

First release on PyPI

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HISTORY.rst

HISTORY.rst

History

1.0.0

0.9.1

0.9.0

0.8.2

0.8.1

0.8.0

0.7.0

0.6.0

0.5.0

0.4.0

0.3.0

0.2.2

0.2.1

0.2.0

0.1.0

Files

HISTORY.rst

Latest commit

History

HISTORY.rst

File metadata and controls

History

1.0.0

0.9.1

0.9.0

0.8.2

0.8.1

0.8.0

0.7.0

0.6.0

0.5.0

0.4.0

0.3.0

0.2.2

0.2.1

0.2.0

0.1.0