Releases · adbar/simplemma

31 May 10:21

adbar

v1.0.0

6860df6

simplemma-1.0.0 Latest

Extensive refactoring by @juanjoDiaz:

Series of modular classes
Different lemmatization strategies available
Customization of dictionary loading and handling (DictionaryFactory)
LanguageDetector class with extended options
See readme and detailed documentation

Breaking changes:

The extensive argument has been renamed to greedy
The langdetect submodule is now language_detector
from simplemma.langdetect import ... → from simplemma.language_detector import ...
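Code that has to run against both older releases and v1.0.0 can try the new submodule path first and fall back to the old one. The following is a minimal sketch of that idea; the helper name import_first_available is hypothetical and not part of simplemma, and the only simplemma names it relies on are the two submodule paths listed above.

```python
import importlib
from types import ModuleType
from typing import Optional


def import_first_available(*module_names: str) -> Optional[ModuleType]:
    """Hypothetical helper: return the first module that imports, or None."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            # Module (or its parent package) is missing; try the next name.
            continue
    return None


# New path first (v1.0.0 and later), old path as a fallback for earlier releases.
detector_module = import_first_available(
    "simplemma.language_detector",
    "simplemma.langdetect",
)
```

If neither path can be imported (for example, simplemma is not installed), detector_module is None, so callers should check for that before using it.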

Fixes and improvements:

is_known() function now restored to its state in v0.9.0 (full dictionary)
More languages and better rules (with @juanjoDiaz)
Use binary strings in dictionaries to save memory
Dictionary sort before compression by @1over137

Documentation:

Classes and general doc pages by @juanjoDiaz
Section on classes in the readme by @osma

Contributors

osma, juanjoDiaz, and 1over137

20 Jan 17:07

adbar

v0.9.1

07612fa

simplemma-0.9.1

What's Changed

smaller language data footprint with smallest possible impact on performance, using a combination of rules, upper limit on word length, and better data cleaning (#31)
unsupervised approach to affixes activated by default for some languages
reviewed rules for English and German (less greedy)
added rules for Dutch, Finnish, Polish and Russian
improved Russian and Ukrainian language data (#3)
improved tokenizer

Full Changelog: v0.9.0...v0.9.1

18 Oct 11:46

adbar

v0.9.0

f669295

simplemma-0.9.0

smaller data files (especially for fi, la, pl, pt, sk & tr, #19)
added support for Asturian (ast, #20)
bug fixes (#18, #26)

05 Sep 14:11

adbar

v0.8.2

8ff2546

simplemma-0.8.2

languages added: Albanian, Hindi, Icelandic, Malay, Middle English, Northern Sámi, Nynorsk, Serbo-Croatian, Swahili, Tagalog
fix for slow language detection introduced in 0.7.0

Full Changelog: v0.8.1...v0.8.2

01 Sep 12:14

adbar

v0.8.1

5b5ee9d

simplemma-0.8.1

better rules for English and German
inconsistencies fixed for cy, de, en, ga, sv (#16)
docs: added language detection and citation info

Full Changelog: v0.8.0...v0.8.1

02 Aug 15:37

adbar

v0.8.0

f356846

simplemma-0.8.0

code fully type checked, optional pre-compilation with mypyc
fixes: logging error (#11), input type (#12)
code style: black

Full Changelog: v0.7.0...v0.8.0

16 Jun 09:52

adbar

v0.7.0

6205180

simplemma-0.7.0

breaking change: language data pre-loading now occurs internally, language codes are now directly provided in lemmatize() call, e.g. simplemma.lemmatize("test", lang="en")
faster lemmatization and result cache
sentence-aware text_lemmatizer()
optional iterators for tokenization and lemmatization
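As an illustration of the call style introduced by the breaking change above, the sketch below passes the language code directly to lemmatize(), with no separate data-loading step. The wrapper lemmatize_words is a hypothetical helper added here for illustration, and the import guard is only there so the snippet still runs when simplemma is not installed.

```python
from typing import List, Optional

try:
    import simplemma  # third-party package: pip install simplemma
except ImportError:
    simplemma = None


def lemmatize_words(words: List[str], lang: str = "en") -> Optional[List[str]]:
    """Hypothetical helper: lemmatize each word, or return None if simplemma is missing."""
    if simplemma is None:
        return None
    # Since v0.7.0 the language code goes straight into the lemmatize() call.
    return [simplemma.lemmatize(word, lang=lang) for word in words]


result = lemmatize_words(["test"], lang="en")
```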

Full Changelog: v0.6.0...v0.7.0

06 Apr 14:30

adbar

v0.6.0

406c4c9

simplemma-0.6.0

improved language models
improved tokenizer
maintenance and code efficiency
added basic language detection (undocumented)

Full Changelog: v0.5.0...v0.6.0

19 Nov 16:26

adbar

v0.5.0

c8866b5

simplemma-0.5.0

faster, more efficient code
dropped support for Python 3.5

19 Oct 16:44

adbar

v0.4.0

9e99770

simplemma-0.4.0

new languages: Armenian, Greek, Macedonian, Norwegian (Bokmål), and Polish
language data reviewed for: Dutch, Finnish, German, Hungarian, Latin, Russian, and Swedish
Urdu removed from the language list due to issues with the data
added support for Python 3.10 and dropped support for Python 3.4
improved decomposition and tokenization algorithms
