Releases: estnltk/estnltk
EstNLTK 1.7.2
Release of version 1.7.2 || Installation || Changelog || Tutorials
EstNLTK 1.7.1
Release of version 1.7.1 || Installation || Changelog || Tutorials
EstNLTK 1.7.0
Release of version 1.7.0 || Installation || Changelog || Tutorials
Estnltk 1.6.2beta
[1.6.2-beta] - 2018-04-16
Changed
- Moved command line scripts for processing etTenTen and koondkorpus from estnltk/corpus_processing to corpus_processing;
- The command line scripts for processing etTenTen and koondkorpus were reworked so that both use the JSON format of version 1.6 for storing intermediate results;
- Restructured tutorials: basic_nlp_toolchain.ipynb was split into 7 separate tutorials and moved to tutorials/nlp_pipeline. Morphology and syntax-related tutorials were also moved to tutorials/nlp_pipeline;
- Indexing of Text and Layer objects.
- Banned equal spans in non-ambiguous layers.
Added
- Functionality to store and query text objects in the Postgres database.
- Tagger AddressGrammarTagger to extract address information from text.
- Tutorial demonstrating how to extract addresses from text using AddressGrammarTagger and store results in the Postgres database (tutorials/postgres/storing_text_objects_in_postgres.ipynb).
- Module parse_koondkorpus.py, which can be used for loading texts from XML TEI files of the Estonian Reference Corpus as EstNLTK Text objects. The module was ported from the version 1.4.1.1 and improved upon. Improvements: default encoding is now 'utf-8', and there is a working option to preserve the original sentence and paragraph tokenization from the XML files;
- Tutorial about loading XML TEI files with EstNLTK;
- Added more helpful scripts for processing large corpora (a script for random selection and clean-up of files);
- Added AdjectivePhraseTagger (ported from version 1.4.1.1);
- DisambiguatingTagger to disambiguate ambiguous layers.
- EnvelopingSpan to replace SpanList in enveloping layers.
- Attribute lists to hold and represent attribute values extracted from layers.
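The TEI loading described above (UTF-8 by default, original sentence and paragraph tokenization preserved) can be illustrated with a minimal standard-library sketch. This is not the API of parse_koondkorpus.py: the function name read_paragraphs, the sample markup, and the assumption that paragraphs and sentences are marked with plain `<p>` and `<s>` elements without namespaces are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified TEI-like sample; real corpus files are richer.
SAMPLE_TEI = """<text><body>
  <p><s>Tere hommikust!</s> <s>Kuidas läheb?</s></p>
  <p><s>Hästi.</s></p>
</body></text>"""


def read_paragraphs(xml_bytes):
    """Return a list of paragraphs, each a list of sentence strings.

    Assumes <p> marks paragraphs and <s> marks sentences (an assumption,
    not a guarantee about every TEI file).
    """
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for p in root.iter("p"):
        # Keep the sentence boundaries exactly as they appear in the XML.
        sentences = ["".join(s.itertext()).strip() for s in p.iter("s")]
        paragraphs.append(sentences)
    return paragraphs


# Encode explicitly as UTF-8, mirroring the default encoding mentioned above.
paragraphs = read_paragraphs(SAMPLE_TEI.encode("utf-8"))
print(paragraphs)
```

Keeping the nested list structure, rather than joining everything into one string, is what lets the original segmentation survive into later processing.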
Estnltk 1.6.1beta
Changed
- Redesigned Tagger base class. The deprecated TaggerOld is still in use for now.
- Moved morphology-related modules from estnltk/taggers/ to estnltk/taggers/morph/;
- Moved functions that convert between Vabamorf dicts and EstNLTK's Spans to estnltk/taggers/morph/morf_common.py;
- Updated make_resolver: default parameters for morphological analysis are now taken from morf_common.py;
; - Updated SentenceTokenizer: base_sentence_tokenizer is now customizable (e.g. LineTokenizer can be used to split into sentences by newlines);
Added
- Finite grammar module and GrammarParsingTagger.
- New taggers GapTagger, EnvelopingGapTagger, PhraseTagger, SpanTagger and vocabulary reading methods for PhraseTagger and SpanTagger.
- Added command line scripts that can be used for processing etTenTen and Koondkorpus;
- Added JavaProcess (ported from version 1.4.1.1);
- Added ClauseSegmenter (ported from version 1.4.1.1). Layer 'clauses' can now be added to the Text object. Note: this adds a Java dependency to EstNLTK: Java SE Runtime Environment (version >= 1.8) must be installed on the system and available from the PATH environment variable;
- Added UserDictTagger, which can be used to provide dictionary-based post-corrections to morphological analyses;
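Because ClauseSegmenter needs a Java runtime reachable from PATH, it can be useful to check for one before adding the 'clauses' layer. The sketch below uses only the Python standard library; the helper names find_java and java_version_banner are hypothetical and are not part of EstNLTK.

```python
import shutil
import subprocess


def find_java():
    """Return the full path of the `java` executable found on PATH, or None."""
    return shutil.which("java")


def java_version_banner(java_path):
    """Return the raw text printed by `java -version`.

    Most Java runtimes print this banner on stderr, so stderr is checked first.
    """
    result = subprocess.run(
        [java_path, "-version"], capture_output=True, text=True
    )
    return (result.stderr or result.stdout).strip()


java_path = find_java()
if java_path is None:
    print("java was not found on PATH")
else:
    print(java_path)
    print(java_version_banner(java_path))
```

The banner format differs between Java vendors, so the sketch prints it unparsed and leaves the comparison against the required version (>= 1.8) to the reader.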
Fixed
- Bugfix in PostMorphAnalysisTagger: postcorrections are no longer applied to empty spans;
- Bugfix in VabamorfTagger: layer_name can now be changed without running into errors;
- Fix in GTMorphConverter: added the missing disambiguation step. Clause annotations are now used to resolve the ambiguities related to conversion of sid, ksid, nuksid forms;
- SyntaxIgnoreTagger: improved detection of parenthesized acronyms;
- CompoundTokenTagger: improved detection of numbers with percentages;
Estnltk 1.6.0beta
updated .travis.yml
Estnltk 1.4.1.1
Changed
- Removed estner/estner.json file
- Removed unnecessary resource /maltparser/estnltkBasedDep2.mco
Fixed
- Fix encoding bug in event_tagger when running tests on Windows;
Estnltk 1.4.1
Added
- Improved NER performance using __slots__ in estner data model;
- Added sent_tokenizer_for_koond.py: a sentence tokenizer for processing 'koondkorpus' text files (as found in http://ats.cs.ut.ee/keeletehnoloogia/estnltk/koond.zip), which provides several post-processing fixes to known sentence-splitting problems;
- Updated 'koondkorpus' processing scripts teicorpus.py and convert_koondkorpus.py: added the option to specify the encoding of the input files;
- Added terminalprettyprinter.py module, which provides a pretty-printer method that can be used for graphically formatting annotated texts in terminal;
- Added gt_conversion.py module that can be used for converting morphological analysis categories from Vabamorf's format to the Giellatekno's (gt) format;
- Added basic support for syllable extraction
- Added EventTagger, KeywordTagger and RegexTagger and fixed basic Tagger API for creating new layers;
- Added adjective phrase tagger (marks fragments such as "väga hea" and "küllalt tore")
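The NER improvement listed above relies on `__slots__`, a standard Python feature that removes the per-instance attribute dictionary. A minimal sketch of the idea follows; the two token classes are hypothetical and are not EstNLTK's actual estner data model.

```python
class TokenWithDict:
    """Ordinary class: every instance carries its own __dict__."""

    def __init__(self, word, label):
        self.word = word
        self.label = label


class TokenWithSlots:
    """Same fields, but __slots__ replaces the per-instance __dict__."""

    __slots__ = ("word", "label")

    def __init__(self, word, label):
        self.word = word
        self.label = label


plain = TokenWithDict("Tartu", "LOC")
slotted = TokenWithSlots("Tartu", "LOC")

print(hasattr(plain, "__dict__"))    # True
print(hasattr(slotted, "__dict__"))  # False

# Slotted instances reject attributes that were not declared in __slots__.
try:
    slotted.lemma = "Tartu"
except AttributeError:
    print("cannot add undeclared attribute 'lemma'")
```

The saving is per instance, so it matters most for data models that create one small object per token, which is the situation an NER pipeline is typically in.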
Changed
- Updated Temporal expression tagger's and Clause segmenter's jar files to Java version 1.8;
- A major change: re-implementation of syntactic parsing interface:
  - pre-processing scripts of the VISLCG3-based syntactic analyser were rewritten in Python to ensure platform-independent processing;
  - "estnltk.syntax.tagger.SyntaxTagger" was reimplemented in two modules ("SyntaxPreprocessing" and "VISLCG3Pipeline"), and the modules were made available as a common pipeline in "estnltk.syntax.parsers.VISLCG3Parser";
  - added a possibility to use custom rules in VISLCG3Parser, or to load rules from a custom location;
  - updated MaltParser's model so that surface-syntactic labels are now also generated during the parsing;
  - moved MaltParser-based syntactic analysis and VISLCG3-based syntactic analysis to a common interface; both parsers are now available in the module "estnltk.syntax.parsers";
  - changed how syntactic information is stored in Text: syntactic analyses are now attached in a separate layer (and different layers are created for MaltParser's analyses and VISLCG3's analyses);
  - added "estnltk.syntax.utils.Tree", which provides an interface for making queries over a syntactic tree, and allows exporting syntactic analyses as nltk's DependencyGraphs and Trees;
  - added methods for importing syntactically analysed Texts from CG3 and CONLL format files;
- Improved NounPhraseChunker: made it compatible with the new interface of syntactic parsing;
- Converted tutorials to jupyter notebooks to make them runnable and testable;
- Tested and validated tutorials;
Fixed
- Fix a bug in NER feature extraction module with python 3.4;
- Fix in MaltParser's interface: temporary files are now maintained in system specific temp files dir (to avoid permission errors);
- Updated Temporal expression tagger:
  - fixed a TIMEX normalization bug: verb tense information is now properly used;
  - improved TIMEX extraction: re-implemented phrase level joining to provide more accurate extraction of long phrases;
- Fixed osx installs;
- Updated Vabamorf to fix #55;
- Fixed too restrictive package dependencies;
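The MaltParser fix listed above keeps temporary files in the system's temp directory so that a read-only install location no longer causes permission errors. A minimal sketch of that pattern with Python's standard tempfile module follows; the helper name write_temp_input, the `.conll` suffix, and the sample content are hypothetical, and this is not the code of EstNLTK's MaltParser interface.

```python
import os
import tempfile


def write_temp_input(text, suffix=".conll"):
    """Write text to a new file in the system temp dir and return its path."""
    # delete=False keeps the file on disk after it is closed, so that an
    # external tool can open it by name afterwards.
    with tempfile.NamedTemporaryFile(
        mode="w", suffix=suffix, encoding="utf-8", delete=False
    ) as handle:
        handle.write(text)
        return handle.name


path = write_temp_input("1\tTere\t_\n")
try:
    # The file is created under the directory reported by gettempdir(),
    # not inside the package directory.
    print(tempfile.gettempdir())
    print(path)
finally:
    # The caller is responsible for cleaning up when delete=False is used.
    os.remove(path)
```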
Release 1.4.0
fixed long description
Release 1.3.0