Releases: GlobalMaksimum/sadedegel

More Datasets & Prebuilt Models

22 Apr 00:57

New Features

  • Special handling of emoji, hashtag and mention tokens by word tokenizers. Refer to sadedegel config for details.
    • The same options are also available in Text2Doc, the text to sadedegel Document converter.
  • [Incomplete] HashVectorizer (works far better than TfIdf or BM25 vectorization for the majority of the prebuilt models)
  • unary option for idf
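
As a rough illustration of the idea behind a hashing vectorizer (this is not sadedegel's HashVectorizer), the sketch below maps tokens into a fixed number of buckets with a stable hash, so no vocabulary lookup is needed. The helper name hash_vectorize and the bucket count are assumptions made for this example.

```python
import hashlib
from collections import Counter
from typing import Dict, List


def hash_vectorize(tokens: List[str], n_buckets: int = 16) -> Dict[int, int]:
    """Map tokens to bucket counts using a stable hash (hypothetical helper)."""
    counts: Counter = Counter()
    for token in tokens:
        # md5 gives the same digest across runs, unlike Python's salted hash().
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        counts[int(digest, 16) % n_buckets] += 1
    return dict(counts)


# The repeated token lands in the same bucket both times.
vector = hash_vectorize(["sade", "degel", "sade"])
```

Different tokens can collide in one bucket; that is the usual trade-off of feature hashing against a fixed memory footprint.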

Datasets

We keep adding new datasets with this release. Refer to the Dataset ReadMe for details.

  • Customer Review dataset
  • Telco (Turkcell) Sentiment dataset
  • Movie Sentiment dataset
  • Hotel Sentiment dataset
  • Categorized Product Sentiment dataset

Prebuilt Models

We keep adding new prebuilt models with this release. Refer to the Prebuilt Model ReadMe for details.

  • Turkish Movie Review Sentiment Classification
  • Telco Brand Tweet Sentiment Classification
  • Turkish Customer Reviews Classification

Others

  • Lazy evaluation of word shape property
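
A minimal sketch of what lazy evaluation of a shape property can look like, using functools.cached_property from the standard library. The Token class here is a hypothetical stand-in, not sadedegel's Token, and the shape alphabet (X, x, d) is an assumption.

```python
from functools import cached_property


class Token:
    """Hypothetical stand-in for a token; not sadedegel's Token class."""

    def __init__(self, word: str) -> None:
        self.word = word
        self.shape_calls = 0  # counts how often the shape is actually computed

    @cached_property
    def shape(self) -> str:
        # Computed on first access only, then stored on the instance.
        self.shape_calls += 1
        return "".join(
            "X" if ch.isupper() else "x" if ch.islower() else "d" if ch.isdigit() else ch
            for ch in self.word
        )


token = Token("Ankara2021")
first = token.shape   # computed here
second = token.shape  # served from the cached value
```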

Behavioural Changes

  • Significant behavior change on the tokens property: it previously returned List[str] and now returns List[Token].
  • Sentence Tokenizer is renamed to Sentence Boundary Detector to prevent confusion with Word Tokenizer.

Contribution

ICU Tokenizer & Better Vocabulary Structure

02 Apr 17:37

Sadedegel is no longer only "An extraction based Turkish news summarizer", but "A General Purpose NLP library for Turkish".

News

We have added icu tokenizer as the default tokenizer (word tokenizer) which is very fast and accurate.

  • BERT is now an optional dependency, which can be installed using pip install sadedegel[bert]
  • Word embeddings are introduced. To retrain use pip install sadedegel[w2v]
  • Making those dependencies optional makes sadedegel installation much faster than before:
    • pip install sadedegel takes 3 minutes @40Mbps for version 0.18
    • pip install sadedegel takes 40 sec @40Mbps for version 0.19
  • Vocabulary files are now in hdf5 format. bert, icu and simple have their own vocabulary files.
    • Only icu vocabulary file includes word embeddings.
  • Relax dependencies (less strict module version coupling)
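
Code that relies on an optional dependency usually guards the import and points the user at the right extra. The sketch below shows that general pattern; require_optional is a hypothetical helper, not part of sadedegel, and a standard-library module stands in for the optional package.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional module or fail with an install hint (hypothetical helper)."""
    try:
        return importlib.import_module(module_name)
    except ImportError as err:
        raise ImportError(
            f"{module_name} is not installed. Try: pip install sadedegel[{extra}]"
        ) from err


# A standard-library module stands in for an optional dependency here.
json_module = require_optional("json", extra="bert")
```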

Feature Drop & Deprecation

Others

  • Pre-trained models under prebuilt are refreshed
    • They now use icu tokenizer
    • They now return class probabilities for predictions

More Prebuilt Models

17 Mar 09:59

0.18 adds more prebuilt models into sadedegel library

News

  • Our main contributor @dafajon has implemented a new BM25Summarizer, similar to the TfIdf summarizer. BM25Summarizer performs slightly better on short summaries.

  • We have packaged two new prebuilt models (refer to README for model accuracies):

    1. Tweet profanity classification (sadedegel.prebuilt.tweet_profanity)
    2. Tweet sentiment classification (sadedegel.prebuilt.tweet_sentiment)
  • We changed the way we report summarizer performance. Instead of a grid search of summarizer options, we now use a RandomSearch to decide the optimal summarizer and parameters. Refer to README for details.
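
For readers unfamiliar with BM25, the sketch below scores each tokenized sentence by the sum of its BM25 term weights, using the textbook Okapi BM25 formula. It is an illustration only: how BM25Summarizer computes its statistics is not described here, and the function name, the k1 and b defaults, and the use of per-document sentence frequencies are all assumptions. It expects a non-empty list of sentences.

```python
import math
from collections import Counter
from typing import List


def bm25_sentence_scores(
    sentences: List[List[str]], k1: float = 1.5, b: float = 0.75
) -> List[float]:
    """Score each tokenized sentence by the sum of its BM25 term weights."""
    n = len(sentences)
    avg_len = sum(len(s) for s in sentences) / n
    # Sentence frequency: in how many sentences each term occurs.
    sf = Counter(term for s in sentences for term in set(s))
    scores = []
    for s in sentences:
        score = 0.0
        for term, freq in Counter(s).items():
            idf = math.log(1 + (n - sf[term] + 0.5) / (sf[term] + 0.5))
            # Length normalization: longer sentences need more hits per term.
            norm = freq + k1 * (1 - b + b * len(s) / avg_len)
            score += idf * freq * (k1 + 1) / norm
        scores.append(score)
    return scores


scores = bm25_sentence_scores([["a", "b"], ["a", "c", "d"], ["a"]])
```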

Feature Drop & Deprecation

  • sents property on Doc is dropped. Iterate over the Doc instead (it implements __iter__).
  • tf property on Doc is deprecated (will be dropped by 0.18) in favor of get_tf function which gives a more flexible way to access document level tf vectors.
  • tfidf function on Doc is deprecated (will be dropped by 0.18) in favor of get_tfidf function which gives a more flexible way to access document level tf-idf vectors.
  • lexrank external dependency is dropped and LexRankPureSummarizer is renamed to be LexRankSummarizer
  • set_config, get_config, describe_config and get_all_configs are dropped in favor of new configuration implementation.

Others

  • tf property is now a part of TfImpl class using default configuration settings to yield a tf vector for a Doc or Sentence
  • We've updated documentation for our datasets.
  • idf property is now a part of IdfImp class using default configuration settings to yield an idf vector for a Doc or Sentence
  • More default parameters in default.ini based on our summarizer performance.

Direction to General Purpose NLP Library for Turkish

17 Mar 09:40

The 0.17 release introduces several NLP capabilities in sadedegel that are not related to summarization.

News

  • Starting with this release, sadedegel now ships prebuilt models for various basic NLP tasks. The purpose is to allow developers to load & use those models with minimal configuration.
    • Our first model is a news classifier (Thanks Taner Sezer for his corpus support)
  • We report accuracy of our tokenizers (word) for potential enhancement points in future releases (Thanks Taner Sezer for his corpus support)
  • To support the development of prebuilt models, the sklearn compatible extension.sklearn module is introduced for feature engineering.
  • Token.is_stopword is added to flag stopword token types.
  • LexRankSummarizer (based on the lexrank external module, to be deprecated in future releases) and LexRankPureSummarizer (pure sadedegel version of the same method) are added to the set of extractive summarizers.

Feature Drop & Deprecation

  • sents property on Doc is dropped. Iterate over the Doc instead (it implements __iter__).
  • tf property on Doc is deprecated (will be dropped by 0.18) in favor of get_tf function which gives a more flexible way to access document level tf vectors.
  • tfidf function on Doc is deprecated (will be dropped by 0.18) in favor of get_tfidf function which gives a more flexible way to access document level tf-idf vectors.

Others

  • We have pushed up TF and IDF implementations from Sentence and Doc to separate classes using python multiple inheritance support to reduce code duplication.
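
The refactoring described above follows the familiar mixin pattern. The sketch below shows the general shape with hypothetical class names; it does not reproduce sadedegel's TF and IDF classes.

```python
from collections import Counter
from typing import Dict, List


class TfMixin:
    """Hypothetical mixin: works on any class that provides a tokens attribute."""

    tokens: List[str]

    def term_counts(self) -> Dict[str, int]:
        return dict(Counter(self.tokens))


class VocabularyMixin:
    """Second hypothetical mixin, to show several behaviours being combined."""

    tokens: List[str]

    def vocabulary(self) -> List[str]:
        return sorted(set(self.tokens))


class SentenceSketch(TfMixin, VocabularyMixin):
    def __init__(self, tokens: List[str]) -> None:
        self.tokens = tokens


class DocSketch(TfMixin, VocabularyMixin):
    def __init__(self, sentences: List[SentenceSketch]) -> None:
        self.sentences = sentences
        # Flatten sentence tokens so the shared mixin code works unchanged.
        self.tokens = [t for s in sentences for t in s.tokens]


sentence = SentenceSketch(["a", "b", "a"])
document = DocSketch([sentence, SentenceSketch(["c"])])
```

Both classes get term_counts and vocabulary from one implementation each, which is the duplication saving the note refers to.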

Minor Performance Enhancements & Tidy Up

07 Jan 23:13

In one month's time we have added a lot to the sadedegel library.

News

  • We have resolved an old and major issue caused by improper from transformers import AutoTokenizer calls here and there, and the sentence boundary detector (sbd) is now lazy loaded. Just to give an idea:
    • sadedegel config CLI call to show sadedegel configuration took 11 sec in 0.16.1.1 release whereas 2 sec in 0.16.2.1+
    • from sadedegel import Doc call (which is usually the first one to start working with sadedegel) took 9.5 sec in 0.16.1.1 release whereas 1 sec in 0.16.2.1+

Feature Drop & Deprecation

  • Old configuration capabilities are deprecated (this time unfortunately without prior warnings in earlier releases)
    • A DeprecationWarning indicates that you are accessing one of these APIs, which will be removed completely by 0.18
    • Please use new API config_context (tf_context and idf_context are just simplified wrappers)
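
To find the places in your own code that still touch a deprecated API, Python's standard warnings module can record every DeprecationWarning. In the sketch below, old_api is a hypothetical function standing in for a legacy configuration call; only the warnings calls are real standard-library API.

```python
import warnings


def old_api() -> int:
    """Hypothetical deprecated function standing in for a legacy configuration call."""
    warnings.warn("old_api is deprecated", DeprecationWarning, stacklevel=2)
    return 42


# Record warnings instead of printing them, and make sure none are filtered out.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = old_api()

deprecated_calls = [w for w in caught if issubclass(w.category, DeprecationWarning)]
```

Running a test suite with python -W error::DeprecationWarning has a similar effect: each deprecated call fails loudly instead of passing silently.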

Documentation

  • CONFIG.md details the configuration of sadedegel.

Others

  • __getitem__ function to access any token of a Sentence
  • Iterator on Sentence yields all Tokens in order.
  • default tf method is now log_norm instead of binary thanks to @dafajon's most recent summarizer experiments.
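
To make the difference between the two tf methods concrete, the sketch below uses the conventional definitions: binary marks a term as present or absent, and log_norm takes log(1 + count). These formulas are an assumption for illustration; sadedegel's exact implementation may differ.

```python
import math
from collections import Counter
from typing import Dict, List


def term_frequencies(tokens: List[str], method: str = "log_norm") -> Dict[str, float]:
    """Term frequency weights under two conventional schemes (hypothetical helper)."""
    counts = Counter(tokens)
    if method == "binary":
        # Presence only: repeated terms carry no extra weight.
        return {term: 1.0 for term in counts}
    if method == "log_norm":
        # Repeats add weight, but with diminishing returns.
        return {term: math.log(1 + c) for term, c in counts.items()}
    raise ValueError(f"Unknown tf method: {method}")


binary_tf = term_frequencies(["a", "a", "b"], method="binary")
log_tf = term_frequencies(["a", "a", "b"], method="log_norm")
```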

Config, Configuration, Configurable

07 Jan 22:38
Pre-release

This release is mainly devoted to centralized configuration. A lot has changed, and something may have broken, though hopefully not (always feel free to open an issue).

New Capabilities

  • New command for sadedegel CLI, sadedegel config to retrieve all possible configurations.
    • Default values (sadedegel/default.ini) shipped with sadedegel can be overridden by creating a user-defined config file at ~/.sadedegel/user.ini (overridden values are indicated in the sadedegel config output).
  • Configurable tf and idf vectors per Sentence are ready with the new configuration model.
  • We have finally implemented forward version of BandSummarizer explained in sadedegel Presentation
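
The default-plus-user-override layering described above can be sketched with the standard configparser module. The section and option names below are made up for the example, and inline strings stand in for the two files; sadedegel's own loading code is not shown.

```python
import configparser

# Stands in for sadedegel/default.ini (section and keys are hypothetical).
DEFAULT_INI = """
[default]
tokenizer = bert

[tf]
method = binary
"""

# Stands in for ~/.sadedegel/user.ini; it only lists what the user changes.
USER_INI = """
[tf]
method = log_norm
"""

config = configparser.ConfigParser()
config.read_string(DEFAULT_INI)
config.read_string(USER_INI)  # later reads override earlier values

tf_method = config["tf"]["method"]
tokenizer = config["default"]["tokenizer"]
```

Options missing from the user file keep their default value, so a user file stays short.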

Internal Update

  • Previously sadedegel.Doc was a Python class initialized with a document (string). We have seen some caveats in this approach, so sadedegel.Doc is now an instance of sadedegel.DocumentBuilder. Calling it triggers its __call__ function, which returns a sadedegel.Document instance, and this (hopefully!) leaves the end user experience unchanged.
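
The builder arrangement can be pictured with the minimal sketch below. The class bodies are hypothetical and far simpler than the real ones; only the idea of a callable instance returning a document is taken from the note above.

```python
class Document:
    """Hypothetical minimal document holding the raw text."""

    def __init__(self, raw: str) -> None:
        self.raw = raw


class DocumentBuilder:
    """Callable factory: calling the instance returns a new Document."""

    def __call__(self, raw: str) -> Document:
        return Document(raw)


# A module-level instance lets callers keep writing Doc("...").
Doc = DocumentBuilder()
doc = Doc("Merhaba dünya.")
```

One visible difference for callers is that isinstance checks must target Document, since Doc is no longer a class.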

One Big Release

09 Oct 19:57
Pre-release

In one month's time we have added a lot to the sadedegel library.

News

  • We have @doruktiktiklar as the first code contributor out of Global Maksimum AI team.

New Capabilities

  • ADD: Addition of Vocabulary and Token concepts into library
    • Token: singleton per word (case sensitive) to store unique token features (lower form, shape, document frequency, etc.)
    • New sadedegel-build-vocabulary to manage sadedegel vocabularies.

New Summarizers

  • ADD: TextRank Summarizer
    TextRank summarizer uses Google's PageRank algorithm based on distance/similarity defined by BERT embedding cosine distance/similarity (as of this release and more to come)
  • ADD: TFIDF Summarizer
    TFIDF Summarizer uses element sum of tfidf vector of a sentence as the relevance score of a sentence in a document.
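
The relevance score described for the TFIDF Summarizer, the element sum of a sentence's tfidf vector, can be sketched as follows. As a simplification the idf here is computed from the sentences of a single document; the function name and that choice of statistics are assumptions, not the library's implementation.

```python
import math
from collections import Counter
from typing import List


def tfidf_sentence_scores(sentences: List[List[str]]) -> List[float]:
    """Relevance of each sentence as the sum of its tf * idf weights."""
    n = len(sentences)
    # In how many sentences each term occurs.
    sf = Counter(term for s in sentences for term in set(s))
    scores = []
    for s in sentences:
        tf = Counter(s)
        # Summing the weights is the element sum of the sentence's tfidf vector.
        scores.append(sum(freq * math.log(n / sf[term]) for term, freq in tf.items()))
    return scores


scores = tfidf_sentence_scores([["a", "b"], ["a"], ["a", "c", "c"]])
```

A term that occurs in every sentence gets idf 0 under this definition, so a sentence made only of such terms scores 0.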

Others

  • UPDATE: Some annotator consensus issues on summary corpus.
  • UPDATE: A better command-line for summarizer evaluation. Check sadedegel-summarize evaluate for more
  • ADD: Sentences level tf, idf and tfidf embeddings
  • ADD: Doc has tfidf_embeddings property similar to bert_embeddings property.

Documentation

Contribution Guidelines

  • ADD: Commit Guidelines
  • ADD: New Feature checklist

Feature Drop & Deprecation

  • DROP: Code quality guidelines are removed since Code Inspector limits the number of lines per open source project. We might continue with other providers in the future.

  • DEPRECATED: Doc.sents will be removed by version 0.17

    • Use [i] to access the ith sentence of a document
    • Doc object now implements __iter__ so you can iterate over all sentences of a document.
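
The access pattern that replaces Doc.sents is plain Python indexing and iteration. The sketch below shows the protocol on a hypothetical SimpleDoc class that stores sentences as strings; it is not sadedegel's Doc.

```python
from typing import Iterator, List


class SimpleDoc:
    """Hypothetical document that exposes sentences through indexing and iteration."""

    def __init__(self, sentences: List[str]) -> None:
        self._sentences = list(sentences)

    def __getitem__(self, i: int) -> str:
        return self._sentences[i]

    def __iter__(self) -> Iterator[str]:
        return iter(self._sentences)

    def __len__(self) -> int:
        return len(self._sentences)


doc = SimpleDoc(["Birinci cümle.", "İkinci cümle."])
second = doc[1]            # instead of doc.sents[1]
all_sentences = list(doc)  # instead of doc.sents
```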

Bugfix

  • Properly handle empty documents. Ex Doc("") or Doc('')

Jekyll based sadedegel Github pages

13 Sep 20:11
Pre-release
  • ADD: We have initialized the sadedegel web page on Jekyll: SadedeGel WebSite
  • ADD: Add hotfix contribution process for sadedegel into CONTRIBUTING.md
  • ADD: Sadedegel Slack channel.
  • ADD: Evaluation scores of new experimental tokenizer (Simple Tokenizer)

Regular Expression based Simple Word Tokenizer & Code Quality

05 Sep 21:38
  • ADD: The major change of this release is the Simple word tokenizer implementation by @dafajon, written after seeing the issues with BERT Tokenizer. Note that the simple tokenizer is still experimental and not compatible with all summarizers (cluster based summarizers automatically switch to BERT Tokenizer in order to utilize BERT embeddings).
  • ADD: Introduction of sadedegel.set_config to modify some sadedegel configurations, such as the word tokenizer.
  • ADD: tags are added to ExtractiveSummarizer in order to filter them out (in evaluation etc.) easily.
  • ADD: Thanks to Code Inspector, sadedeGel is under constant code quality monitoring with an initial grade of A (Score 94). We will keep it as high as we can as the capabilities of the library grow.
  • CHANGE: Downgrade sklearn dependency back to 0.23.1 to prevent serialization compatibility warnings.
  • CHANGE: Score normalization of summarizers is pushed up to the parent abstract class ExtractiveSummarizer, improving code quality by reducing repetitive code blocks.
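
Pushing normalization into the abstract parent is an instance of the template method pattern. The sketch below uses hypothetical class names and assumes normalization means scaling scores to sum to 1; the real ExtractiveSummarizer may normalize differently.

```python
from abc import ABC, abstractmethod
from typing import List


class ExtractiveSummarizerSketch(ABC):
    """Hypothetical base class: subclasses score, the parent normalizes."""

    @abstractmethod
    def _raw_scores(self, sentences: List[str]) -> List[float]:
        ...

    def predict(self, sentences: List[str]) -> List[float]:
        scores = self._raw_scores(sentences)
        total = sum(scores)
        # Fall back to uniform scores when there is nothing to normalize by.
        if total == 0:
            return [1 / len(scores)] * len(scores) if scores else []
        return [s / total for s in scores]


class LengthSummarizer(ExtractiveSummarizerSketch):
    """Toy subclass: longer sentences get higher raw scores."""

    def _raw_scores(self, sentences: List[str]) -> List[float]:
        return [float(len(s.split())) for s in sentences]


scores = LengthSummarizer().predict(["bir iki üç", "dört"])
```

Each new summarizer then only implements its own scoring and inherits the shared normalization.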

Improving APIs & Add commandline entrypoints

15 Aug 23:02
  • ⚠️ CHANGE: We have changed the Doc constructor. Use the new from_sentences class method to construct a new Doc object from a list of strings (representing sentences). Resolves: #47

  • CHANGE: Sentences object now holds a reference to originating Doc object (Previously reference to Doc.sents) for more flexibility.

  • CHANGE: We have significantly standardized our summarizers (specifically cluster based summarizers). Resolves: #59. Summarizers now allow the following parameter types on predict and __call__ functions:

    • Doc
    • List[Sentences]
    • List[str] (each element is taken as a sentence)
  • ADD: We have completed documentation of the sadedegel* command-line entrypoints

    • sadedegel
    • sadedegel-dataset
    • sadedegel-dataset-extended
    • sadedegel-summarize
    • sadedegel-sbd
    • sadedegel-server
  • FIX: sadedegel info returns Heroku Application address properly.

  • FIX: Fix memoization bug on Sentences.tokens_with_special_symbols providing 10% faster Sentences.tokens calls.