
More Datasets & Prebuilt Models

husnusensoy released this 22 Apr 00:57
· 156 commits to master since this release

New Features

  • Special handling of emoji, hashtag and mention tokens by word tokenizers. Refer to sadedegel config for details.
    • The same options are also available in Text2Doc, the text to sadedegel Document converter
  • [Incomplete] HashVectorizer (works far better than TfIdf or BM25 vectorization for the majority of the prebuilt models)
  • unary option for idf
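
For readers unfamiliar with hash-based vectorization, the general idea (the "hashing trick") can be sketched as below. This is a minimal illustration only, not sadedegel's HashVectorizer: the function name hash_vectorize, the n_features parameter and the choice of MD5 as the hash are all assumptions made for the example.

```python
import hashlib
from typing import List


def hash_vectorize(tokens: List[str], n_features: int = 16) -> List[int]:
    """Hypothetical helper: map tokens to a fixed-length count vector.

    Each token is hashed to a bucket index, so no vocabulary has to be
    stored. Different tokens may collide in the same bucket.
    """
    vector = [0] * n_features
    for token in tokens:
        # A stable hash is used because Python's built-in hash() for str
        # is randomized between interpreter runs.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_features
        vector[index] += 1
    return vector


if __name__ == "__main__":
    print(hash_vectorize(["film", "çok", "güzel", "film"]))
```

The vector length stays fixed regardless of how many distinct tokens appear, which is the main trade-off against vocabulary-based schemes such as TfIdf or BM25: no vocabulary to fit or store, at the cost of possible collisions.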

Datasets

We keep adding new datasets with this release. Refer to Dataset ReadMe for details.

  • Customer Review dataset
  • Telco (Turkcell) Sentiment dataset
  • Movie Sentiment dataset
  • Hotel Sentiment dataset
  • Categorized Product Sentiment dataset

Prebuilt Models

We keep adding new prebuilt models with this release. Refer to Prebuilt Model ReadMe for details.

  • Turkish Movie Review Sentiment Classification
  • Telco Brand Tweet Sentiment Classification
  • Turkish Customer Reviews Classification

Others

  • Lazy evaluation of word shape property
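
Lazy evaluation of a property means the value is computed on first access and then reused. The sketch below shows one common way to do this in Python with functools.cached_property. The LazyToken class, its shape_calls counter and the shape encoding (X for uppercase, x for lowercase, d for digit) are hypothetical and chosen for illustration; they are not taken from sadedegel's implementation.

```python
from functools import cached_property


class LazyToken:
    """Hypothetical token whose shape is computed only when first requested."""

    def __init__(self, word: str):
        self.word = word
        self.shape_calls = 0  # counts how many times the shape was computed

    @cached_property
    def shape(self) -> str:
        self.shape_calls += 1
        # Encode each character class; other characters are kept as-is.
        return "".join(
            "X" if ch.isupper() else "x" if ch.islower() else "d" if ch.isdigit() else ch
            for ch in self.word
        )


if __name__ == "__main__":
    token = LazyToken("Ankara2021")
    print(token.shape_calls)  # nothing computed yet
    print(token.shape)
    print(token.shape_calls)  # computed exactly once
```

Tokens whose shape is never read pay no cost for it, which matters when a document produces many tokens.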

Behavioural Changes

  • Significant behaviour change in the tokens property. Previously the property returned List[str]; it now returns List[Token]
  • Sentence Tokenizer is renamed to Sentence Boundary Detector to prevent confusion with Word Tokenizer
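
Code that relied on tokens being plain strings may need a small adapter after this change. The sketch below is one possible approach under stated assumptions: the Token dataclass here is a hypothetical stand-in with a word attribute, and as_words is a hypothetical helper. Check the actual Token class in sadedegel for the attribute that holds the token text before using this pattern.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Token:
    """Hypothetical stand-in for a token object; not sadedegel's Token."""
    word: str


def as_words(tokens: List[Union[str, Token]]) -> List[str]:
    """Hypothetical helper: return plain strings for old- or new-style tokens."""
    # Strings pass through unchanged; token objects contribute their text
    # (assumed here to live in a `word` attribute).
    return [t if isinstance(t, str) else t.word for t in tokens]


if __name__ == "__main__":
    print(as_words([Token("merhaba"), Token("dünya")]))
    print(as_words(["merhaba", "dünya"]))
```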

Contribution