GOKTU_NLP - TURKISH NLP PREPROCESSING TOOLBOX

Note: This toolbox was prepared for the CMPE561 Natural Language Processing course given by Prof. Dr. Tunga Gungor at Boğaziçi University.

This Turkish NLP preprocessing toolbox provides five functionalities:

- Tokenizer
  - Regex based
  - Logistic regression based
- Sentence Splitter
  - Regex based
  - Naive Bayes based
- Normalizer
- Stemmer
- Stopword Eliminator

How to use the toolbox

Tokenizer

from nlp_preprocessing_toolbox.tokenizer import Tokenizer, TokenizerML

text = "..."

# Regex-based tokenizer
tokenizer = Tokenizer()
tokenizer.setText(text)
tokenizer.run()
tokens = tokenizer.tokens

# Logistic-regression-based tokenizer
tokenizer = TokenizerML()
tokenizer.setText(text)
tokenizer.run()
tokens = tokenizer.tokens
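
To give a feel for what regex-based tokenization involves, here is a minimal standalone sketch. It is not the toolbox's implementation: the pattern and the name `simple_regex_tokenize` are assumptions made up for illustration, and the real `Tokenizer` may use different rules.

```python
import re

# Hypothetical pattern, not taken from the toolbox. Alternatives, in order:
#   1. numbers, optionally with "." or "," separators (e.g. "3,5")
#   2. words, optionally with apostrophe-attached suffixes (e.g. "Ankara'ya")
#   3. any single character that is neither a word character nor whitespace
TOKEN_PATTERN = re.compile(r"\d+(?:[.,]\d+)*|\w+(?:'\w+)*|[^\w\s]")


def simple_regex_tokenize(text):
    """Return the tokens found in text, in their original order."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(simple_regex_tokenize("Ali Ankara'ya 3,5 km yürüdü."))
```

Fixed patterns like this one cannot cover every case, which is presumably one reason the toolbox also ships a learned tokenizer.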

Sentence Splitter

from nlp_preprocessing_toolbox.sentence_splitter import SentenceSplitter, SentenceSplitterML

text = "..."

# Regex-based sentence splitter
sentSplit = SentenceSplitter()
sentSplit.setText(text)
sentSplit.run()
sentences = sentSplit.sentences

# Naive-Bayes-based sentence splitter
sentSplit = SentenceSplitterML()
sentSplit.setText(text)
sentSplit.run()
sentences = sentSplit.sentences
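
The idea behind a regex-based splitter can be sketched as follows. This is an illustrative assumption, not the code behind `SentenceSplitter`: the boundary rule and the name `simple_sentence_split` are hypothetical.

```python
import re

# Hypothetical rule, not taken from the toolbox: a sentence boundary is
# whitespace that follows ".", "!" or "?" and precedes an uppercase letter
# (Turkish capitals included).
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-ZÇĞİÖŞÜ])")


def simple_sentence_split(text):
    """Split text at the boundaries matched by SENTENCE_BOUNDARY."""
    return [part.strip() for part in SENTENCE_BOUNDARY.split(text) if part.strip()]


if __name__ == "__main__":
    print(simple_sentence_split("Hava güzel. Dışarı çıkalım mı? Evet!"))
```

A rule this simple also splits after abbreviations such as "Prof." or "Dr.", which is the kind of ambiguity a statistical splitter is meant to reduce.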

Normalizer

from nlp_preprocessing_toolbox.normalizer import Normalizer

words = [...] # a list of tokens

normalizer = Normalizer()
normalizer.setText(words)
normalized_tokens = normalizer.new_words

Stemmer

from nlp_preprocessing_toolbox.stemmer import Stemmer

words = [...] # a list of tokens

stemmer = Stemmer()
stemmer.setText(words)
stems = stemmer.stems

Stopword Eliminator

from nlp_preprocessing_toolbox.stopword_elimination import stopword_elimination

words = [...] # a list of tokens or stems

words = stopword_elimination(words)
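
Conceptually, stopword elimination is a membership filter over the token list. The sketch below illustrates that idea only: the stopword set and the name `remove_stopwords` are hypothetical, and the toolbox's `stopword_elimination` may use a different word list and matching rules.

```python
# Hypothetical, deliberately tiny stopword set for illustration only.
EXAMPLE_STOPWORDS = frozenset({"ve", "bir", "bu", "da", "de", "ile", "için"})


def remove_stopwords(tokens, stopwords=EXAMPLE_STOPWORDS):
    """Return the tokens whose lowercase form is not in the stopword set.

    Note: str.lower() applies Python's default case mapping, which does not
    follow the Turkish dotted/dotless I rules.
    """
    return [token for token in tokens if token.lower() not in stopwords]


if __name__ == "__main__":
    print(remove_stopwords(["Bu", "kitap", "ve", "defter"]))
```

The order of the remaining tokens is preserved, so the result can be passed on to later processing steps unchanged.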

Repository: GoktugOcal/nlp-preprocessing-toolbox
