smafjal/bengali_tokenizer

SentencePiece

SentencePiece was released by Google for text tokenization. It is an unsupervised text tokenizer for neural-network-based text generation. This repository tries to make it easy to use for the Bengali language.

Usage

In Bengali, tokenization is generally done by splitting on whitespace and punctuation (e.g. , . ; |). An unsupervised tokenization model instead produces a list of subword tokens that can be fed to language models such as word2vec, XLM, BERT, or LASER to capture context from Bengali text. One of my main motivations is to use this model in my other NLP/NLU tasks.
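For comparison, the whitespace-and-punctuation baseline described above could be sketched as follows. This is only an illustration, not code from this repository: the function name `baseline_tokenize` is hypothetical, and the Bengali danda "।" is added to the punctuation set as an assumption.

```python
import re

# Punctuation listed in the text (, . ; |) plus the Bengali danda,
# which is an assumption added for this sketch.
_PUNCT = ",.;|\u0964"
_TOKEN_RE = re.compile(
    rf"[{re.escape(_PUNCT)}]|[^\s{re.escape(_PUNCT)}]+"
)


def baseline_tokenize(text):
    """Split text into words and single punctuation marks.

    Hypothetical helper: whitespace separates tokens, and each
    punctuation character becomes its own token.
    """
    return _TOKEN_RE.findall(text)
```

A baseline like this never splits inside a word, so inflected forms such as "মেট্রোপলিসের" stay as one token, whereas a subword model can separate the stem from the suffix.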

How

  • Collect your raw Bengali corpus and store it in a directory in pickle format.
  • The code base uses "data/" as the corpus directory.
  • Use main.py to train the SentencePiece model.
  • Change the training parameters in main.py.
  • The trained model is saved to the "mode/" directory.
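The steps above might look roughly like the sketch below. It is not the repository's main.py: the helper names, the `*.pkl` file pattern, the assumption that each pickle holds a list of strings, the `bn_spm` model prefix, and the parameter values are all illustrative. Only `SentencePieceTrainer.train` and its keyword arguments come from the `sentencepiece` Python package.

```python
import pickle
from pathlib import Path


def load_pickled_corpus(corpus_dir):
    """Yield non-empty sentences from every *.pkl file in corpus_dir.

    Assumption: each pickle file contains a list of strings.
    """
    for path in sorted(Path(corpus_dir).glob("*.pkl")):
        with open(path, "rb") as f:
            for sentence in pickle.load(f):
                sentence = sentence.strip()
                if sentence:
                    yield sentence


def write_training_file(corpus_dir, out_path):
    """Write one sentence per line, the input format SentencePiece expects.

    Returns the number of sentences written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for sentence in load_pickled_corpus(corpus_dir):
            out.write(sentence.replace("\n", " ") + "\n")
            count += 1
    return count


def train(corpus_dir="data/", model_prefix="mode/bn_spm", vocab_size=8000):
    """Train a model; the prefix and vocab size are illustrative defaults."""
    import sentencepiece as spm  # third-party: pip install sentencepiece

    Path(model_prefix).parent.mkdir(parents=True, exist_ok=True)
    text_file = model_prefix + "_corpus.txt"
    write_training_file(corpus_dir, text_file)
    spm.SentencePieceTrainer.train(
        input=text_file,
        model_prefix=model_prefix,   # writes <prefix>.model and <prefix>.vocab
        vocab_size=vocab_size,
        model_type="unigram",
        character_coverage=1.0,      # keep every Bengali character
    )
```

Calling `train()` with its defaults reads pickles from "data/" and writes the model files under "mode/". Only load pickle files you created yourself, since unpickling untrusted data can execute arbitrary code.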

Example

text = "বগুড়ায় জাতীয় লিগে দ্বিতীয় স্তরের ম্যাচে ঢাকা মেট্রোপলিসের হয়ে সেঞ্চুরি পেয়েছেন মাহমুদউল্লাহ।"
tokens = ["▁বগুড়া", "য়", "▁জাতীয়", "▁লিগে", "▁দ্বিতীয়", "▁স্তরের", "▁ম্যাচে", "▁ঢাকা", "▁মেট্রো", "পলিস", "ের", "▁হয়ে", "▁সেঞ্চুরি", "▁পেয়েছেন", "▁মাহমুদ", "উল্লাহ", "।"]

Acknowledgement

Don't forget to thank goru001 for his awesome collection of Bengali WikiData. He also maintains a great Bengali NLP repo, nlp-for-bengali.

Then

To know, is to know that you know nothing. That is the meaning of true knowledge. Socrates