sagorbrur/bnlm

Bengali Language Model

The Bengali language model is built with fastai's ULMFiT and is ready for prediction and classification tasks.

NB:

  • This tool mostly follows iNLTK.
  • The Bengali part has been separated out, with better evaluation results.

Installation

pip install bnlm

Dependencies

  • PyTorch >=1.0.0 and <=1.3.0
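
The version range above can be checked before installing or importing bnlm. The following is a minimal sketch, not part of bnlm: the helper names `parse_version` and `is_supported_torch` are hypothetical, and the check assumes that local build suffixes such as `+cu100` can be ignored.

```python
import re


def parse_version(version: str) -> tuple:
    """Return the leading numeric (major, minor, patch) of a version string."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    # A missing patch component (e.g. "1.2") is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def is_supported_torch(version: str) -> bool:
    """True if the version lies in the documented range 1.0.0 to 1.3.0 inclusive."""
    return (1, 0, 0) <= parse_version(version) <= (1, 3, 0)


# Typical use, assuming PyTorch is installed:
#   import torch
#   if not is_supported_torch(torch.__version__):
#       raise RuntimeError("bnlm expects PyTorch >=1.0.0 and <=1.3.0")
```

Comparing tuples of integers avoids the pitfalls of comparing version strings lexically (for example, "1.10.0" sorting before "1.3.0").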

Evaluation Result

Language Model

  • Accuracy: 48.26% on the validation dataset
  • Perplexity: ~22.79
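
For readers comparing these numbers with a training log, perplexity is conventionally the exponential of the mean cross-entropy loss (in nats). The sketch below assumes that convention applies here; the README does not state how the figure was computed, and the helper names are hypothetical.

```python
import math


def perplexity_from_loss(loss: float) -> float:
    """Perplexity under the usual definition: exp of the mean cross-entropy (nats)."""
    return math.exp(loss)


def loss_from_perplexity(perplexity: float) -> float:
    """Inverse of the above: the mean cross-entropy implied by a perplexity."""
    return math.log(perplexity)


# Validation loss implied by the reported perplexity, under the stated assumption.
implied_loss = loss_from_perplexity(22.79)
print(f"implied cross-entropy: {implied_loss:.2f} nats")
```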

Features and API

Download pretrained Model

To start, download the pretrained language model and the SentencePiece model:

from bnlm.bnlm import download_models

download_models()
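
Before running the examples in the following sections, it can help to confirm that the download produced the files they expect. This is a minimal sketch, not part of bnlm: the helper `missing_model_files` is hypothetical, and it assumes the models land in a `model` directory containing `bn_spm.model`, as the paths used in the later examples suggest.

```python
from pathlib import Path


def missing_model_files(model_dir="model", required=("bn_spm.model",)):
    """Return the names in `required` that are not regular files inside `model_dir`."""
    base = Path(model_dir)
    return [name for name in required if not (base / name).is_file()]


missing = missing_model_files()
if missing:
    print("Missing model files:", missing)
else:
    print("All expected model files are present.")
```

Only the SentencePiece file is checked by default because it is the only file name the examples mention; add further names to `required` once you know what the download produces.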

Predict N Words

predict_n_words takes three parameters as input:

  • input_sen (your incomplete input text)
  • N (number of words to predict)
  • model_path (path to the pretrained model)

from bnlm.bnlm import BengaliTokenizer
from bnlm.bnlm import predict_n_words
model_path = 'model'
input_sen = "আমি বাজারে"
output = predict_n_words(input_sen, 3, model_path)
print("Word Prediction: ", output)

Get Sentence Encoding

from bnlm.bnlm import BengaliTokenizer
from bnlm.bnlm import get_sentence_encoding
model_path = 'model'
sp_model = "model/bn_spm.model"
input_sentence = "আমি ভাত খাই।"
encoding = get_sentence_encoding(input_sentence, model_path, sp_model)
print("sentence encoding is: ", encoding)

Get Embedding Vectors

from bnlm.bnlm import BengaliTokenizer
from bnlm.bnlm import get_embedding_vectors
model_path = 'model'
sp_model = "model/bn_spm.model"
input_sentence = "আমি ভাত খাই।"
embed = get_embedding_vectors(input_sentence, model_path, sp_model)
print("sentence embedding is : ", embed)

Sentence Similarity

from bnlm.bnlm import BengaliTokenizer
from bnlm.bnlm import get_sentence_similarity
model_path = 'model'
sp_model = "model/bn_spm.model"
sentence_1 = "সে খুব করে কথা বলে।"
sentence_2 = "তার কথা খুবেই মিষ্টি।"
sim = get_sentence_similarity(sentence_1, sentence_2, model_path, sp_model)
print("Similarity is: %0.2f"%sim)

# Output:  0.72
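
A score like the one above is commonly obtained as the cosine similarity between two sentence encodings. The sketch below illustrates that measure on plain Python lists; it is an assumption that get_sentence_similarity works this way, since the README does not describe its internals, and `cosine_similarity` is a hypothetical helper rather than a bnlm function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors standing in for two sentence encodings.
print("Similarity is: %0.2f" % cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.5]))
```

The result ranges from -1 (opposite directions) to 1 (same direction) and does not depend on the vectors' magnitudes.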

Get Similar Sentences

get_similar_sentences takes four parameters:

  • input sentence
  • N (number of similar sentences to return)
  • model_path (path to the pretrained model)
  • sp_model (path to the pretrained SentencePiece model)

from bnlm.bnlm import BengaliTokenizer
from bnlm.bnlm import get_similar_sentences

model_path = 'model'
sp_model = "model/bn_spm.model"

input_sentence = "আমি বাংলায় গান গাই।"
sen_pred = get_similar_sentences(input_sentence, 3, model_path, sp_model)
print(sen_pred)
# output: ['আমি বাংলায় গান গাই ।', 'আমি ইংরেজিতে গান গাই।', 'আমি বাংলায় গানও গাই।']

Classification

Upcoming.

Training

To train with your own corpus, follow this repository.

Contributor