Skip to content
This repository has been archived by the owner on Dec 28, 2020. It is now read-only.

Tries to predict which language a word belongs using LSTMs.

Notifications You must be signed in to change notification settings

toprakozturk/language-classifier

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

17 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Language Classifier

Tries to predict which language a word belongs using LSTMs.

Dataset contains 100K words. 50K from each: Turkish and English. I haven't included accented letters in Turkish like ü, ö, ı, ç, ş, because most of the Turkish words have one of them so it'd make the prediction lot easier. Instead, I've used the most approximate standard Latin letter - like o for ö.

English X, Q and W (Turkish alphabet doesn't present them) weren't touched since their frequency amongst English words are low.

Installing

  • Clone the repository.
  • cd language-classifier
  • pipenv install -r %% pipenv shell
  • Then run language-classifier/classifier.py

Requires Python 3.6+

Explanation

Imports

  • Keras: Deep learning framework
  • Numpy: Data storing and manipulation
  • Random: Just to shuffle the dataset.
import numpy as np
import random
from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout, LSTM
from keras.optimizers import RMSprop

Exploring and Preparing the Data

Importing the dataset.

data = []

with open('turkish.txt') as textfile:
    for word in textfile:
        data.append((word.replace('\n', ''), 0))

with open('english.txt') as textfile:
    for word in textfile:
        data.append((word.replace('\n', ''), 1))

random.shuffle(data)

Exploring

words = [record[0] for record in data]
labels = [record[1] for record in data]

char_pool = sorted(set(''.join(words)))
longest = sorted(words, key=len)[-1]
maxlen = len(longest)
word_count = len(data)

n_classes = 2

print('Character pool: {}'.format(", ".join(char_pool)))
print('Longest word: {}'.format(longest))
print('Length of the longest word: {}'.format(maxlen))
print('Data size: {} words.'.format(word_count))

So the result is..

Character pool: a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z
Longest word: trinitrophenylmethylnitramine
Length of the longest word: 29
Data size: 99957

Tokenizing char-wise.

char_indices = dict((c, i) for i, c in enumerate(char_pool))
indices_char = dict((i, c) for i, c in enumerate(char_pool))

Prepearing the training data. Basically creating a whole size 0 filled tensor, and then filling it with data as the data contains sequential one-hot arrays. Makes it easier for me.

x_data = np.zeros((word_count, maxlen, len(char_pool)), dtype=np.bool)
y_data = np.zeros((word_count, n_classes))

for i_word, word in enumerate(words):
    for i_char, char in enumerate(word):
        x_data[i_word, i_char, char_indices[char]] = 1

for i_label, label in enumerate(labels):
    y_data[i_label, label] = 1

The Predictive Model

model = Sequential()
model.add(LSTM(16, input_shape=(maxlen, len(char_pool))))
model.add(Dense(n_classes))
model.add(Activation('softmax'))

optimizer = RMSprop(lr=0.01)

model.compile(loss='categorical_crossentropy', optimizer=optimizer)

Training

for iteration in range(3):
    model.fit(x_data, y_data, batch_size=128, nb_epoch=1)

Launch!

def predict(word):
    processed_word = np.zeros((1, maxlen, len(char_pool)))
    for i_char, char in enumerate(word):
        processed_word[0, i_char, char_indices[char]] = 1
    prediction = model.predict(processed_word, verbose=0)[0]
    
    result = {'Turkish': prediction[0], 'English': prediction[1]}

    return result

Throw any word you want inside this list. It'll be our playing dataset.

# [!] be sure they are all lower-case.
word_list = [
    # supposed to be Turkish
    'altinvarak',
    'bulutsuzluk',
    'farmakoloji',
    'toprak',
    'hanimeli',
    'imkansiz',

    # supposed to be English
    'tensorflow',
    'jabba',
    'magsafe',
    'pharmacology',
    'parallax',
    'wabby',
    'querein',

    # curiosity
    'terminal', # an actual word in both languages
    'ahahahah',
    'ahahahahahahahah',
    'rawr',
]

for word in word_list:
    prediction = predict(word)
    print('{}: {}'.format(word, prediction))

Results

altinvarak:	    TUR: 0.98	ENG: 0.02
bulutsuzluk:    TUR: 0.99	ENG: 0.01
farmakoloji:    TUR: 0.97	ENG: 0.03
toprak:	        TUR: 0.90	ENG: 0.10
hanimeli:	    TUR: 0.97	ENG: 0.03
imkansiz:	    TUR: 0.99	ENG: 0.01
tensorflow:	    TUR: 0.00	ENG: 1.00
jabba:	        TUR: 0.75	ENG: 0.25
magsafe:	    TUR: 0.59	ENG: 0.41
pharmacology:   TUR: 0.00	ENG: 1.00
parallax:	    TUR: 0.00	ENG: 1.00
wabby:	        TUR: 0.00	ENG: 1.00
querein:	    TUR: 0.00	ENG: 1.00
terminal:	    TUR: 0.20	ENG: 0.80
ahahahah:	    TUR: 0.83	ENG: 0.17
ahahahahahahahah:TUR: 0.80	ENG: 0.20
rawr:	        TUR: 0.00	ENG: 1.00

Overall Accuracy: 457/500 (91.4)

About

Tries to predict which language a word belongs using LSTMs.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages