Building a Language Detection Algorithm

This repo builds a language detection model that can be used for the 21 languages spoken in the European Union. For any given text, the model predicts its language.

This project is one of the challenge problems for the fellowship.ai application. The organizers asked applicants to build a model on the EU Parliament Parallel Corpus, a corpus of ~5 GB of text files (1.5 GB zipped). They also supplied the test set used to evaluate predictions.

Summary

The model trained on the whole dataset achieves near-perfect (99.9%+) test set accuracy. Fitting the model to the whole dataset takes significant computation time (~2 hours), but the same model fitted to a 1% subsample of the training data still achieves very good results (98.79%).

The table below lists the model results based on sample size.

Dataset	Model	Accuracy	Training Time	Inference Time	Vocabulary size	Link
Full Dataset	Word Level	99.96%	130 min	1.22 secs	403619	here
10% Sample	Word Level	99.90%	12 min	1.22 secs	433309	here
1% Sample	Word Level	98.79%	1 min	1.17 secs	151746	here
1% Sample	Character Level	99.44%	7 min	1.79 secs	327	here

Training time includes all preprocessing and model fitting (but not download time). Inference time is the time needed to predict the ~20k sentences of the test set. All runs were done on a Google Cloud virtual machine with a P100 GPU (the main notebook contains the full specification).
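The vocabulary sizes in the table differ so sharply because the word-level models index whole words while the character-level model indexes single characters. The sketch below illustrates that difference. It is not the code from the notebooks: lowercasing, whitespace splitting, the `<unk>` token, and the function names are all assumptions made for illustration.

```python
from collections import Counter

def tokenize(text, level="word"):
    """Split text into lowercase word tokens or into single characters."""
    text = text.lower()
    return text.split() if level == "word" else list(text)

def build_vocab(texts, level="word"):
    """Map each distinct token to an integer id; id 0 is reserved for unknown tokens."""
    counts = Counter(tok for t in texts for tok in tokenize(t, level))
    vocab = {"<unk>": 0}  # hypothetical placeholder for unseen tokens
    for tok, _ in counts.most_common():
        vocab[tok] = len(vocab)
    return vocab

def encode(text, vocab, level="word"):
    """Turn a text into a list of token ids, using 0 for unseen tokens."""
    return [vocab.get(tok, 0) for tok in tokenize(text, level)]

sample = ["the house is small", "das haus ist klein", "la maison est petite"]
word_vocab = build_vocab(sample, "word")
char_vocab = build_vocab(sample, "char")

word_ids = encode("das auto", word_vocab, "word")   # "auto" is unseen, so it maps to 0
char_ids = encode("das auto", char_vocab, "char")   # every character here was seen
```

On a three-sentence toy sample the two vocabularies are of similar size. On a large corpus the word vocabulary keeps growing with every new word form, while the character vocabulary stays bounded by the alphabets and punctuation in use.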

Model Description

The model is based on a recurrent neural net. After some very basic preprocessing, I embed words into a 50-dimensional space. I then feed the resulting embeddings through a standard GRU followed by a linear layer.

The main notebook contains the detailed steps, as well as the justification for the hyperparameter choices.
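To make the data flow concrete, here is a framework-free sketch of a forward pass through an embedding, a GRU, and a linear layer. It is only an illustration under stated assumptions, not the notebook's implementation: the class name, the hidden size of 16, the omitted bias terms, and the random untrained weights are all made up for the example. Only the 50-dimensional embedding and the embedding, GRU, linear ordering come from the description above. GRU formulations differ slightly between libraries in where the reset gate is applied; this sketch applies it to the hidden state before the recurrent matrix.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

class TinyGRUClassifier:
    """Hypothetical embedding -> GRU -> linear classifier (untrained, no biases)."""

    def __init__(self, vocab_size, n_classes, emb_dim=50, hidden=16, seed=0):
        rng = random.Random(seed)
        self.hidden = hidden
        self.emb = rand_matrix(vocab_size, emb_dim, rng)   # one row per token id
        # One input matrix W and one recurrent matrix U per gate:
        # z = update gate, r = reset gate, n = candidate state.
        self.W = {g: rand_matrix(hidden, emb_dim, rng) for g in "zrn"}
        self.U = {g: rand_matrix(hidden, hidden, rng) for g in "zrn"}
        self.out = rand_matrix(n_classes, hidden, rng)     # final linear layer

    def step(self, x, h):
        """One GRU time step: combine input x with previous hidden state h."""
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in zip(matvec(self.W["n"], x), matvec(self.U["n"], rh))]
        return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

    def forward(self, token_ids):
        """Return one score (logit) per language for a sequence of token ids."""
        h = [0.0] * self.hidden
        for t in token_ids:
            h = self.step(self.emb[t], h)
        return matvec(self.out, h)

    def predict(self, token_ids):
        logits = self.forward(token_ids)
        return max(range(len(logits)), key=logits.__getitem__)

model = TinyGRUClassifier(vocab_size=10, n_classes=21)
logits = model.forward([1, 2, 3])
predicted_class = model.predict([1, 2, 3])
```

The final hidden state summarizes the whole sentence, and the linear layer maps it to 21 scores, one per language. With trained weights, the index of the largest score would be the predicted language.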

Description of Files

Lang_Class.ipynb: The main model, trained on the whole dataset. Contains explanations.
utils.py: Contains a list of (relatively uninteresting) helper functions.
Download_Data.ipynb: Downloads the data and puts it in the appropriate directories.
Create_Smaller_Training_Set.ipynb: Copies a random subset of the data to a new directory (for faster model building).
Lang_Class_10pct.ipynb: Trains the main model on 10% of the data.
Lang_Class_1pct.ipynb: Trains the main model on 1% of the data.
Lang_Class_charlvl_1pct.ipynb: Trains a character-level model on 1% of the data.
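
The subsampling step performed by Create_Smaller_Training_Set.ipynb can be sketched roughly as follows. This is an assumption about how such a step might look, not the notebook's code: the function name, its parameters, and the choice to preserve the subfolder layout are hypothetical.

```python
import random
import shutil
from pathlib import Path

def copy_random_subset(src_dir, dst_dir, fraction=0.01, seed=0):
    """Copy a random fraction of the files under src_dir into dst_dir.

    The relative subfolder layout is preserved, so a per-language folder
    structure (if there is one) survives in the smaller copy.
    """
    src, dst = Path(src_dir), Path(dst_dir)
    files = sorted(p for p in src.rglob("*") if p.is_file())
    rng = random.Random(seed)  # fixed seed makes the subset reproducible
    k = max(1, round(len(files) * fraction)) if files else 0
    chosen = rng.sample(files, k)
    for path in chosen:
        target = dst / path.relative_to(src)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
    return chosen
```

Calling `copy_random_subset("data/full", "data/sample_1pct", fraction=0.01)` (both paths are placeholders) would produce a 1% sample. Note that a uniform sample over all files does not guarantee equal coverage per language. Sampling within each language folder separately would be one way to enforce that.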

kk1694/Lang_Detect

Folders and files

Latest commit

History

Repository files navigation

Building a Language Detection Algorithm

Summary

Model Description

Description of Files

About

Resources

Stars

Watchers

Forks

Languages