LangIdentify 🏳️‍🌈

This repository contains code to train and evaluate a language identification model, plus a small web application for testing the model interactively.

Trained model 💪

A pre-trained model is available here. It was trained on a balanced dataset of 240k sentences in German 🇩🇪, English 🇬🇧, French 🇫🇷, Italian 🇮🇹, Portuguese 🇵🇹 and Spanish 🇪🇸.

Accuracy 🎯: 98.73%

Confusion matrix 🤯:
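
For readers who want to recompute these metrics on their own predictions, here is a minimal sketch of how accuracy and a confusion matrix can be derived from true and predicted labels. The language codes, the helper names `accuracy` and `confusion_matrix`, and the toy labels are assumptions made for illustration; they are not taken from the repository and do not reproduce the reported results.

```python
from collections import Counter

# Illustrative language codes; the actual label set lives in the repository code.
LANGS = ["de", "en", "fr", "it", "pt", "es"]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_matrix(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


# Toy labels for illustration only (not real model output).
y_true = ["de", "en", "fr", "it", "pt", "es", "pt", "es"]
y_pred = ["de", "en", "fr", "it", "pt", "es", "es", "es"]

print(f"accuracy: {accuracy(y_true, y_pred):.2%}")
for lang, row in zip(LANGS, confusion_matrix(y_true, y_pred, LANGS)):
    print(lang, row)
```

Off-diagonal entries show which language pairs get mixed up, which a single accuracy number hides.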

Demo application 🔥

Test the model with the demo application. Start the app with $ streamlit run app.py. Then, open http://localhost:8501/ in your browser.

Install requirements ⚙️

$ conda create -n langidentify python=3.8
$ conda activate langidentify
(Please check https://pytorch.org/get-started/locally/ and select the correct command depending on your CUDA version.)
$ conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=10.1 -c pytorch
$ pip install -r requirements.txt

Model training, evaluation & testing

Download the dataset here and save the sentences.csv file under data/sentences.csv.
Filter the data with $ python filter_dataset.py. This creates a balanced dataset for the 6 languages and an 80/10/10 train/validation/test split.
Pre-process the dataset with $ python preprocess_dataset.py. This generates a feature representation (most common trigrams) for the data.
Run $ python main.py with mode set to TRAIN (default), EVAL or TEST. The trained model is saved under checkpoints/model.pth by default.
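
The "most common trigrams" feature representation from the pre-processing step can be sketched roughly as follows. This is a simplified illustration assuming character trigrams and raw counts; the function names, the `top_k` parameter and the toy corpus are hypothetical, and preprocess_dataset.py may differ in its details.

```python
from collections import Counter


def char_trigrams(text):
    """Return all overlapping character trigrams of a lowercased string."""
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]


def build_vocabulary(sentences, top_k):
    """Keep the top_k most frequent trigrams across the given sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(char_trigrams(sentence))
    return [trigram for trigram, _ in counts.most_common(top_k)]


def featurize(sentence, vocabulary):
    """Count how often each vocabulary trigram occurs in the sentence."""
    counts = Counter(char_trigrams(sentence))
    return [counts[trigram] for trigram in vocabulary]


# Hypothetical toy corpus; the real pipeline reads data/sentences.csv.
corpus = ["the cat sat on the mat", "der Hund und die Katze"]
vocab = build_vocabulary(corpus, top_k=5)
print(vocab)
print(featurize("the dog", vocab))
```

Each sentence becomes a fixed-length vector with one entry per vocabulary trigram, which is the kind of input a small classifier can be trained on.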

For reproducibility, the random seed is set to 42 in filter_dataset.py and 420 in main.py. You might want to change these numbers to obtain different results.
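
To illustrate why a fixed seed matters, the sketch below performs a seeded shuffle followed by an 80/10/10 split using only the standard library. The function `split_80_10_10` and the placeholder sentences are assumptions for illustration; this is not the code in filter_dataset.py, and it does not balance languages.

```python
import random


def split_80_10_10(items, seed):
    """Shuffle with a fixed seed, then cut into train/val/test parts."""
    rng = random.Random(seed)  # local generator, leaves global state untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


sentences = [f"sentence {i}" for i in range(100)]  # placeholder data

first = split_80_10_10(sentences, seed=42)
second = split_80_10_10(sentences, seed=42)
other = split_80_10_10(sentences, seed=7)

print(first == second)  # same seed, same split
print(first == other)   # a different seed reshuffles the data
print([len(part) for part in first])
```

Running with the same seed twice yields identical splits, so reported metrics can be compared across runs; changing the seed gives a different partition of the same data.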
