Language Identification

The goal is to identify the language in which a document, message, or sentence is written. To achieve this, I propose the following methodology. First, transform each text into character n-gram features. Second, train a machine-learned classifier that predicts the language of a text from those character n-gram features. Last, evaluate the proposed approach on a variety of datasets and against some of the most popular language identification libraries in the Python ecosystem.
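As a rough illustration of the first two steps, the sketch below extracts character n-grams with the standard library and fits a tiny multinomial Naive Bayes model over them. The choice of Naive Bayes, the n-gram range (1 to 3), the add-one smoothing, the toy sentences, and the helper names `char_ngrams`, `train`, and `predict` are all assumptions made for this example; the notebook may use a different classifier and different feature settings.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Return a Counter of character n-grams for n in [n_min, n_max]."""
    text = f" {text.lower().strip()} "  # pad so word boundaries become features
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def train(samples):
    """Collect per-language n-gram counts from (text, language) pairs."""
    ngram_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, lang in samples:
        ngram_counts[lang].update(char_ngrams(text))
        doc_counts[lang] += 1
    vocab = set()
    for counts in ngram_counts.values():
        vocab.update(counts)
    return {"ngrams": ngram_counts, "docs": doc_counts, "vocab": vocab}


def predict(model, text):
    """Return the language with the highest smoothed log-probability."""
    total_docs = sum(model["docs"].values())
    vocab_size = len(model["vocab"])
    features = char_ngrams(text)
    best_lang, best_score = None, -math.inf
    for lang, counts in model["ngrams"].items():
        total = sum(counts.values())
        score = math.log(model["docs"][lang] / total_docs)  # class prior
        for gram, freq in features.items():
            # Add-one (Laplace) smoothing keeps unseen n-grams from zeroing the score.
            score += freq * math.log((counts[gram] + 1) / (total + vocab_size))
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


# Toy training data, invented for this example only.
samples = [
    ("the quick brown fox jumps over the lazy dog", "en"),
    ("this is a short sentence written in english", "en"),
    ("el rápido zorro marrón salta sobre el perro perezoso", "es"),
    ("esta es una frase corta escrita en español", "es"),
]
model = train(samples)
print(predict(model, "where is the dog"))
print(predict(model, "dónde está el perro"))
```

Padding each text with spaces lets n-grams capture word beginnings and endings, which tend to be distinctive across languages. A real system would need far more training text per language than these four sentences.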

The language identification system is limited to 14 of the 24 official languages of the European Union, namely:

Czech
Danish
Dutch
English
Finnish
French
German
Hungarian
Italian
Polish
Portuguese
Romanian
Spanish
Swedish

These languages were chosen partly because data is available for them, but mainly because they are all written in alphabets based on the modern Latin script.
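Because the shared Latin script is the main selection criterion, one possible pre-check is to reject text written mostly in another script before it reaches the classifier. The sketch below is only an illustration of that idea: the ISO 639-1 codes used as labels, the `SUPPORTED_LANGUAGES` mapping, and the `is_mostly_latin` helper with its 0.9 threshold are hypothetical and not taken from the repository.

```python
import unicodedata

# ISO 639-1 codes for the 14 supported languages (using these codes as
# class labels is an assumption of this example).
SUPPORTED_LANGUAGES = {
    "cs": "Czech", "da": "Danish", "nl": "Dutch", "en": "English",
    "fi": "Finnish", "fr": "French", "de": "German", "hu": "Hungarian",
    "it": "Italian", "pl": "Polish", "pt": "Portuguese", "ro": "Romanian",
    "es": "Spanish", "sv": "Swedish",
}


def is_mostly_latin(text, threshold=0.9):
    """Return True if at least `threshold` of the letters are Latin-script."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    # Unicode names of Latin letters start with "LATIN", including
    # accented ones such as "LATIN SMALL LETTER R WITH CARON".
    latin = sum(
        1 for ch in letters if unicodedata.name(ch, "").startswith("LATIN")
    )
    return latin / len(letters) >= threshold


print(len(SUPPORTED_LANGUAGES))
print(is_mostly_latin("Příliš žluťoučký kůň"))
print(is_mostly_latin("Привет, мир"))
```

Since all 14 languages pass this check, the script alone cannot tell them apart, which is why the classifier relies on character n-gram statistics instead.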
