langdetect

A language detection library written in Python. The implementation is based on n-gram text categorization, as described in the article N-Gram-Based Text Categorization.
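To make the idea concrete, here is a minimal sketch of the ranked n-gram profile and "out-of-place" distance described in that article. It is not the code in langdetect.py: the function names, the `_` padding character, and the `max_n` and `top_k` defaults are assumptions chosen for illustration.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top_k=300):
    """Return the top_k most frequent character n-grams, most frequent first.

    Hypothetical helper: max_n and top_k are illustrative defaults.
    """
    counts = Counter()
    for token in text.lower().split():
        padded = "_" + token + "_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                gram = padded[i:i + n]
                if gram.strip("_"):  # skip n-grams made only of padding
                    counts[gram] += 1
    # Sort by descending count, then alphabetically so ties are deterministic
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a maximum penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    max_penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


# Toy training samples; real profiles need far more text per language
samples = {
    "en": "the quick brown fox jumps over the lazy dog",
    "es": "el zorro salta sobre el perro perezoso",
}
profiles = {code: ngram_profile(sample) for code, sample in samples.items()}
doc = ngram_profile("the dog and the fox")
# The language whose profile is closest to the document profile wins
closest = min(profiles, key=lambda code: out_of_place(doc, profiles[code]))
```

A smaller distance means the document's most frequent n-grams sit at similar ranks in the language profile. How the library turns such distances into the probabilities it reports is not covered by this sketch.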

Usage

Given a text, it returns a list of MAX_RESULTS tuples, sorted by the probability that the text belongs to each language, most probable first. Each tuple is (language_code, probability). The language codes follow the ISO 639-1 standard.

Library usage:

>>> text = """Automatic summarization is the process of reducing a text document with a
computer program in order to create a summary that retains the most important points
of the original document. As the problem of information overload has grown, and as
the quantity of data has increased, so has interest in automatic summarization.
Technologies that can make a coherent summary take into account variables such as
length, writing style and syntax. An example of the use of summarization technology
is search engines such as Google. Document summarization is another."""

>>> import langdetect as ld
>>> print ld.detect_language(text)
[('en', 0.9201609943007083), ('fr', 0.07217134307468472), ('ro', 0.0076676626246070185)]
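
If only the single most likely language is needed, the returned list can be reduced to one code. The sketch below works on the list of (language_code, probability) tuples shown above; `best_guess` and its `min_probability` threshold are hypothetical and not part of the library.

```python
def best_guess(results, min_probability=0.5):
    """Return the most probable language code, or None if the list is empty
    or the top probability is below min_probability.

    Hypothetical helper, not part of langdetect.
    """
    if not results:
        return None
    code, probability = max(results, key=lambda pair: pair[1])
    return code if probability >= min_probability else None


# The output printed in the session above
results = [('en', 0.9201609943007083), ('fr', 0.07217134307468472), ('ro', 0.0076676626246070185)]
language = best_guess(results)  # 'en'
```

Returning None for a low top probability lets the caller treat short or mixed-language texts as undetermined instead of accepting a weak guess.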

Because building the models for every language is a time-consuming operation, build the profiles once and reuse them when performing many detections:

>>> profiles = ld.create_languages_profiles()
>>> ld.detect_language(text, profiles)

Command-line usage:

cd path/to/folder/langdetect/
python langdetect.py -f FILE

Datasets

The datasets used to train, validate and test the software were collected from Wikipedia articles with this scraper.

Tests

After cloning the repository, the tests can be run with:

python test_langdetect.py

This prints the detection precision on the train, validation and test datasets for every language. The results are useful for checking the effect of changing the training dataset or adjusting the parameters of the algorithm.
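
As an illustration of a per-language score, the sketch below computes the fraction of documents of each language that were detected correctly. This is an assumption about what such a report might contain: test_langdetect.py may compute its precision differently, and `per_language_accuracy` is a hypothetical name.

```python
from collections import defaultdict


def per_language_accuracy(pairs):
    """pairs: iterable of (expected_code, detected_code) tuples.

    Returns {expected_code: fraction of its documents detected correctly}.
    Hypothetical helper, not taken from test_langdetect.py.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for expected, detected in pairs:
        totals[expected] += 1
        if expected == detected:
            correct[expected] += 1
    # float() keeps the division exact on Python 2 as well
    return {code: correct[code] / float(totals[code]) for code in totals}


scores = per_language_accuracy([("en", "en"), ("en", "fr"), ("fr", "fr")])
# scores == {"en": 0.5, "fr": 1.0}
```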

Available Languages

Code	Language
ar	Arabic
cs	Czech
da	Danish
en	English
et	Estonian
fi	Finnish
fr	French
de	German
el	Greek
he	Hebrew
hu	Hungarian
it	Italian
lv	Latvian
lt	Lithuanian
no	Norwegian
fa	Persian
pl	Polish
pt	Portuguese
ro	Romanian
ru	Russian
sk	Slovak
es	Spanish
sv	Swedish

Adding a new language

Add the dataset for the new language inside the datasets/train directory. The dataset should consist of text files in that language, placed in a directory named after the language code. According to the article, 100 kilobytes of text is enough.
Add the language code and name to the LANGUAGES dictionary in the langdetect.py file.
OPTIONAL: to test the new language, add a test dataset under the datasets/test directory. Then add the language code to the TESTING_LANGUAGES dictionary in the test_langdetect.py file.
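
Before adding the dictionary entry, it can help to check that the new training directory holds enough text. The following is a small sketch assuming the datasets/train/<code>/ layout described above; the helper names and the exact 100 * 1024 byte threshold are assumptions, and pathlib requires Python 3.4 or later.

```python
from pathlib import Path

# Roughly the 100 kilobytes mentioned above; the exact value is an assumption
MIN_BYTES = 100 * 1024


def training_bytes(train_dir, code):
    """Total size in bytes of all files under train_dir/code."""
    lang_dir = Path(train_dir) / code
    return sum(p.stat().st_size for p in lang_dir.rglob("*") if p.is_file())


def has_enough_data(train_dir, code, min_bytes=MIN_BYTES):
    """True if the language directory holds at least min_bytes of data."""
    return training_bytes(train_dir, code) >= min_bytes
```

For example, `has_enough_data("datasets/train", "nl")` would check a hypothetical Dutch dataset. File size is only a rough proxy for the amount of usable text.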
