LangID.jl

A language identification system written in Julia.

This is my course project for the Natural Language Processing course at MUN. It identifies the language of a given text by comparing the text's N-gram frequencies to those stored in a database built from Wikipedia articles in 30 different languages, using cosine similarity as the similarity metric.
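The comparison described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the project's Julia code. The function names, the use of character trigrams and the lowercasing step are all assumptions made for the example; the README does not say which N-gram sizes or preprocessing LangID.jl actually uses.

```python
import math
from collections import Counter


def ngram_counts(text, n=3):
    """Count overlapping character n-grams (n=3 is an assumed default)."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse n-gram count vectors."""
    dot = sum(count * b.get(gram, 0) for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as similar to nothing
    return dot / (norm_a * norm_b)


def identify(text, profiles, n=3):
    """Return the best-matching language and the score for every language.

    `profiles` maps a language code to its n-gram counts (hypothetical layout).
    """
    query = ngram_counts(text, n)
    scores = {lang: cosine_similarity(query, p) for lang, p in profiles.items()}
    return max(scores, key=scores.get), scores
```

Given a dictionary of per-language profiles built with `ngram_counts`, `identify("some text", profiles)` returns the language whose profile has the highest cosine similarity to the input, together with all scores.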

Requires Julia 0.5, together with the ArgParse, HttpServer, ProgressMeter, Formatting and SQLite packages (install with Pkg.add("PackageName")).

Usage

Running julia LangID.jl for the first time creates the N-gram database. This requires an internet connection, since the program queries articles from the Wikipedia API. Building the database takes about 2.5 hours, and the resulting file is ~130 MB in size.

Once the database is created, it is saved under ngrams.sqlite by default. (A different filename can be given as a parameter to the program.) The command line version can then be run with julia LangID.jl.
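As a rough illustration of how per-language N-gram counts could be kept in a SQLite file, here is a small Python sketch. The table name, the column names and all function names are hypothetical; the actual schema of ngrams.sqlite is not described in this README, and the project itself uses the Julia SQLite package rather than Python.

```python
import sqlite3


def open_db(path="ngrams.sqlite"):
    """Open (or create) the database file; the table layout is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ngrams ("
        "lang TEXT NOT NULL, gram TEXT NOT NULL, freq INTEGER NOT NULL, "
        "PRIMARY KEY (lang, gram))"
    )
    return conn


def save_profile(conn, lang, counts):
    """Store the n-gram counts of one language, replacing existing rows."""
    conn.executemany(
        "INSERT OR REPLACE INTO ngrams (lang, gram, freq) VALUES (?, ?, ?)",
        [(lang, gram, freq) for gram, freq in counts.items()],
    )
    conn.commit()


def load_profiles(conn):
    """Read all rows back as {language: {ngram: count}}."""
    profiles = {}
    for lang, gram, freq in conn.execute("SELECT lang, gram, freq FROM ngrams"):
        profiles.setdefault(lang, {})[gram] = freq
    return profiles
```

Passing a different `path` to `open_db` mirrors the option of giving the program another database filename.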

For a web UI, you'll need to install vue, vue-resource and chart.js with npm (in the src directory) and then run browserify app.js > bundle.js in the static directory. After this, the server can be started with julia LangID.jl --serve. It serves the UI on localhost:8000.

This could be extended to visualize the similarities between the different languages contained in the database -- some sort of clustering would be an interesting experiment.
