karoliina/LangID.jl

LangID.jl

A language identification system written in Julia.

This is my course project for the Natural Language Processing course at MUN. It identifies the language of a given text by comparing the text's N-gram frequencies to those stored in a database built from Wikipedia articles in 30 different languages, using cosine similarity as the similarity metric.
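As an illustration of the approach described above, here is a minimal sketch of N-gram counting and cosine-similarity matching. It is not taken from LangID.jl: the function names (ngram_counts, cosine_similarity, identify), the use of character trigrams, and the tiny in-memory profiles are all assumptions made for the example, and it is written for a current Julia release rather than tested on 0.5.

```julia
# Hypothetical helpers for illustration; not the actual LangID.jl code.

# Count overlapping character N-grams in `text`.
function ngram_counts(text::AbstractString, n::Int=3)
    chars = collect(lowercase(text))   # collect so multi-byte characters index safely
    counts = Dict{String,Int}()
    for i in 1:(length(chars) - n + 1)
        gram = join(chars[i:i+n-1])
        counts[gram] = get(counts, gram, 0) + 1
    end
    return counts
end

# Cosine similarity between two sparse count vectors stored as dictionaries.
function cosine_similarity(a::Dict{String,Int}, b::Dict{String,Int})
    dotprod = 0.0
    for (gram, count) in a
        dotprod += count * get(b, gram, 0)
    end
    norm_a = sqrt(sum(Float64[c^2 for c in values(a)]))
    norm_b = sqrt(sum(Float64[c^2 for c in values(b)]))
    if norm_a == 0.0 || norm_b == 0.0
        return 0.0                     # an empty profile is similar to nothing
    end
    return dotprod / (norm_a * norm_b)
end

# Return the language whose profile is most similar to `text`, with its score.
function identify(text::AbstractString, profiles::Dict{String,Dict{String,Int}}, n::Int=3)
    query = ngram_counts(text, n)
    best_lang, best_score = "", -1.0
    for (lang, profile) in profiles
        score = cosine_similarity(query, profile)
        if score > best_score
            best_lang, best_score = lang, score
        end
    end
    return best_lang, best_score
end

# Toy profiles built from single sentences (assumed data, far smaller than a real database).
profiles = Dict(
    "en" => ngram_counts("the quick brown fox jumps over the lazy dog and then the cat"),
    "fi" => ngram_counts("nopea ruskea kettu hyppää laiskan koiran yli ja sitten kissan"),
)

lang, score = identify("the dog and the fox", profiles)
println(lang, " (similarity ", round(score, digits=3), ")")
```

Storing counts in dictionaries keeps the vectors sparse, so only N-grams that actually occur in the query text contribute to the dot product.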

Requires Julia 0.5, together with the ArgParse, HttpServer, ProgressMeter, Formatting and SQLite packages (install with Pkg.add("PackageName")).

Usage

On the first run, julia LangID.jl creates the N-gram database. This requires an internet connection, since the program queries articles from the Wikipedia API. Building the database takes about 2.5 hours, and the resulting file is ~130 MB in size.

The database is saved as ngrams.sqlite by default (a different filename can be given as a parameter to the program). Once it exists, the command line version can be run with julia LangID.jl.

For a web UI, you'll need to install vue, vue-resource and chart.js with npm (in the src directory) and then run browserify app.js > bundle.js in the static directory. After this, the server can be started with julia LangID.jl --serve. It serves the UI on localhost:8000.

This could be extended to visualize the similarities between the different languages contained in the database -- some sort of clustering would be an interesting experiment.
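A possible first step toward that experiment is a pairwise similarity matrix over the language profiles, from which each language's nearest neighbour can be read off. The sketch below is hypothetical and not part of LangID.jl: the function names are made up for illustration, the N-gram counts are invented toy values rather than data from the database, and the code targets a current Julia release.

```julia
# Hypothetical sketch; names and data are illustrative assumptions.

# Cosine similarity between two N-gram count dictionaries.
function cosine(a::Dict{String,Int}, b::Dict{String,Int})
    dotprod = 0.0
    for (gram, count) in a
        dotprod += count * get(b, gram, 0)
    end
    norm_a = sqrt(sum(Float64[c^2 for c in values(a)]))
    norm_b = sqrt(sum(Float64[c^2 for c in values(b)]))
    return (norm_a == 0.0 || norm_b == 0.0) ? 0.0 : dotprod / (norm_a * norm_b)
end

# Build a symmetric matrix of pairwise similarities between all profiles.
function similarity_matrix(profiles::Dict{String,Dict{String,Int}})
    langs = sort(collect(keys(profiles)))   # fixed ordering for rows and columns
    k = length(langs)
    sims = zeros(k, k)
    for i in 1:k, j in i:k
        s = cosine(profiles[langs[i]], profiles[langs[j]])
        sims[i, j] = s
        sims[j, i] = s
    end
    return langs, sims
end

# For each language, find the other language with the highest similarity.
function nearest_neighbours(langs, sims)
    nearest = Dict{String,String}()
    for i in 1:length(langs)
        best_j, best = 0, -Inf
        for j in 1:length(langs)
            if j != i && sims[i, j] > best
                best_j, best = j, sims[i, j]
            end
        end
        if best_j > 0
            nearest[langs[i]] = langs[best_j]
        end
    end
    return nearest
end

# Invented trigram counts, for demonstration only.
profiles = Dict(
    "es" => Dict("de " => 9, "que" => 7, "os " => 5),
    "pt" => Dict("de " => 8, "que" => 6, "ão " => 4),
    "fi" => Dict("en " => 9, "ist" => 5, "ssa" => 6),
)

langs, sims = similarity_matrix(profiles)
nearest = nearest_neighbours(langs, sims)
for lang in langs
    println(lang, " is closest to ", nearest[lang])
end
```

The same matrix could serve as input to a hierarchical clustering routine, using 1 minus the similarity as a distance.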
