componavt/wikokit

Language is a city to the building of which every human being brought a stone.

Ralph Waldo Emerson

Wikokit - Machine-readable Wiktionary

Stone I. Parser wikokit. This program parses Wiktionaries, then constructs and populates machine-readable Wiktionaries.

Stone II. PHP API (piwidict project) for working with the machine-readable Wiktionary.

The goal of this project is to extract semi-structured information from Wiktionary and construct a machine-readable dictionary (database + API + GUI).

Download new Wiktionary parsed databases from Academic Torrents:

Archives of Wiktionary parsed databases are available at whinger.krc.karelia.ru/soft/wikokit.

How to import a dump of the parsed Wiktionary into MySQL (in Russian).

Stone I: Parser and dictionary description

I) The ultimate goal (in the distant future) is to extract all information (i.e. all sections of an entry) from all Wiktionaries and convert the data to a machine-readable format.

II) Today's result. The machine-readable Wiktionary now contains the following information extracted from the Russian Wiktionary and the English Wiktionary:

  1. word's language and part of speech;
  2. meanings / definitions;
  3. semantic relations;
  4. translations;
  5. (^) context labels (from definitions);
  6. (^) quotations (text + bibliographic data).

(^) Context labels and quotations were extracted only from Russian Wiktionary.
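To make the list above more concrete, here is a minimal sketch in Python (not the project's own Java parser) of how the first two items, language, part of speech, and definitions, might be pulled out of English Wiktionary wikitext. The names `Entry`, `parse_entry`, and `POS_HEADINGS` are hypothetical, and the sample text is made up for illustration. The sketch assumes only the common layout in which `==Language==` and `===Part of speech===` headings precede definition lines that start with `# `; real entries are far less regular.

```python
import re
from dataclasses import dataclass, field

# Hypothetical, deliberately short list of part-of-speech headings.
POS_HEADINGS = {"Noun", "Verb", "Adjective", "Adverb"}

# Matches wiki headings such as "==English==" or "===Noun===".
HEADING = re.compile(r"^(={2,})\s*(.+?)\s*\1\s*$")


@dataclass
class Entry:
    """One (word, language, part of speech) block with its definitions."""
    word: str
    language: str
    part_of_speech: str
    definitions: list[str] = field(default_factory=list)


def parse_entry(word: str, wikitext: str) -> list[Entry]:
    """Collect definitions grouped by language and part of speech."""
    entries: list[Entry] = []
    language = None
    current = None
    for line in wikitext.splitlines():
        match = HEADING.match(line)
        if match:
            level, title = len(match.group(1)), match.group(2)
            if level == 2:
                # A level-2 heading starts a new language section.
                language, current = title, None
            elif language is not None and title in POS_HEADINGS:
                current = Entry(word, language, title)
                entries.append(current)
            else:
                # Any other heading (e.g. Synonyms) ends the definition list.
                current = None
        elif current is not None and line.startswith("# "):
            # "#:" (examples) and "#*" (quotations) lines are skipped here.
            current.definitions.append(line[2:].strip())
    return entries


# Made-up sample text in the assumed layout.
sample = """==English==
===Noun===
# A domesticated feline.
#: The cat sat on the mat.
===Verb===
# To hoist an anchor.
"""

entries = parse_entry("cat", sample)
for entry in entries:
    print(entry.language, entry.part_of_speech, entry.definitions)
```

Semantic relations, translations, context labels, and quotations would need their own section and template handling, which this sketch leaves out.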

Parsed Wiktionary database schema

The structure (tables and relations) of the Wiktionary parsed database; for the database layout, see the file wikt_parsed_empty_with_foreign_keys.png:

Wiktionary parsed database

Set of tables related to quotations (fragment of the Wiktionary parsed database):

quotations tables of the Wiktionary parsed database
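As a rough illustration of how a relational layout of this kind can be queried, the sketch below builds a tiny in-memory SQLite database and looks up the translations of a word. The table and column names (`page`, `meaning`, `translation`, `lang_code`, and so on) and the `translations` helper are simplified assumptions made for this example, not the project's actual schema; the schema diagram remains the reference for the real tables.

```python
import sqlite3

# In-memory database with a simplified, hypothetical three-table layout.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (
    id INTEGER PRIMARY KEY,
    page_title TEXT NOT NULL
);
CREATE TABLE meaning (
    id INTEGER PRIMARY KEY,
    page_id INTEGER NOT NULL REFERENCES page(id),
    definition TEXT NOT NULL
);
CREATE TABLE translation (
    id INTEGER PRIMARY KEY,
    meaning_id INTEGER NOT NULL REFERENCES meaning(id),
    lang_code TEXT NOT NULL,
    translated_text TEXT NOT NULL
);
""")

# Made-up sample rows: one word, one meaning, two translations.
conn.execute("INSERT INTO page VALUES (1, 'cat')")
conn.execute("INSERT INTO meaning VALUES (1, 1, 'A domesticated feline.')")
conn.executemany(
    "INSERT INTO translation VALUES (?, ?, ?, ?)",
    [(1, 1, "ru", "кошка"), (2, 1, "de", "Katze")],
)


def translations(connection, word):
    """Return (lang_code, translated_text) pairs for every meaning of a word."""
    query = """
        SELECT t.lang_code, t.translated_text
        FROM page AS p
        JOIN meaning AS m ON m.page_id = p.id
        JOIN translation AS t ON t.meaning_id = m.id
        WHERE p.page_title = ?
        ORDER BY t.lang_code
    """
    return connection.execute(query, (word,)).fetchall()


print(translations(conn, "cat"))
```

Attaching translations to a meaning rather than directly to a page keeps each translation tied to one sense of the word, which is the main reason for the extra join.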

State of the art and Future work

Diagram: Machine-readable Wiktionary framework.

I would like all two hundred Wiktionaries to be parsed by this parser, but I know only Russian and English :)

If you are a developer and are interested in adding modules to parse "your Wiktionary", then

Statistics

The machine-readable dictionary database statistics:

Project structure

Wiki tool kit (wikokit) contains several wiki-related projects:

./common_wiki — common (low-level) functions to handle data of Wikipedia and Wiktionary in a MySQL database.

./common_wiki_jdbc — functions to handle data of Wiktionary in MySQL and SQLite databases (JDBC, Java SE) (depends on common_wiki.jar).

./android/common_wiki_alink — Eclipse copy (source link) of ./common_wiki (not NetBeans).

./android/common_wiki_android — functions for access to Wiktionary in Android SQLite version of database (depends on common_wiki.jar).

./android/magnetowordik — Android word game (Wiktionary thesaurus).

./hits_wiki — API for access to Wikipedia in MySQL database, algorithms to search synonyms in Wikipedia (depends on jcfd.jar, common_wiki.jar).

./TGWikiBrowser — visual browser to search for synonyms in a local or remote Wikipedia (depends on hits_wiki.jar and common_wiki.jar).

./wikidf — Wiki Index Database (list of lemmas and links to wiki pages, which contain these lemmas).

./wikt_parser — Wiktionary parser that creates a MySQL database (like WordNet) from a Wiktionary MySQL dump file. The project goal is to convert Wiktionary articles to a machine-readable format (depends on common_wiki and common_wiki_jdbc).

./wiwordik — Visualization of parsed Wiktionary database. wiki + word = wiwordik.

Code from the earlier Synarcher project is used in wikokit.

Further reading

In English

In Russian

See also

License

This program is multi-licensed and may be used under the terms of any of the following licenses:

See documentation.