Wikokit - Machine-readable Wiktionary

Language is a city to the building of which every human being brought a stone.

Ralph Waldo Emerson

Stone I. The wikokit parser. This program parses Wiktionaries, then constructs and fills machine-readable Wiktionaries.

Stone II. PHP API (piwidict project) for working with the machine-readable Wiktionary.

The goal of this project is to extract semi-structured information from Wiktionary and construct a machine-readable dictionary (database + API + GUI).

Download new Wiktionary parsed databases from Academic Torrents:

Russian Wiktionary parsed ruwikt20230901;
English Wiktionary parsed enwikt20231001.

Archives of Wiktionary parsed databases are available at whinger.krc.karelia.ru/soft/wikokit.

How to import a dump of the parsed Wiktionary into MySQL (in Russian).

Stone I: Parser and dictionary description

I) The ultimate goal (in the distant future) is to extract all information (i.e. all sections of each entry) from all Wiktionaries and convert the data to a machine-readable format.

II) Current result. The machine-readable Wiktionary now contains the following information extracted from the Russian Wiktionary and the English Wiktionary:

word's language and part of speech;
meanings / definitions;
semantic relations;
translations;
(^) context labels (from definitions);
(^) quotations (text + bibliographic data).

(^) Context labels and quotations have been extracted only from the Russian Wiktionary.
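
The kinds of information listed above can be pictured as one nested record per dictionary entry. The following is a minimal sketch in Python, not the wikokit data model: every class name, field name, and the sample data are hypothetical and chosen only to mirror the list (language, part of speech, meanings, semantic relations, translations, context labels, quotations).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Quotation:
    """Quotation text plus (optional) bibliographic data. Hypothetical type."""
    text: str
    author: Optional[str] = None
    year: Optional[int] = None


@dataclass
class Meaning:
    """One definition with the data attached to it. Hypothetical type."""
    definition: str
    labels: List[str] = field(default_factory=list)                   # context labels
    relations: Dict[str, List[str]] = field(default_factory=dict)     # relation type -> words
    translations: Dict[str, List[str]] = field(default_factory=dict)  # language code -> words
    quotations: List[Quotation] = field(default_factory=list)


@dataclass
class Entry:
    """A word in one language with one part of speech. Hypothetical type."""
    word: str
    language: str
    part_of_speech: str
    meanings: List[Meaning] = field(default_factory=list)


def synonyms(entry: Entry) -> List[str]:
    """Collect synonyms across all meanings, keeping first-seen order."""
    seen: List[str] = []
    for meaning in entry.meanings:
        for word in meaning.relations.get("synonyms", []):
            if word not in seen:
                seen.append(word)
    return seen


# Made-up sample data, for illustration only.
entry = Entry(
    word="car",
    language="en",
    part_of_speech="noun",
    meanings=[
        Meaning(
            definition="A wheeled motor vehicle.",
            labels=["transport"],
            relations={"synonyms": ["automobile", "auto"]},
            translations={"ru": ["автомобиль"]},
        )
    ],
)
print(synonyms(entry))
```

In the actual project this information is stored in relational tables rather than in nested objects; the sketch only shows how the pieces relate to one another.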

Parsed Wiktionary database schema

The structure (tables and relations) of the parsed Wiktionary database (database layout) is shown in the file wikt_parsed_empty_with_foreign_keys.png.

[Figure: set of tables related to quotations (fragment of the parsed Wiktionary database)]

State of the art and Future work

[Figure: machine-readable Wiktionary framework]

I would like all two hundred Wiktionaries to be parsed by this parser, but I know only Russian and English :)

If you are a developer and you are interested in adding modules to parse "your Wiktionary", then:

start with the paper describing the database (tables and relations) of the machine-readable Wiktionary: Transformation of Wiktionary entry structure into tables and relations in a relational database schema. 2010. Note that there are new tables (absent from the publication) related to quotations and context labels, see Machine-readable database schema;
GettingStartedWiktionaryParser - install the parser and try to parse the English Wiktionary and the Russian Wiktionary;
Play with parsed English or Russian Wiktionary SQL - download dumps of parsed Wiktionary databases from Academic Torrents;
OneMoreWiktionary - extend the parser in order to extract invaluable information from your Wiktionary.
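
To give a feel for what "playing with" a parsed database looks like, here is a minimal sketch of a translation lookup. It makes several assumptions: it uses an in-memory SQLite database so that it runs without a server (the published dumps are imported into MySQL), and the two tables, their columns, and the sample rows are hypothetical simplifications. They are not the actual wikokit schema, which is described in the paper and schema diagram mentioned above.

```python
import sqlite3

# In-memory database with a simplified, hypothetical layout (NOT the real schema).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE entry (
        id   INTEGER PRIMARY KEY,
        word TEXT NOT NULL,
        lang TEXT NOT NULL,
        pos  TEXT NOT NULL
    );
    CREATE TABLE translation (
        entry_id    INTEGER NOT NULL REFERENCES entry(id),
        target_lang TEXT NOT NULL,
        text        TEXT NOT NULL
    );
    """
)

# Made-up sample rows, for illustration only.
conn.execute("INSERT INTO entry VALUES (1, 'water', 'en', 'noun')")
conn.executemany(
    "INSERT INTO translation VALUES (?, ?, ?)",
    [(1, "ru", "вода"), (1, "de", "Wasser")],
)


def translations(conn, word, target_lang):
    """Return translations of `word` into `target_lang`, sorted alphabetically."""
    rows = conn.execute(
        "SELECT t.text FROM translation AS t "
        "JOIN entry AS e ON e.id = t.entry_id "
        "WHERE e.word = ? AND t.target_lang = ? "
        "ORDER BY t.text",
        (word, target_lang),
    ).fetchall()
    return [row[0] for row in rows]


print(translations(conn, "water", "ru"))
```

Against a real dump, the same idea applies with the real table and column names and a MySQL connection in place of `sqlite3`.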

Statistics

The machine-readable dictionary database statistics:

English Wiktionary: total, semantic relations, translations, part of speech
Russian Wiktionary: total, semantic relations, translations, part of speech, context labels, quote (languages & sources, authors with clusters, other authors, years)

Project structure

Wiki tool kit (wikokit) contains several wiki-related projects:

./common_wiki - common (low-level) functions to handle Wikipedia and Wiktionary data in a MySQL database.

./common_wiki_jdbc - functions to handle Wiktionary data in MySQL and SQLite databases (JDBC, Java SE) (depends on common_wiki.jar).

./android/common_wiki_alink - Eclipse copy (source link) of ./common_wiki (!NetBeans).

./android/common_wiki_android - functions for access to Wiktionary in the Android SQLite version of the database (depends on common_wiki.jar).

./android/magnetowordik - Android word game (Wiktionary thesaurus).

./hits_wiki - API for access to Wikipedia in a MySQL database, algorithms to search for synonyms in Wikipedia (depends on jcfd.jar, common_wiki.jar).

./TGWikiBrowser - visual browser to search for synonyms in a local or remote Wikipedia (depends on hits_wiki.jar and common_wiki.jar).

./wikidf - Wiki Index Database (list of lemmas and links to wiki pages which contain these lemmas).

./wikt_parser - the Wiktionary parser, which creates a MySQL database (like WordNet) from a Wiktionary MySQL dump file. The project goal is to convert Wiktionary articles to a machine-readable format (depends on common_wiki, common_wiki_jdbc).

./wiwordik - visualization of the parsed Wiktionary database. wiki + word = wiwordik.

Code from the earlier project Synarcher is used in wikokit.
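
The index database described for ./wikidf above (lemmas linked to the wiki pages that contain them) is essentially an inverted index. The sketch below illustrates that idea only and is not the wikidf implementation: the function name and the sample pages are hypothetical, and lowercased word tokens stand in for lemmas, whereas a real index would need a lemmatizer.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each token to the set of page titles whose text contains it.

    `pages` is a dict of page title -> page text. Lowercased tokens are used
    as a crude stand-in for lemmas (an assumption made for this sketch).
    """
    index = defaultdict(set)
    for title, text in pages.items():
        for token in re.findall(r"\w+", text.lower()):
            index[token].add(title)
    return index


# Made-up sample pages, for illustration only.
pages = {
    "Lake": "A lake is a body of water.",
    "River": "A river is flowing water.",
}
index = build_index(pages)
print(sorted(index["water"]))
```

Looking up a token then returns every page that mentions it, which is the kind of lemma-to-page link the index database stores.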

License

This program is multi-licensed and may be used under the terms of any of the following licenses:

EPL, Eclipse Public License V1.0 or later, http://www.eclipse.org/legal
LGPL, GNU Lesser General Public License V3.0 or later, http://www.gnu.org/licenses/lgpl.html
GPL, GNU General Public License V3.0 or later, http://www.gnu.org/licenses/gpl.html
AL, Apache License, V2.0 or later, http://www.apache.org/licenses
BSD, New BSD License, http://www.opensource.org/licenses/bsd-license

See documentation.
