dan1wang/jsonbook-builder

jsonbook

Wiktionary dump in accessible JSON format.

TL;DR

Just get the files: Wiktionary dump in JSON format

Why

The English Wiktionary graciously provides regular dumps of all its content so anyone can parse the content for their own use. Unfortunately, the parsing part can be daunting and quickly turn any interested developer off. This is because:

  • The dump file is very big (5.5 GB uncompressed).
  • The dump file contains many pages that are not dictionary entries.
  • Wiktionary is a dictionary in English, not an English dictionary, so its entries include a lot of non-English words.
  • The entries aren't in alphabetical order.

In other words, to get to the content you want, you first have to filter through millions of pages you aren't remotely interested in. Or, you can spam the system with lots of API calls.
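
To give a feel for that filtering step, here is a minimal Python sketch (not Jsonbook's own code) that streams a MediaWiki XML export and keeps only pages in namespace 0, the main article namespace. The function names `local_name` and `iter_main_pages` are made up for this illustration, and the sketch assumes the usual export layout of `<page>` elements containing `<title>`, `<ns>` and `<revision><text>`.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_main_pages(source):
    """Yield (title, wikitext) for every page in namespace 0.

    `source` is a file name or a binary file object. The file is read
    incrementally, so the whole dump is never parsed in one go.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        fields = {local_name(child.tag): child for child in elem}
        ns = fields.get("ns")
        if ns is not None and ns.text == "0":
            text = ""
            revision = fields.get("revision")
            if revision is not None:
                for child in revision:
                    if local_name(child.tag) == "text":
                        text = child.text or ""
            yield fields["title"].text, text
        # Drop the page's children so processed pages do not pile up in memory.
        elem.clear()
```

Calling `iter_main_pages("dump.xml")` on a decompressed dump (the file name is a placeholder) would yield one `(title, wikitext)` pair per article. Everything else, such as templates and categories, is skipped.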

Solution

Jsonbook removes the first roadblock to making awesome use of the Wiktionary dump. It does the following:

  1. Retrieve only the word articles
  2. Organize all the articles by language
  3. Convert the text to a hierarchical tree
  4. Save all the content to individual JSON files
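
Steps 2 to 4 can be roughly sketched in Python as follows. This is an illustration under stated assumptions, not Jsonbook's actual implementation or output format. It relies on the Wiktionary convention that a level-2 heading (`==English==`) names a language and deeper headings (`===Adjective===`) nest beneath it. The node shape (`level`, `title`, `text`, `children`) and the function names are hypothetical.

```python
import json
import re

# A wikitext heading: the same run of 2 to 6 '=' signs on both sides of a title.
HEADING = re.compile(r"^(={2,6})\s*(.+?)\s*\1\s*$")


def wikitext_to_tree(wikitext):
    """Nest sections by heading depth and return the level-2 nodes."""
    root = {"level": 1, "title": None, "text": [], "children": []}
    stack = [root]  # path from the root to the section being filled
    for line in wikitext.splitlines():
        match = HEADING.match(line)
        if match:
            level = len(match.group(1))
            node = {"level": level, "title": match.group(2),
                    "text": [], "children": []}
            # Close every open section at the same depth or deeper.
            while stack[-1]["level"] >= level:
                stack.pop()
            stack[-1]["children"].append(node)
            stack.append(node)
        elif line.strip():
            stack[-1]["text"].append(line)
    return root["children"]


def split_by_language(word, wikitext):
    """Return {language name: entry} for one article (hypothetical shape)."""
    return {
        node["title"]: {"word": word, "text": node["text"],
                        "sections": node["children"]}
        for node in wikitext_to_tree(wikitext)
    }


def to_json(entry):
    """Serialize one entry; non-ASCII characters are kept readable."""
    return json.dumps(entry, ensure_ascii=False, indent=2)
```

Each value returned by `split_by_language` could then be written to its own file. The directory and file naming scheme is not shown here because the description above does not specify it.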

See a sample output of the entry for "gratis".

Currently it takes about 58 minutes to parse the entire English Wiktionary dump on a 2013 MacBook Pro.