Frequency dictionaries for CHJ (Corpus of Historical Japanese), SHC (Showa-Heisei Corpus of written Japanese) and NWJC (NINJAL Web Japanese Corpus).

uncomputable/frequency-dict

Frequency dictionaries for Yomichan

High-quality frequency dictionaries ready to be imported into Yomichan.

Generate frequency dictionaries from source for customization.

A frequency dictionary displays the ranked frequency (1st most frequent, 2nd most frequent, ...) of a word inside a context (written language, spoken language, web, Showa era, Heisei era, ...).
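As a rough illustration of what "ranked frequency" means, the sketch below turns raw occurrence counts into ranks. The function name `rank_words` and the toy counts are made up for this example and are not taken from the repository or from any corpus.

```python
from collections import Counter


def rank_words(counts):
    """Return {word: rank}, where rank 1 is the most frequent word."""
    # Sort by descending count; break ties alphabetically so the result is stable.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word: rank for rank, (word, _) in enumerate(ordered, start=1)}


# Hypothetical toy counts, for illustration only.
toy_counts = Counter({"する": 120, "言う": 45, "猫": 7})
ranks = rank_words(toy_counts)
print(ranks)
```

A real corpus may rank tied words differently; the alphabetical tie-break here is only one possible choice.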

Frequency dictionaries can help language learners distinguish common words from uncommon ones.

Features

Latest data

The data is kept up to date with the latest releases from NINJAL.

Unique dictionaries

Learn how words changed in frequency throughout history (CHJ, SHC).

Learn about frequent words on the Japanese web (NWJC).

Careful merging of files

When compiling a frequency dictionary from several files, one has to be careful not to count the same word occurrence twice. Doing so would corrupt the resulting word frequencies.

The dictionaries in this repo are vetted against double-counting.
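The README does not say how the merging is vetted. One possible guard, sketched here under that caveat, is to tag every input with a source identifier and refuse to add the same source twice. The name `merge_counts` and the source identifiers are hypothetical.

```python
from collections import Counter


def merge_counts(sources):
    """Sum per-word counts from (source_id, counts) pairs, using each source once."""
    merged = Counter()
    seen = set()
    for source_id, counts in sources:
        if source_id in seen:
            # Adding the same source again would count its occurrences twice.
            continue
        seen.add(source_id)
        merged.update(counts)  # Counter.update adds counts instead of replacing them
    return merged


# Hypothetical inputs: "file_a" is listed twice by mistake.
merged = merge_counts([
    ("file_a", {"猫": 3, "犬": 1}),
    ("file_b", {"猫": 2}),
    ("file_a", {"猫": 3, "犬": 1}),
])
print(merged)
```

This only catches exact duplicates of a source. Overlap between different files, such as a total column alongside its sub-corpus columns, has to be checked against the corpus documentation.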

Frequency rank cap

The default dictionaries include only the 50k most frequent words. This keeps the files small and keeps the learner focused on what matters: frequent words. Fluency is commonly estimated to require a vocabulary of around 10k to 20k words.
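A minimal sketch of such a cap, assuming the ranks are held in a plain `{word: rank}` mapping. The helper `cap_ranks` is hypothetical and is not the script's actual interface.

```python
def cap_ranks(ranks, limit=50_000):
    """Keep only the words whose rank is within the limit."""
    return {word: rank for word, rank in ranks.items() if rank <= limit}


# Hypothetical ranks, for illustration only.
example_ranks = {"する": 1, "猫": 1_234, "珍語": 73_500}
capped = cap_ranks(example_ranks)
print(capped)
```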

Included dictionaries

You can find the dictionaries of the following corpora as GitHub releases.

The dictionary file shares the same license as its source data.

CHJ (Corpus of Historical Japanese)

Creative Commons License

A corpus that covers different eras of Japanese history.

The corpus ranges from the Nara period through the Edo period and Meiji era up to the Taishō era.

To track words across eras, two dictionaries are generated:

  1. A dictionary for the premodern part (Nara to Edo)
  2. A dictionary for the modern part (Meiji to Taishō)

The corpus is likely too small to generate dictionaries for each era.
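The split into two dictionaries can be pictured as pooling per-era counts into two groups. The sketch below assumes the counts are available per era and that the eras between Nara and Edo are labeled as shown. Both the labels and the function `split_by_period` are illustrative and do not describe the corpus file layout.

```python
from collections import Counter

# Assumed era labels; the real corpus files may name or subdivide them differently.
PREMODERN = {"Nara", "Heian", "Kamakura", "Muromachi", "Edo"}
MODERN = {"Meiji", "Taisho"}


def split_by_period(era_counts):
    """Pool {era: {word: count}} into (premodern, modern) counters."""
    premodern, modern = Counter(), Counter()
    for era, counts in era_counts.items():
        if era in PREMODERN:
            premodern.update(counts)
        elif era in MODERN:
            modern.update(counts)
        else:
            raise ValueError(f"unknown era: {era}")
    return premodern, modern


# Hypothetical counts, for illustration only.
premodern, modern = split_by_period({
    "Heian": {"あはれ": 4},
    "Edo": {"あはれ": 1, "猫": 2},
    "Meiji": {"猫": 5},
})
print(premodern, modern)
```

Pooling several eras gives each group more data to rank from, which fits the remark that the corpus is likely too small for one dictionary per era.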

SHC (Showa-Heisei Corpus of written Japanese)

Creative Commons License

A corpus that covers the Showa and Heisei eras of Japanese history.

There is one dictionary for both eras.

NWJC (NINJAL Web Japanese Corpus)

Creative Commons License

A corpus which was created by crawling the web.

Supported dictionaries

The licenses of the following corpora don't allow me to upload derived dictionaries.

My solution is to publish the raw data in a separate repo.

Use my script to generate a frequency dictionary on your local machine.

BCCWJ (Balanced Corpus of Contemporary Written Japanese)

Creative Commons License

One of the largest and most popular corpora out there. It focuses on written language.

Creative Commons License

Another popular corpus with a focus on spoken language.

Set up the runtime environment

Use nix

Enter the provided nix shell.

nix-shell

Use pip

Create a virtual environment and use pip to install the dependencies.

python3 -m venv venv && source venv/bin/activate
pip install -r requirements.txt

Run the script

Run the script on the command line with the desired arguments.

python3 main.py [arguments...]

For example, generate the frequency dictionary for BCCWJ (short-unit words) like so:

python3 main.py bccjw BCCWJ_frequencylist_suw_ver1_1.tsv

There is help in case you get stuck.

python3 main.py --help
python3 main.py bccjw --help

Import the dictionary

Open the Yomichan settings in your browser and click "Import Dictionary".

Select the zip file and wait for it to be processed.

The dictionary should now be working.
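For orientation, a Yomichan frequency dictionary is a zip archive containing an `index.json` and one or more `term_meta_bank_N.json` files, where each entry has the form `[term, "freq", value]`. The sketch below builds such an archive in memory. Whether `main.py` produces exactly this layout is an assumption, and `build_frequency_zip`, the title, and the revision string are made up for the example.

```python
import io
import json
import zipfile


def build_frequency_zip(title, ranks):
    """Return the bytes of a Yomichan-style frequency dictionary archive."""
    # "format": 3 is the current Yomichan dictionary format version.
    index = {"title": title, "format": 3, "revision": "example"}
    # Each entry is [term, "freq", value]; here the value is the word's rank.
    entries = [[word, "freq", rank] for word, rank in ranks.items()]
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("index.json", json.dumps(index, ensure_ascii=False))
        archive.writestr(
            "term_meta_bank_1.json", json.dumps(entries, ensure_ascii=False)
        )
    return buffer.getvalue()


# Hypothetical title and ranks, for illustration only.
data = build_frequency_zip("Example Frequency", {"する": 1, "猫": 1234})
print(len(data), "bytes")
```

Writing `data` to a file ending in `.zip` should give an archive that can be selected in the import dialog described above.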
