Add the source used for this system #1

hugolpz · 2017-03-16T16:44:54Z

What is the source? Unihan, CJKlib, Moedict, ...
How many characters are covered?

Thanks for this project!

peterolson · 2017-03-16T19:34:30Z

The dictionary used is CC-CEDICT and whatever node-pinyin uses behind the scenes. I'm not sure exactly how many characters are covered; I'll have to investigate this later.

hugolpz · 2017-03-17T10:12:01Z

According to node-pinyin's Readme.md#Source:

https://code.google.com/archive/p/chinese-character-2-pinyin/
maybe other pinyin sources listed as well (IME)

Strictly speaking, node-pinyin's data is in /tools/dict2.js. After cleanup, there are 24449 character/phonetic pairs, which looks pretty much like the UNIHAN data, currently at 25500 entries.

node-pinyin's data format doesn't suit linguistic studies though, as there can be several phonetic entries paired with the same character, without any prioritization (e.g. by frequency). It therefore fits IME needs but not linguistic needs.
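
To make the prioritization point concrete, here is a minimal sketch in Python. It assumes a plain mapping from character to a list of readings, which is not node-pinyin's actual file format, and the sample entries and frequency values are made up for illustration only; the function names are hypothetical as well.

```python
# Hypothetical sample data: NOT taken from node-pinyin, CC-CEDICT, or Unihan.
# Each character maps to its readings in arbitrary (unprioritized) order.
readings = {
    "行": ["háng", "xíng"],
    "乐": ["yuè", "lè"],
    "好": ["hào", "hǎo"],
    "人": ["rén"],
}

# Made-up relative frequencies per (character, reading), for illustration only.
reading_freq = {
    ("行", "xíng"): 0.8, ("行", "háng"): 0.2,
    ("乐", "lè"): 0.7, ("乐", "yuè"): 0.3,
    ("好", "hǎo"): 0.9, ("好", "hào"): 0.1,
    ("人", "rén"): 1.0,
}


def count_pairs(table):
    """Count character/phonetic pairs (one per reading of each character)."""
    return sum(len(r) for r in table.values())


def ambiguous(table):
    """Return only the characters that have more than one reading."""
    return {char: r for char, r in table.items() if len(r) > 1}


def prioritize(table, freq):
    """Sort each character's readings from most to least frequent."""
    return {
        char: sorted(r, key=lambda reading: -freq.get((char, reading), 0.0))
        for char, r in table.items()
    }


ranked = prioritize(readings, reading_freq)
print(count_pairs(readings), "pairs,", len(ambiguous(readings)), "ambiguous characters")
print({char: r[0] for char, r in ranked.items()})
```

With a ranking like this, the first reading of each character can serve as a default for linguistic use, while an IME can still offer the full list.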

hugolpz mentioned this issue Mar 17, 2017

Data gathering parlr/ruby-font-creator#19

Closed

8 tasks
