Add the source used for this system #1

Open
2 tasks
hugolpz opened this issue Mar 16, 2017 · 2 comments

Comments

@hugolpz

hugolpz commented Mar 16, 2017

  • What is the source? Unihan, CJKlib, Moedict, ...
  • How many characters are covered?

Thanks for this project!

@peterolson
Owner

The dictionary used is CC-CEDICT, plus whatever node-pinyin uses behind the scenes. I'm not sure exactly how many characters are covered; I'll have to investigate this later.

@hugolpz
Author

hugolpz commented Mar 17, 2017

According to node-pinyin's Readme.md#Source

Strictly speaking, node-pinyin's data is in /tools/dict2.js. After cleanup, there are 24449 character/phonetic pairs, which looks pretty much like the UNIHAN data, currently at 25500 entries.
screenshot from 2017-03-17 11-08-36

screenshot from 2017-03-17 11-07-40

node-pinyin's data format doesn't suit linguistic studies though, as several phonetic entries can pair with the same character, without any prioritization (i.e. by frequency). It therefore fits IME needs but not linguistic needs.

screenshot from 2017-03-17 11-11-28
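
To make the prioritization point concrete, here is a minimal sketch of how character/phonetic pairs could be counted and how multiple readings could be ranked by frequency. Everything in it is an assumption for illustration: the `READINGS` and `FREQ` tables are tiny hand-made samples with invented counts, and the actual layout of node-pinyin's /tools/dict2.js may well differ.

```python
# Hypothetical sample: character -> candidate pinyin readings, unordered.
# This is NOT node-pinyin's real data format, only an illustration.
READINGS = {
    "行": ["háng", "xíng"],
    "好": ["hào", "hǎo"],
    "人": ["rén"],
}

# Invented frequency counts per (character, reading) pair.
# A real study would take these from a corpus.
FREQ = {
    ("行", "xíng"): 80,
    ("行", "háng"): 20,
    ("好", "hǎo"): 95,
    ("好", "hào"): 5,
    ("人", "rén"): 100,
}


def count_pairs(readings):
    """Count character/phonetic pairs (one pair per reading)."""
    return sum(len(candidates) for candidates in readings.values())


def ambiguous_characters(readings):
    """Return the characters that have more than one reading."""
    return {ch for ch, candidates in readings.items() if len(candidates) > 1}


def rank_by_frequency(readings, freq):
    """Order each character's readings from most to least frequent.

    Readings missing from the frequency table are treated as count 0.
    """
    return {
        ch: sorted(candidates, key=lambda r: freq.get((ch, r), 0), reverse=True)
        for ch, candidates in readings.items()
    }


ranked = rank_by_frequency(READINGS, FREQ)
print(count_pairs(READINGS), "pairs")
print(sorted(ambiguous_characters(READINGS)), "have several readings")
print({ch: candidates[0] for ch, candidates in ranked.items()})
```

An IME can offer every candidate reading, so an unordered list is enough there. A linguistic study usually needs the ranked form, or at least the first entry of each ranked list as the default reading.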
