possible to add a custom lookup dict for characters_to_jyutping #37

Open
raymond00000 opened this issue Jan 26, 2023 · 3 comments

@raymond00000

Describe the bug
According to the documentation at https://pycantonese.org/jyutping.html, the data sources used by characters_to_jyutping are (i) the HKCanCor corpus data included in the PyCantonese library, and (ii) the rime-cantonese data.

The issue is that at least one character, 到, seems to get an incorrect Jyutping result when converted, both on its own and inside the word 感到.

To reproduce
pycantonese.characters_to_jyutping('到')
[('到', 'dou2')]
pycantonese.characters_to_jyutping('感到')
[('感到', 'gam2dou2')]
pycantonese.characters_to_jyutping('到底')
[('到底', 'dou3dai2')]

Expected behavior
According to https://humanum.arts.cuhk.edu.hk/Lexis/lexi-can/, 到 should be dou3, so the expected results are:
pycantonese.characters_to_jyutping('到')
[('到', 'dou3')]
pycantonese.characters_to_jyutping('感到')
[('感到', 'gam2dou3')]
pycantonese.characters_to_jyutping('到底')
[('到底', 'dou3dai2')]

Is there any way to resolve this, so that pycantonese.characters_to_jyutping returns dou3 for 到 and gam2dou3 for 感到?
Thanks!

@jacksonllee
Owner

Hi, sorry for not replying earlier. Between rime-cantonese and HKCanCor, the current code prefers the rime-cantonese data in case the two data sources don't agree. I'll have to dig into what the included rime-cantonese data looks like. Maybe the upstream rime-cantonese data has been updated and I could just use the updated data, or I could override these known cases. Thank you for reporting this!
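To illustrate the precedence and override idea described in this comment, here is a minimal sketch. It is not PyCantonese's actual code: the function name `merge_pronunciations`, the dictionary layout, and the toy entries are all assumptions made for illustration. It only shows how a higher-priority source can shadow a lower-priority one, and how a small table of known corrections could be applied last.

```python
def merge_pronunciations(lower_priority, higher_priority, overrides=None):
    """Merge word-to-Jyutping mappings; later sources win on conflict.

    Hypothetical helper, not part of PyCantonese.
    """
    merged = dict(lower_priority)    # start from the lower-priority source
    merged.update(higher_priority)   # preferred source replaces conflicting entries
    merged.update(overrides or {})   # known corrections are applied last
    return merged


# Toy data for illustration only; these are not the real corpus contents.
hkcancor_like = {"到": "dou3"}
rime_like = {"到": "dou2", "到底": "dou3dai2"}
known_fixes = {"到": "dou3", "感到": "gam2dou3"}

lookup = merge_pronunciations(hkcancor_like, rime_like, known_fixes)
print(lookup["到"], lookup["感到"], lookup["到底"])
```

Without `known_fixes`, the toy entry for 到 would resolve to dou2, because the preferred source wins. The override table is one possible shape for the custom lookup dict asked about in the issue title.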

@laubonghaudoi

laubonghaudoi commented Mar 30, 2023

I checked the rime-cantonese data. At least for 感到 and 到底 in word.csv, the pronunciations are gam2 dou3 and dou3 dai2, which are correct.

@jacksonllee
Owner

@laubonghaudoi Ah, I had no idea you guys had set up the CanCLID/rime-cantonese-upstream repo! Now I also see the char.csv file with this:

到,dou2,常用,,,
到,dou3,常用,,,

For my purposes, I'd need an automatic way to tell which char (or word, if this happens in word.csv) to pick for its jyutping. Is it safe to always choose the last one? Or is there another lookup or something?
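Whichever rule ends up being right, the characters affected can at least be listed automatically. The sketch below does not decide which reading to pick. It assumes, based only on the two rows quoted above, that the first column is the character and the second is the Jyutping, and that there is no header row; the real char.csv may differ. The function names are hypothetical.

```python
import csv
import io
from collections import defaultdict

# The two rows quoted in this comment, used as sample input.
SAMPLE = "到,dou2,常用,,,\n到,dou3,常用,,,\n"


def readings_by_char(csv_text):
    """Collect every Jyutping reading listed for each character, in file order."""
    readings = defaultdict(list)
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        char, jyutping = row[0], row[1]  # assumed column order
        if jyutping not in readings[char]:
            readings[char].append(jyutping)
    return dict(readings)


def ambiguous_chars(readings):
    """Keep only the characters that have more than one reading."""
    return {char: prons for char, prons in readings.items() if len(prons) > 1}


print(ambiguous_chars(readings_by_char(SAMPLE)))
# prints {'到': ['dou2', 'dou3']}
```

A list like this could then be reviewed by hand, or checked against any tie-breaking rule the rime-cantonese maintainers confirm.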
