possible to add a custom lookup dict for characters_to_jyutping #37

raymond00000 · 2023-01-26T07:51:54Z

Describe the bug
From the documentation (https://pycantonese.org/jyutping.html), I understand that characters_to_jyutping draws on two data sources:
(i) the HKCanCor corpus data included in the PyCantonese library, and (ii) the rime-cantonese data.

The issue I found is that at least one character, when converted to Jyutping, seems to give an incorrect result.

To reproduce
pycantonese.characters_to_jyutping('到')
[('到', 'dou2')]
pycantonese.characters_to_jyutping('感到')
[('感到', 'gam2dou2')]
pycantonese.characters_to_jyutping('到底')
[('到底', 'dou3dai2')]

Expected behavior
According to this CUHK reference (https://humanum.arts.cuhk.edu.hk/Lexis/lexi-can/), 到 should be dou3, so the expected results are:
pycantonese.characters_to_jyutping('到')
[('到', 'dou3')]
pycantonese.characters_to_jyutping('感到')
[('感到', 'gam2dou3')]
pycantonese.characters_to_jyutping('到底')
[('到底', 'dou3dai2')]

Is there any way to resolve this, so that pycantonese.characters_to_jyutping returns dou3 for 到 and 感到?
Thanks!

jacksonllee · 2023-03-30T13:30:19Z

Hi, sorry for not replying earlier. Between rime-cantonese and HKCanCor, the current code prefers the rime-cantonese data in case the two data sources don't agree. I'll have to dig into what the included rime-cantonese data looks like. Maybe the upstream rime-cantonese data has been updated and I could just use the updated data, or I could override these known cases. Thank you for reporting this!
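
As a rough illustration of the "override these known cases" idea, here is a minimal sketch that post-processes a list of (word, jyutping) pairs of the shape shown in the report. The names KNOWN_OVERRIDES and apply_overrides are hypothetical and are not part of PyCantonese; this is not how the library resolves conflicts internally, only one way a caller might patch results in the meantime.

```python
from typing import Dict, List, Optional, Tuple

# A (word, jyutping) pair, as in the output shown in the report.
# The jyutping is typed as Optional in case a word has no known reading.
Pair = Tuple[str, Optional[str]]

# Hypothetical override table (an assumption, not a PyCantonese feature).
# The two entries are the corrections requested in this issue.
KNOWN_OVERRIDES: Dict[str, str] = {
    "到": "dou3",
    "感到": "gam2dou3",
}


def apply_overrides(pairs: List[Pair], overrides: Dict[str, str]) -> List[Pair]:
    """Replace the jyutping of any word that has an entry in `overrides`.

    Only whole segments are matched: a word containing 到 that is not
    itself a key in `overrides` is returned unchanged.
    """
    return [(word, overrides.get(word, jyutping)) for word, jyutping in pairs]


# The three results reported above, collected into one list.
reported = [("到", "dou2"), ("感到", "gam2dou2"), ("到底", "dou3dai2")]
print(apply_overrides(reported, KNOWN_OVERRIDES))
# [('到', 'dou3'), ('感到', 'gam2dou3'), ('到底', 'dou3dai2')]
```

In practice the input list would come from pycantonese.characters_to_jyutping(...). Because the lookup is by whole segment, this only helps for words that are listed explicitly.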

laubonghaudoi · 2023-03-30T21:33:07Z

I checked the rime-cantonese data. At least for 感到 and 到底 in word.csv, the pronunciations are gam2 dou3 and dou3 dai2, which are correct.

jacksonllee · 2023-03-31T00:06:49Z

@laubonghaudoi Ah, I had no idea you guys had set up the CanCLID/rime-cantonese-upstream repo! Now I also see the char.csv file with this:

到,dou2,常用,,,
到,dou3,常用,,,

For my purposes, I'd need an automatic way to tell which char (or word, if this happens in word.csv) to pick for its jyutping. Is it safe to always choose the last one? Or is there another lookup or something?
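
To make the ambiguity concrete, here is a small sketch that groups the readings per character and lists the characters with more than one candidate. It assumes only what the two rows above show: the character is in the first column and the Jyutping in the second. The meaning of the remaining columns, and whether row order carries any preference, are not assumed. The function names are hypothetical.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List

# The two rows quoted above, used as sample input.
SAMPLE = "到,dou2,常用,,,\n到,dou3,常用,,,\n"


def readings_by_char(text: str) -> Dict[str, List[str]]:
    """Collect the Jyutping readings for each character, in file order."""
    readings: Dict[str, List[str]] = defaultdict(list)
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        char, jyutping = row[0], row[1]
        if jyutping not in readings[char]:
            readings[char].append(jyutping)
    return dict(readings)


def ambiguous(readings: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Keep only the characters that have more than one reading."""
    return {char: jps for char, jps in readings.items() if len(jps) > 1}


print(ambiguous(readings_by_char(SAMPLE)))
# {'到': ['dou2', 'dou3']}
```

This only surfaces the cases that need a decision. It does not settle whether the last row is the safe choice.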

raymond00000 added the bug label Jan 26, 2023
