The default reading of 「湯」 is a deprecated pronunciation #5

Open
AlienKevin opened this issue Aug 8, 2023 · 3 comments

Comments

@AlienKevin

>>> ToJyutping.get_jyutping_text("湯")
'soeng1'

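Upstream marks soeng1 as a deprecated (棄用) reading of 「湯」, so why does ToJyutping output soeng1 by default? The upstream entries, labeled 罕見 (rare), 棄用 (deprecated), and 預設 (default), are: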

湯,joeng4,罕見,,,
湯,soeng1,棄用,,,
湯,tong1,預設,,,
湯,tong3,罕見,,,

https://github.com/CanCLID/rime-cantonese-upstream/blob/ba155365c8671ca51848224dec933d5b91091d05/char.csv#L17524C1-L17528C1
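
To illustrate what honoring the upstream labels could look like, here is a minimal sketch that picks a default reading from the rows quoted above. The ranking in `LABEL_RANK` and the helper `default_reading` are hypothetical and only for illustration; they are not ToJyutping's actual logic, and the column layout is assumed from the excerpt.

```python
import csv
import io

# Rows copied from the upstream char.csv excerpt quoted above.
ROWS = """\
湯,joeng4,罕見,,,
湯,soeng1,棄用,,,
湯,tong1,預設,,,
湯,tong3,罕見,,,
"""

# Hypothetical ranking (lower is preferred); an assumption, not ToJyutping's rule.
LABEL_RANK = {"預設": 0, "": 1, "罕見": 2, "棄用": 3}


def default_reading(char, csv_text=ROWS):
    """Return the best-ranked reading for `char`, or None if it is not listed."""
    candidates = []
    for row in csv.reader(io.StringIO(csv_text)):
        # Assumed columns: character, jyutping, label, ...
        if len(row) >= 3 and row[0] == char:
            candidates.append((LABEL_RANK.get(row[2], 1), row[1]))
    if not candidates:
        return None
    return min(candidates)[1]


print(default_reading("湯"))  # tong1
```

With a ranking like this, the deprecated soeng1 would only be chosen if no other reading were listed.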

@graphemecluster
Member

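That's because it is still using an old word list in which the priority order of pronunciations is almost random, and nobody has had time to fix it...
I actually wrote CanCLID/to-jyutping#3 a while ago, but it has not been tested enough, so I don't dare merge it carelessly.
If you have time, you could help us review and test it, and port the changes over to the Python version 🙇🏻‍♂️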

@AlienKevin
Author

AlienKevin commented Aug 10, 2023

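I tested the current ToJyutping against the pronunciations of 33,043 example sentences from words.hk (粵典); the resulting Syllable Error Rate is 7.33%. I think we can improve accuracy by analyzing the types of errors, and this test set can also serve as a regression test, so that future updates to the word list or changes to the ranking algorithm don't introduce new problems. I can create a new repo called something like to-jyutping-tests, use every sentence on which words.hk and ToJyutping currently agree as the regression baseline, and check whether your PR breaks any sentence that passed before. In the future, both the Python and JS versions could reference the same tests.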
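
The thread does not say exactly how the Syllable Error Rate was computed. A minimal sketch, assuming the common definition of edit distance over syllable tokens divided by the number of reference syllables, could look like this. The function names are hypothetical, and the two sample pairs are taken from the error list in this comment.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two lists of syllables."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def syllable_error_rate(pairs):
    """pairs: iterable of (reference, hypothesis) space-separated Jyutping strings."""
    errors = total = 0
    for ref, hyp in pairs:
        ref_syl, hyp_syl = ref.split(), hyp.split()
        errors += edit_distance(ref_syl, hyp_syl)
        total += len(ref_syl)
    return errors / total if total else 0.0


# (words.hk reference, ToJyutping output)
pairs = [
    ("lin4 au5 sau3 juk6 tong1", "lin4 au5 sau3 juk6 soeng1"),
    ("gam3 do1 jan4 ge2", "gam3 do1 jan4 ge3"),
]
ser = syllable_error_rate(pairs)
print(f"{ser:.2%}")  # 22.22%
```

Here each pair differs in one syllable, giving 2 errors over 9 reference syllables.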

From a rough look at the test results, I grouped the errors into roughly six types:

  1. Common words with a different pronunciation
  • 嘞 laak3 vs la3
  • 呢 ni1 vs nei1/ne1
  2. Tone issues with polyphonic characters
  • 反轉
    ToJyutping: faan2 zyun3 gin6 saam1 zoek6
    words.hk: faan2 zyun3 gin5 saam1 zoek3
  • 要穩
    ToJyutping: haa6 pun2 jiu3 wan2
    words.hk: haa6 pun4 jiu3 wan2
  3. Changed-tone (變調) issues
  • 咁多人
    ToJyutping: gam3 do1 jan4 ge3
    words.hk: gam3 do1 jan4 ge2
  • 子女免税
    ToJyutping: zi2 neoi5 min5 seoi3 aak6
    words.hk: zi2 neoi5 min5 seoi3 aak2
  4. a vs aa
  • 疑心暗鬼
    ToJyutping: ji4 sam1 saang1 am3 gwai2
    words.hk: ji4 sam1 sang1 am3 gwai2
  5. Rare readings assigned by mistake
  • 蓮藕瘦肉
    ToJyutping: lin4 au5 sau3 juk6 soeng1
    words.hk: lin4 au5 sau3 juk6 tong1
  6. Typos in words.hk
  • 宣讀員
    ToJyutping: zaan3 ci4 syun1 duk6 jyun4
    words.hk: zaan3 cin4 syun1 duk6 jyun4
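
Some of these categories can be roughly separated automatically. The sketch below is a hypothetical heuristic, not the analysis actually used here: it labels a mismatched syllable pair as a tone-only difference, an a/aa difference, or something else. It cannot tell polyphonic-character tone errors from changed-tone errors, nor rare readings from words.hk typos; those would still need manual review.

```python
import re


def split_syllable(syl):
    """Split 'tong1' into ('tong', '1'); return (syl, '') if there is no tone digit."""
    m = re.fullmatch(r"([a-z]+)([1-6])", syl)
    return (m.group(1), m.group(2)) if m else (syl, "")


def classify(ref, hyp):
    """Coarsely classify a (words.hk, ToJyutping) syllable pair."""
    if ref == hyp:
        return "match"
    ref_base, ref_tone = split_syllable(ref)
    hyp_base, hyp_tone = split_syllable(hyp)
    if ref_base == hyp_base:
        return "tone"  # same segments, different tone
    if ref_tone == hyp_tone and ref_base.replace("aa", "a") == hyp_base.replace("aa", "a"):
        return "a_vs_aa"  # differs only in vowel length a/aa
    return "other"  # e.g. rare reading or reference typo


print(classify("pun4", "pun2"))      # tone
print(classify("sang1", "saang1"))   # a_vs_aa
print(classify("tong1", "soeng1"))   # other
```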

More results

@AlienKevin
Author

@graphemecluster I added a draft PR #6 which addresses some of the most pressing issues. See the PR for a summary of the improvements.
