Unidic design flaw #118

Open
wareya opened this issue Aug 28, 2017 · 4 comments


wareya commented Aug 28, 2017

UniDic's lex data doesn't carry enough information for the Viterbi algorithm to distinguish, in context, words that share a surface form and word type but differ in reading. So お父さん is always interpreted as お・ちち・さん, instead of お・とう・さん like it should be.

父,5142,5142,3860,名詞,普通名詞,一般,*,*,*,チチ,父,父,チチ,父,チチ,和,*,*,*,*

父,5142,5142,4656,名詞,普通名詞,一般,*,*,*,トウ,父,父,トー,父,トー,和,*,*,*,*

They're otherwise identical, but the ちち reading has a lower cost (3860 vs. 4656), so it always wins when the word is written in kanji. Both entries also share the same left and right context IDs (5142), so connection costs can't separate them either. Basically, UniDic's segment features don't have a way to distinguish these. It's easy to write a script that looks for segments that are identical in surface form and feature list to see what other problematic matches there are.
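A minimal sketch of such a script, in Python. The column positions are an assumption read off the two 父 entries quoted above (surface, left ID, right ID, cost, six part-of-speech and conjugation fields, then the reading); a different UniDic release may lay its lex file out differently, and `find_ambiguous` is just a name picked for illustration.

```python
import csv
from collections import defaultdict

# Assumed column layout, read off the two 父 entries quoted above:
# 0 surface, 1 left ID, 2 right ID, 3 cost,
# 4-9 part-of-speech and conjugation fields, 10 reading.
KEY_FEATURES = slice(4, 10)
READING = 10


def find_ambiguous(lines):
    """Return groups of rows that share surface, context IDs and
    part-of-speech fields but carry more than one reading."""
    groups = defaultdict(list)
    for row in csv.reader(lines):
        if len(row) <= READING:
            continue  # too short to follow the assumed layout
        key = (row[0], row[1], row[2], tuple(row[KEY_FEATURES]))
        groups[key].append(row)
    return [rows for rows in groups.values()
            if len({r[READING] for r in rows}) > 1]


# The two entries from this issue, used as sample input.
SAMPLE = [
    "父,5142,5142,3860,名詞,普通名詞,一般,*,*,*,チチ,父,父,チチ,父,チチ,和,*,*,*,*",
    "父,5142,5142,4656,名詞,普通名詞,一般,*,*,*,トウ,父,父,トー,父,トー,和,*,*,*,*",
]

for rows in find_ambiguous(SAMPLE):
    # Sorted by cost: the first reading listed is the one that wins.
    for row in sorted(rows, key=lambda r: int(r[3])):
        print(row[0], row[READING], row[3])
```

To scan a whole dictionary, pass an open file instead of `SAMPLE`, since `csv.reader` accepts any iterable of lines.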

This is basically impossible to fix on Kuromoji's side without adding a list of segments that act differently than their features indicate, which would be ridiculous. On the other hand, one of Kuromoji's implicit goals is to not be worse than other morphological analyzers, so this is a problem worth posting about.

To gloss over this problem, I added a bunch of お父 etc. entries to my user dictionary, with the お・御 prefix prepended to each word (for unidic-kanaaccent STAGING):

おとう,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オトウ,御父,おとう,オトー,おとう,オトー,和,*,*,*,*,オトウ,オトウ,オトウ,オトウ,*,*,2,*,*
お父,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オトウ,御父,お父,オトー,お父,オトー,和,*,*,*,*,オトウ,オトウ,オトウ,オトウ,*,*,2,*,*
御父,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オトウ,御父,御父,オトー,御父,オトー,和,*,*,*,*,オトウ,オトウ,オトウ,オトウ,*,*,2,*,*

おかあ,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オカア,御母,おかあ,オカー,おかあ,オカー,和,*,*,*,*,オカア,オカア,オカア,オカア,*,*,2,*,*
お母,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オカア,御母,お母,オカー,お母,オカー,和,*,*,*,*,オカア,オカア,オカア,オカア,*,*,2,*,*
御母,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オカア,御母,御母,オカー,御母,オカー,和,*,*,*,*,オカア,オカア,オカア,オカア,*,*,2,*,*

おにい,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オニイ,御兄,おにい,オニー,おにい,オニー,和,*,*,*,*,オニイ,オニイ,オニイ,オニイ,*,*,2,*,*
お兄,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オニイ,御兄,お兄,オニー,お兄,オニー,和,*,*,*,*,オニイ,オニイ,オニイ,オニイ,*,*,2,*,*
御兄,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オニイ,御兄,御兄,オニー,御兄,オニー,和,*,*,*,*,オニイ,オニイ,オニイ,オニイ,*,*,2,*,*

おねえ,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オネエ,御姉,おねえ,オネー,おねえ,オネー,和,*,*,*,*,オネエ,オネエ,オネエ,オネエ,*,*,2,*,*
お姉,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オネエ,御姉,お姉,オネー,お姉,オネー,和,*,*,*,*,オネエ,オネエ,オネエ,オネエ,*,*,2,*,*
御姉,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オネエ,御姉,御姉,オネー,御姉,オネー,和,*,*,*,*,オネエ,オネエ,オネエ,オネエ,*,*,2,*,*

お姐,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オネエ,御姐,お姐,オネー,お姐,オネー,和,*,*,*,*,オネエ,オネエ,オネエ,オネエ,*,*,2,*,*
御姐,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オネエ,御姐,御姐,オネー,御姐,オネー,和,*,*,*,*,オネエ,オネエ,オネエ,オネエ,*,*,2,*,*

おばあ,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オバア,御婆,おばあ,オバー,おばあ,オバー,和,*,*,*,*,オバア,オバア,オバア,オバア,*,*,2,*,*
お婆,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オバア,御婆,お婆,オバー,お婆,オバー,和,*,*,*,*,オバア,オバア,オバア,オバア,*,*,2,*,*
御婆,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オバア,御婆,御婆,オバー,御婆,オバー,和,*,*,*,*,オバア,オバア,オバア,オバア,*,*,2,*,*


おじい,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オジイ,御爺,おじい,オジー,おじい,オジー,和,*,*,*,*,オジイ,オジイ,オジイ,オジイ,*,*,2,*,*
お爺,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オジイ,御爺,お爺,オジー,お爺,オジー,和,*,*,*,*,オジイ,オジイ,オジイ,オジイ,*,*,2,*,*
御爺,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オジイ,御爺,御爺,オジー,御爺,オジー,和,*,*,*,*,オジイ,オジイ,オジイ,オジイ,*,*,2,*,*

(The weights are for illustration; I think they're too high for these entries to win in all intended cases.)
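Since the entries above all follow one pattern, they could be generated rather than typed by hand. The following is only a sketch: the template is copied field for field from the entries in this comment, `make_entries` is a made-up helper name, and the default cost of 8000 is the same illustrative weight, not a tuned value.

```python
# Template copied from the user-dictionary entries above. Only the surface,
# lemma, reading and pronunciation vary between entries.
TEMPLATE = (
    "{surface},5142,5142,{cost},名詞,普通名詞,一般,*,*,*,"
    "{reading},{lemma},{surface},{pron},{surface},{pron},和,*,*,*,*,"
    "{reading},{reading},{reading},{reading},*,*,2,*,*"
)


def make_entries(surfaces, lemma, reading, pron, cost=8000):
    """Build one user-dictionary line per surface form."""
    return [
        TEMPLATE.format(surface=s, lemma=lemma, reading=reading,
                        pron=pron, cost=cost)
        for s in surfaces
    ]


# Reproduces the three お父 entries from this comment.
for line in make_entries(["おとう", "お父", "御父"], "御父", "オトウ", "オトー"):
    print(line)
```

Lowering `cost` is the knob to turn if the prefixed entries lose to お + 父 in some contexts.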


fasiha commented Jan 18, 2018

In your particular example, if I ask for the two best results, the second one has とう instead of ちち. Here I'm using MeCab/UniDic, but it should be the same with Kuromoji:

➜ echo お父さん | mecab -d /usr/local/lib/mecab/dic/unidic -N2
お	オ	オ	御	接頭辞
父	チチ	チチ	父	名詞-普通名詞-一般
さん	サン	サン	さん	接尾辞-名詞的-一般
EOS
お	オ	オ	御	接頭辞
父	トー	トウ	父	名詞-普通名詞-一般
さん	サン	サン	さん	接尾辞-名詞的-一般
EOS

Isn't this one of the big reasons why these parsers give you N-best results, N>1?
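As a rough sketch of how N-best output like the above could be consumed, the Python below splits the text into one candidate per `EOS` and picks the first candidate in which a given surface carries a given reading. It assumes whitespace-separated columns as in the output shown here; `split_nbest` and `pick_by_reading` are names made up for illustration, and knowing which reading is wanted still has to come from somewhere outside the parser.

```python
def split_nbest(mecab_output):
    """Split N-best text output into one list of tokens per candidate.

    Each token is the list of columns on its line; candidates end at EOS.
    """
    candidates, current = [], []
    for line in mecab_output.splitlines():
        line = line.strip()
        if not line:
            continue
        if line == "EOS":
            candidates.append(current)
            current = []
        else:
            current.append(line.split())
    return candidates


def pick_by_reading(candidates, surface, reading):
    """Return the first candidate where `surface` has `reading` in any
    of its other columns, or None if no candidate qualifies."""
    for tokens in candidates:
        for columns in tokens:
            if columns[0] == surface and reading in columns[1:]:
                return tokens
    return None


# The two candidates from the MeCab run above.
OUTPUT = (
    "お\tオ\tオ\t御\t接頭辞\n"
    "父\tチチ\tチチ\t父\t名詞-普通名詞-一般\n"
    "さん\tサン\tサン\tさん\t接尾辞-名詞的-一般\n"
    "EOS\n"
    "お\tオ\tオ\t御\t接頭辞\n"
    "父\tトー\tトウ\t父\t名詞-普通名詞-一般\n"
    "さん\tサン\tサン\tさん\t接尾辞-名詞的-一般\n"
    "EOS\n"
)

candidates = split_nbest(OUTPUT)
chosen = pick_by_reading(candidates, "父", "トウ")
```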

Author

wareya commented Apr 2, 2018

UniDic 2.3.0 solves this problem for this specific set of words.

Member

cmoen commented Apr 2, 2018

Could you indicate more precisely what you mean by "unidic 2.3.0"? Do you have a URL you can share? Thanks.

Author

wareya commented Apr 2, 2018

http://unidic.ninjal.ac.jp/

2018/03/29
Released v2.3.0 (beta version) of UniDic for contemporary Japanese.
The alpha version will be released as a full package in early April.

It's only listed on the "back-number" page:

http://unidic.ninjal.ac.jp/back_number
