Unidic design flaw #118

wareya · 2017-08-28T06:02:51Z

UniDic's lexicon data doesn't give the Viterbi algorithm enough information to distinguish, in context, between words that share a surface form and word type but have different readings. So お父さん is always analyzed as お・ちち・さん instead of お・とう・さん, as it should be.

父,5142,5142,3860,名詞,普通名詞,一般,*,*,*,チチ,父,父,チチ,父,チチ,和,*,*,*,*

父,5142,5142,4656,名詞,普通名詞,一般,*,*,*,トウ,父,父,トー,父,トー,和,*,*,*,*

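They're otherwise identical, but the ちち reading has a lower cost (3860 vs. 4656), so it always wins when the word is written in kanji. Both entries also share the same left and right context IDs (5142), so the connection costs to neighboring segments are the same for either reading and can never tip the balance. Basically, UniDic's segment features don't have a way to distinguish these. It's easy to write a script that looks for segments that are identical in surface form, context IDs, and part-of-speech features, and see what problematic matches there are.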
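A minimal sketch of such a script, in Python. It assumes the column layout seen in the two entries quoted above (surface, left ID, right ID, cost, six part-of-speech/conjugation fields, then the reading); the column positions and the function name `find_ambiguous` are my own assumptions for illustration, and a real run would read the dictionary's lex CSV instead of the inline sample.

```python
import csv
from collections import defaultdict

# Column layout assumed from the two entries quoted in this issue:
# 0 surface, 1 left ID, 2 right ID, 3 cost, 4-9 POS/conjugation, 10 reading.
POS_FIELDS = slice(4, 10)
READING = 10

def find_ambiguous(lines):
    """Return entries that differ in reading but look identical to the lattice."""
    groups = defaultdict(list)
    for row in csv.reader(lines):
        if len(row) <= READING:
            continue  # skip short or malformed rows
        key = (row[0], row[1], row[2], tuple(row[POS_FIELDS]))
        groups[key].append((int(row[3]), row[READING]))
    # Keep only groups with more than one distinct reading, cheapest first.
    return {
        key: sorted(entries)
        for key, entries in groups.items()
        if len({reading for _, reading in entries}) > 1
    }

# The two entries from this issue, used as sample input.
sample = """\
父,5142,5142,3860,名詞,普通名詞,一般,*,*,*,チチ,父,父,チチ,父,チチ,和,*,*,*,*
父,5142,5142,4656,名詞,普通名詞,一般,*,*,*,トウ,父,父,トー,父,トー,和,*,*,*,*
"""

ambiguous = find_ambiguous(sample.splitlines())
for (surface, left_id, right_id, pos), entries in ambiguous.items():
    print(surface, left_id, right_id, "-".join(pos), entries)
```

In each reported group the first entry (lowest cost) is the one that always wins; every other reading in the group can only show up in N-best output.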

This is basically impossible to fix on kuromoji's side without adding a list of segments that act differently than their features indicate, which would be ridiculous. On the other hand, one of kuromoji's implicit goals is to not be worse than other morphological analyzers, so this is a problem worth posting about.

To gloss over this problem, I added a bunch of entries such as お父 to my user dictionary, with the お/御 prefix folded into the entry itself (these are for unidic-kanaaccent STAGING):

おとう,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オトウ,御父,おとう,オトー,おとう,オトー,和,*,*,*,*,オトウ,オトウ,オトウ,オトウ,*,*,2,*,*
お父,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オトウ,御父,お父,オトー,お父,オトー,和,*,*,*,*,オトウ,オトウ,オトウ,オトウ,*,*,2,*,*
御父,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オトウ,御父,御父,オトー,御父,オトー,和,*,*,*,*,オトウ,オトウ,オトウ,オトウ,*,*,2,*,*

おかあ,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オカア,御母,おかあ,オカー,おかあ,オカー,和,*,*,*,*,オカア,オカア,オカア,オカア,*,*,2,*,*
お母,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オカア,御母,お母,オカー,お母,オカー,和,*,*,*,*,オカア,オカア,オカア,オカア,*,*,2,*,*
御母,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オカア,御母,御母,オカー,御母,オカー,和,*,*,*,*,オカア,オカア,オカア,オカア,*,*,2,*,*

おにい,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オニイ,御兄,おにい,オニー,おにい,オニー,和,*,*,*,*,オニイ,オニイ,オニイ,オニイ,*,*,2,*,*
お兄,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オニイ,御兄,お兄,オニー,お兄,オニー,和,*,*,*,*,オニイ,オニイ,オニイ,オニイ,*,*,2,*,*
御兄,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オニイ,御兄,御兄,オニー,御兄,オニー,和,*,*,*,*,オニイ,オニイ,オニイ,オニイ,*,*,2,*,*

おねえ,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オネエ,御姉,おねえ,オネー,おねえ,オネー,和,*,*,*,*,オネエ,オネエ,オネエ,オネエ,*,*,2,*,*
お姉,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オネエ,御姉,お姉,オネー,お姉,オネー,和,*,*,*,*,オネエ,オネエ,オネエ,オネエ,*,*,2,*,*
御姉,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オネエ,御姉,御姉,オネー,御姉,オネー,和,*,*,*,*,オネエ,オネエ,オネエ,オネエ,*,*,2,*,*

お姐,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オネエ,御姐,お姐,オネー,お姐,オネー,和,*,*,*,*,オネエ,オネエ,オネエ,オネエ,*,*,2,*,*
御姐,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オネエ,御姐,御姐,オネー,御姐,オネー,和,*,*,*,*,オネエ,オネエ,オネエ,オネエ,*,*,2,*,*

おばあ,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オバア,御婆,おばあ,オバー,おばあ,オバー,和,*,*,*,*,オバア,オバア,オバア,オバア,*,*,2,*,*
お婆,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オバア,御婆,お婆,オバー,お婆,オバー,和,*,*,*,*,オバア,オバア,オバア,オバア,*,*,2,*,*
御婆,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オバア,御婆,御婆,オバー,御婆,オバー,和,*,*,*,*,オバア,オバア,オバア,オバア,*,*,2,*,*


おじい,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オジイ,御爺,おじい,オジー,おじい,オジー,和,*,*,*,*,オジイ,オジイ,オジイ,オジイ,*,*,2,*,*
お爺,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オジイ,御爺,お爺,オジー,お爺,オジー,和,*,*,*,*,オジイ,オジイ,オジイ,オジイ,*,*,2,*,*
御爺,5142,5142,8000,名詞,普通名詞,一般,*,*,*,オジイ,御爺,御爺,オジー,御爺,オジー,和,*,*,*,*,オジイ,オジイ,オジイ,オジイ,*,*,2,*,*

(The weights are for illustration only; I think they're too high for these entries to win in all the intended cases.)
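The entries above all follow one template, so they could be generated rather than typed by hand. Here is a rough Python sketch that reproduces the three お父 lines. The helper name `make_entries` and its parameters are hypothetical, the field order is simply copied from the entries above without assuming what each column means, and the 8000 cost is the same illustrative weight.

```python
def make_entries(hira, kanji, kana, pron, cost=8000):
    """Build user-dictionary lines for お+kana, お+kanji and 御+kanji spellings.

    hira:  hiragana spelling of the base word, e.g. "とう"
    kanji: kanji spelling of the base word, e.g. "父"
    kana:  katakana reading of the whole entry, e.g. "オトウ"
    pron:  pronunciation of the whole entry, e.g. "オトー"
    """
    lemma = "御" + kanji
    lines = []
    for surface in ("お" + hira, "お" + kanji, "御" + kanji):
        fields = [
            surface, "5142", "5142", str(cost),
            "名詞", "普通名詞", "一般", "*", "*", "*",
            kana, lemma, surface, pron, surface, pron,
            "和", "*", "*", "*", "*",
            kana, kana, kana, kana,
            "*", "*", "2", "*", "*",
        ]
        lines.append(",".join(fields))
    return lines

entries = make_entries("とう", "父", "オトウ", "オトー")
for line in entries:
    print(line)
```

Words like お姐 that only need the kanji spellings would require dropping the first surface form.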

fasiha · 2018-01-18T18:56:11Z

In your particular example, if I ask for the two best results, the second one has とう instead of ちち. Here I'm using MeCab/UniDic, but it should be the same with Kuromoji:

➜ echo お父さん | mecab -d /usr/local/lib/mecab/dic/unidic -N2
お	オ	オ	御	接頭辞
父	チチ	チチ	父	名詞-普通名詞-一般
さん	サン	サン	さん	接尾辞-名詞的-一般
EOS
お	オ	オ	御	接頭辞
父	トー	トウ	父	名詞-普通名詞-一般
さん	サン	サン	さん	接尾辞-名詞的-一般
EOS

Isn't this one of the big reasons why these parsers give you N-best results, N>1?
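To illustrate why the second-best path is exactly the とう one, here is a toy Python sketch. It is not how MeCab or Kuromoji compute N-best internally; it just brute-forces every path through a tiny lattice. The two 父 costs come from the dictionary entries quoted at the top of the issue, the zero costs for お and さん are placeholders, and connection costs are left out because both 父 entries share the same context IDs and so connect identically.

```python
import heapq
from itertools import product

# Each position lists (surface, reading, word cost) candidates.
lattice = [
    [("お", "オ", 0)],                              # placeholder cost
    [("父", "チチ", 3860), ("父", "トウ", 4656)],   # costs from the issue
    [("さん", "サン", 0)],                          # placeholder cost
]

def n_best(lattice, n):
    """Brute-force the n cheapest paths as (total cost, readings) pairs."""
    scored = (
        (sum(cost for _, _, cost in path), [reading for _, reading, _ in path])
        for path in product(*lattice)
    )
    return heapq.nsmallest(n, scored)

for cost, readings in n_best(lattice, 2):
    print(cost, "・".join(readings))
```

With N=1 only the チチ path is returned; with N=2 the トウ path follows it, which matches the `-N2` output above.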

wareya · 2018-04-02T01:43:43Z

UniDic 2.3.0 solves this problem, at least for this specific set of words.

cmoen · 2018-04-02T21:44:27Z

Could you indicate more precisely what you mean by "unidic 2.3.0"? Do you have a URL you can share? Thanks.

wareya · 2018-04-02T21:45:29Z

http://unidic.ninjal.ac.jp/

2018/03/29
Released v2.3.0 (beta version) of UniDic for contemporary Japanese.
The alpha version will be released as a full package in early April.

It's only listed on the "back-number" page:

http://unidic.ninjal.ac.jp/back_number
