MeCab UniDic support #211

Open
Calvin-Xu opened this issue Apr 7, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@Calvin-Xu
Contributor

Calvin-Xu commented Apr 7, 2024

Given the recent discussion in #109 touching on how UniDic may offer better tokenization and possibly morphological analysis (what #210 offers, but following Japanese school grammar), I wonder if it might be good to consider supporting UniDic in Memento again. For an overview of the different tokenizer dictionaries, see https://www.dampfkraft.com/nlp/japanese-tokenizer-dictionaries.html. tl;dr:

Unidic also offers dictionaries for spoken and historical language, so you can use the same tools that work with modern written Japanese on those if you need to.

Despite the recent challenges, since UniDic is the only dictionary actively and manually maintained, and since the maintainers oversee the Universal Dependencies project for Japanese, it's the best choice for a base dictionary at time of writing.

The previous discussions about supporting UniDic were in #35 and #109. The main issues are:

  • UniDic is much larger, which would mean larger bundles and needing Git LFS (if we keep a copy in this repo, as is currently done for IPADic)
  • UniDic has formatting issues that currently crash Memento

@ripose-jp has said that

I did put that ipadic was the only supported mecab dictionary in the readme. The only reason I did that is because unidic was crashing Memento #35 since it doesn't do bounds checking on the split feature string. Not a problem for dictionaries that use ipadic's format like NAIST-jdic, but definitely an issue for unidic.

MeCab dictionaries shouldn't be user changeable in 99% of cases

If just using UniDic can improve the experience, it may be beneficial for UniDic specifically to be supported, since it seems to be the only thing worth using over IPADic. (edit: to clarify, I don't mean it should be bundled / default, due to the large size).

As a side note on tools newer than MeCab (which is no longer maintained): Sudachi's Apache 2.0 license is incompatible with the GPLv2 of this project.

@BigBoyBarney

BigBoyBarney commented Apr 7, 2024

Thank you for this!
I agree that UniDic is the best choice, however, bundling the dictionary by default would most likely be an unwelcome change for most users. There is no need to host UniDic dictionaries in / by Memento, as they're actively maintained by NINJAL. Perhaps there could be a button to fetch the dictionary from NINJAL's site, if the user wishes to do so.

How MeCab works

Just to make sure that everybody reading this issue is on the same page regarding how MeCab works, I will try to concisely explain it:
MeCab segments a text based on the inflections, stems and word forms provided by a dictionary. This means that different dictionaries can segment differently. Additionally, since natural languages aren't perfectly rule-based, a text string can have multiple grammatically correct segmentations. MeCab considers every candidate segmentation and returns only the top-ranked one by default. The number of segmentations returned can be controlled with the -N flag, e.g. -N3 to return the top 3 results. MeCab only truncates the list, so returning more results should incur essentially no performance cost.

Different dictionaries structure their output differently, so whatever Memento implements will have to be dictionary-specific. IPADic is by far the worst one available. Jumandic works most of the time out of the box; UniDic can be made to work every time, but it is currently not a drop-in replacement.


Let's look at 2 examples with 感じたんだ:
Jumandic

echo 感じたんだ | mecab -N2 --dicdir /usr/lib64/mecab/dic/jumandic

感じた	動詞,*,ザ変動詞,タ形,感ずる,かんじた,代表表記:感ずる
んだ	助動詞,*,ナ形容詞,基本形,んだ,んだ,*
EOS

感じた	動詞,*,母音動詞,タ形,感じる,かんじた,補文ト 代表表記:感じる
んだ	助動詞,*,ナ形容詞,基本形,んだ,んだ,*
EOS

The first result from Jumandic returns 感ずる, the second returns 感じる. It is obviously possible to find both of these forms in JMdict just by looking up the stem, but that is currently beside the point. Reading index 4 of each result MeCab returns gives us both verbs, along with the inflections.
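To make the "one lemma per result" idea concrete, here is a minimal sketch that parses MeCab's default text output (surface, a tab, then comma-separated features, with each result closed by an EOS line) and reads one column from the first token of every result. The names Token, splitFeatures, parseNBest and firstTokenField are made up for illustration and are not Memento's code; the plain comma split is only assumed to be adequate for the Jumandic lines shown above.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// One token line from MeCab's default output: "surface<TAB>feature,feature,..."
struct Token
{
    std::string surface;
    std::vector<std::string> features;
};

// Splits a feature string on plain commas (no quote handling).
std::vector<std::string> splitFeatures(const std::string &csv)
{
    std::vector<std::string> fields;
    std::string field;
    std::istringstream stream(csv);
    while (std::getline(stream, field, ','))
    {
        fields.push_back(field);
    }
    return fields;
}

// Groups MeCab's text output into one token list per N-best result.
// Each result is terminated by a line containing only "EOS".
std::vector<std::vector<Token>> parseNBest(const std::string &output)
{
    std::vector<std::vector<Token>> results;
    std::vector<Token> current;
    std::istringstream stream(output);
    std::string line;
    while (std::getline(stream, line))
    {
        if (line == "EOS")
        {
            results.push_back(current);
            current.clear();
            continue;
        }
        const std::size_t tab = line.find('\t');
        if (tab == std::string::npos)
        {
            continue; // blank or malformed line, skip it
        }
        current.push_back({line.substr(0, tab), splitFeatures(line.substr(tab + 1))});
    }
    return results;
}

// Returns the field at `index` of the first token of every result,
// skipping results that are empty or whose first token has too few fields.
std::vector<std::string> firstTokenField(
    const std::vector<std::vector<Token>> &results, std::size_t index)
{
    std::vector<std::string> out;
    for (const std::vector<Token> &tokens : results)
    {
        if (!tokens.empty() && index < tokens.front().features.size())
        {
            out.push_back(tokens.front().features[index]);
        }
    }
    return out;
}
```

Fed the -N2 Jumandic output above, firstTokenField(results, 4) would collect 感ずる from the first result and 感じる from the second, so both could be looked up.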

UniDic

echo 感じたんだ | mecab -N2 --dicdir /usr/lib64/mecab/dic/unidic_spoken

感じ	動詞,一般,*,*,サ行変格,連用形-一般,カンズル,感ずる,感じ,カンジ,感ずる,カンズル,混,*,*,*,*,*,*,用,カンジ,カンズル,カンジ,カンズル,0,C2,M4@1,2106681492382337,7664
た	助動詞,*,*,*,助動詞-タ,連体形-一般,タ,た,た,タ,た,タ,和,*,*,*,*,*,*,助動,タ,タ,タ,タ,*,"動詞%F2@1,形容詞%F4@-2",*,5948916285711041,21642
ん	助詞,準体助詞,*,*,*,*,ノ,の,ん,ン,ん,ン,和,*,*,*,*,*,*,準助,ン,ン,ン,ン,*,"動詞%F2@0,形容詞%F2@-1",*,7968727735869952,28990
だ	助動詞,*,*,*,助動詞-ダ,終止形-一般,ダ,だ,だ,ダ,だ,ダ,和,*,*,*,*,*,*,助動,ダ,ダ,ダ,ダ,*,名詞%F1,*,6299110739157675,22916
EOS

感じ	動詞,一般,*,*,上一段-ザ行,連用形-一般,カンズル,感ずる,感じ,カンジ,感じる,カンジル,混,*,*,*,*,*,*,用,カンジ,カンジル,カンジ,カンジル,0,C2,M4@1,2106672902447745,7664
た	助動詞,*,*,*,助動詞-タ,連体形-一般,タ,た,た,タ,た,タ,和,*,*,*,*,*,*,助動,タ,タ,タ,タ,*,"動詞%F2@1,形容詞%F4@-2",*,5948916285711041,21642
ん	助詞,準体助詞,*,*,*,*,ノ,の,ん,ン,ん,ン,和,*,*,*,*,*,*,準助,ン,ン,ン,ン,*,"動詞%F2@0,形容詞%F2@-1",*,7968727735869952,28990
だ	助動詞,*,*,*,助動詞-ダ,終止形-一般,ダ,だ,だ,ダ,だ,ダ,和,*,*,*,*,*,*,助動,ダ,ダ,ダ,ダ,*,名詞%F1,*,6299110739157675,22916
EOS

UniDic seems overly verbose and scary at first, but it is very well documented. The columns of particular interest are index 7 and index 10: 語彙素 (lemma) and 書字形基本形 (orthographic base form) respectively. Once again, it would be possible to find 感じる in JMdict from the first result alone, but the second result gives it directly at index 10, with the appropriate inflections.
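One detail worth noting in the UniDic lines above is that some fields are double-quoted and contain commas themselves (for example "動詞%F2@1,形容詞%F4@-2"). Below is a rough sketch of a quote-aware splitter that also checks the field count before reading indices 7 and 10. splitQuotedCsv, UniDicForms and extractForms are hypothetical names, and the sketch assumes quotes never appear escaped inside a field.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Splits a feature string on commas, keeping double-quoted fields together.
// Assumption: quotes only wrap whole fields and are never escaped.
std::vector<std::string> splitQuotedCsv(const std::string &csv)
{
    std::vector<std::string> fields;
    std::string field;
    bool inQuotes = false;
    for (const char ch : csv)
    {
        if (ch == '"')
        {
            inQuotes = !inQuotes; // the quote characters themselves are dropped
        }
        else if (ch == ',' && !inQuotes)
        {
            fields.push_back(field);
            field.clear();
        }
        else
        {
            field += ch;
        }
    }
    fields.push_back(field);
    return fields;
}

// Column positions taken from the discussion above.
constexpr std::size_t UNIDIC_LEMMA_INDEX = 7;      // 語彙素
constexpr std::size_t UNIDIC_ORTH_BASE_INDEX = 10; // 書字形基本形

struct UniDicForms
{
    std::string lemma;
    std::string orthBase;
};

// Fills `forms` and returns true, or returns false without touching `forms`
// when the feature string has too few fields.
bool extractForms(const std::string &feature, UniDicForms &forms)
{
    const std::vector<std::string> fields = splitQuotedCsv(feature);
    if (fields.size() <= UNIDIC_ORTH_BASE_INDEX)
    {
        return false;
    }
    forms.lemma = fields[UNIDIC_LEMMA_INDEX];
    forms.orthBase = fields[UNIDIC_ORTH_BASE_INDEX];
    return true;
}
```

Indices 7 and 10 sit before the quoted field, so a plain comma split would still find them here. The quote handling only matters for later columns, where a plain split would break the quoted field in two and shift everything after it by one.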


UniDic has formatting issues that currently crash Memento

This was possibly solved sometime in the past, because setting UniDic as my system dictionary and changing:

#define WORD_INDEX 6

to

#define WORD_INDEX 7

in dictionary.cpp line 269 makes Memento use UniDic, and it works mostly correctly. (Although, as I mentioned earlier, since it only looks at column 7, it only finds 感ずる and not 感じる.) It is important to note that this is not a shortcoming of MeCab; it is simply not fully implemented in Memento yet.

With the current MeCab implementation, Jumandic with #define WORD_INDEX 4 is a drop-in upgrade over ipadic in every way, since it is a more complete dictionary. Even though UniDic is far more extensive, the way it orders the results happens to be less compatible with the current MeCab implementation in Memento.
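Since the column differs per dictionary (6, 4 and 7 are the WORD_INDEX values mentioned in this thread), a single compile-time #define could in principle be replaced by a per-format lookup with a bounds check. The following is only a sketch under that assumption; DictFormat, wordIndex and dictionaryForm are invented names and do not exist in Memento.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical set of dictionary formats, for illustration only.
enum class DictFormat
{
    IPADic,
    JumanDic,
    UniDic
};

// Column holding the dictionary form for each format.
// The values are the WORD_INDEX numbers discussed in this thread.
std::size_t wordIndex(DictFormat format)
{
    switch (format)
    {
    case DictFormat::IPADic:
        return 6;
    case DictFormat::JumanDic:
        return 4;
    case DictFormat::UniDic:
        return 7;
    }
    return 6; // not reached for valid enum values
}

// Returns the dictionary form, or an empty string when the feature list
// is shorter than expected. It never indexes out of bounds.
std::string dictionaryForm(const std::vector<std::string> &features, DictFormat format)
{
    const std::size_t index = wordIndex(format);
    if (index >= features.size())
    {
        return std::string();
    }
    return features[index];
}
```

A caller would treat an empty return value as "no dictionary form available" and fall back to the surface text, instead of crashing on a short feature list.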

However, for a proper grammar implementation, I suggest basing it on Unidic's format, as it has numerous dictionaries for both modern and classical Japanese.


My C++ knowledge is somewhat limited, and I'm almost entirely unfamiliar with how Memento currently works, but I would gladly help with parsing the output.


EDIT:
I can confirm that with your example subtitle of

〝民間南極観測隊
3年ぶりに派遣決まる〞

and UniDic, Memento does indeed crash. Removing the 〝 and 〞 marks from the subtitles fixed this issue. I'm not entirely sure why it happens, as UniDic itself handles them just fine. However, sanitising the subtitles is a reasonable solution.
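For what sanitising could look like, here is a small sketch that strips a fixed list of byte sequences from a UTF-8 subtitle string before it is handed to MeCab. It assumes the characters to strip are the 〝 (U+301D) and 〞 (U+301E) marks from the example subtitle; stripAll and kQuoteMarks are made-up names, and whether this addresses the root cause of the crash is not established.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Removes every occurrence of each string in `unwanted` from `text`.
// Works byte-wise, which is safe for UTF-8 because one character's
// encoding never appears inside another character's encoding.
std::string stripAll(std::string text, const std::vector<std::string> &unwanted)
{
    for (const std::string &needle : unwanted)
    {
        if (needle.empty())
        {
            continue; // an empty needle would match forever
        }
        std::size_t pos = 0;
        while ((pos = text.find(needle, pos)) != std::string::npos)
        {
            text.erase(pos, needle.size());
        }
    }
    return text;
}

// UTF-8 encodings of U+301D and U+301E (assumed to be the problem characters).
const std::vector<std::string> kQuoteMarks = {"\xE3\x80\x9D", "\xE3\x80\x9E"};
```

Calling stripAll(subtitle, kQuoteMarks) on the two-line example above would leave the Japanese text and the line break intact and drop only the two marks.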

@ripose-jp
Owner

I'm not against it, but there's no way I'm bundling UniDic with Memento due to the size. I doubt I'll be changing away from ipadic either. This is because every distro I know of bundles ipadic as the default MeCab dictionary. Since dictionary formats can't be handled the same way across dictionaries, the safest option is to use ipadic as it has already proven itself as reliable.

That said, I'm not against having an option to use UniDic over ipadic. The implementation details of these are something that I'm more interested in discussing in #210 than here. It would definitely involve having the user download UniDic after the fact in some fashion.

@Calvin-Xu
Contributor Author

It would definitely involve having the user download UniDic after the fact in some fashion

I agree. Even a workaround along the lines of the current yt-dlp one on Mac would be appreciated.

ripose-jp added a commit that referenced this issue Apr 14, 2024
Create a new class called QueryGenerator that defines a class that can
generate queries from a string of text. ExactQueryGenerator implements
this for searching exact text, and MeCabQueryGenerator implements this
for deconjugating text with MeCab. This makes it so these two methods of
generating queries can share all the same code for filtering duplicates,
generating cloze info, and searching the database. This also makes it
easier to implement new query generators in the future such as the
deconjugator in #210 or the proposed MeCab with UniDic option in #211.

This commit removes all multithreading from searches. I have known for
a while that it provides little to no performance improvement due to
most time being spent querying a synchronized database. It may also harm
performance on Windows. I don't see any need to add it back, but it can
be if there is a compelling reason to. Queries seem plenty fast with
this change.
@ripose-jp ripose-jp added the enhancement New feature or request label Apr 14, 2024