MeCab UniDic support #211
Comments
Thank you for this!
How MeCab works
Just to make sure that everybody reading this issue is on the same page regarding how MeCab works, I will try to explain it concisely. Different dictionaries structure their output differently, so whatever Memento implements will have to be dictionary-specific. Ipadic is by far the worst one available. Jumandic works most of the time out of the box. Unidic can be made to work every time, but it is currently not a drop-in replacement. Let's look at 2 examples with
The first result from Jumandic returns 感ずる, the second returns 感じる. It is obviously possible to find both of these forms in JMdict just by looking up the stem, but that is currently beside the point. Looking at index 4 across multiple MeCab results gives us the verbs, along with the inflections.
UniDic
UniDic seems overly verbose and scary at first, but it is very well documented. The columns of particular interest are indices 7 and 10: 語彙素 (lexeme) and 書字形基本形 (orthographic base form), respectively. Once again, it would be possible to find 感じる in JMdict from the first result alone, but the second result gives it directly, with the appropriate inflections.
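To make the dictionary-specific column lookup concrete, here is a minimal sketch of picking the dictionary form out of a comma-separated MeCab feature string. It is not Memento's code: `DictFormat`, `lemmaIndex`, `splitFeatures`, and `lemmaFromFeature` are hypothetical names. The indices 4 (Jumandic) and 7 (UniDic) are the ones named in this comment, and 6 is the `WORD_INDEX` value quoted elsewhere in this thread. The handling of double-quoted fields is an assumption about how fields containing commas are printed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical enum for illustration; not part of Memento or MeCab.
enum class DictFormat { IPADic, JumanDic, UniDic };

// Column that holds the dictionary form for each format.
// 4 and 7 come from this thread; 6 is the WORD_INDEX value quoted in it.
std::size_t lemmaIndex(DictFormat format)
{
    switch (format)
    {
    case DictFormat::IPADic:   return 6;
    case DictFormat::JumanDic: return 4;
    case DictFormat::UniDic:   return 7;
    }
    assert(false && "unhandled DictFormat");
    return 6;
}

// Splits a feature string on commas. Commas inside double quotes are kept
// as part of the field (assumption: quoted fields may contain commas).
std::vector<std::string> splitFeatures(const std::string &feature)
{
    std::vector<std::string> fields;
    std::string current;
    bool quoted = false;
    for (char c : feature)
    {
        if (c == '"')
        {
            quoted = !quoted;
        }
        else if (c == ',' && !quoted)
        {
            fields.push_back(current);
            current.clear();
        }
        else
        {
            current += c;
        }
    }
    fields.push_back(current);
    return fields;
}

// Returns the dictionary form, or an empty string when the column is
// missing or holds "*", which MeCab prints for an empty feature.
std::string lemmaFromFeature(const std::string &feature, DictFormat format)
{
    const std::vector<std::string> fields = splitFeatures(feature);
    const std::size_t index = lemmaIndex(format);
    if (index >= fields.size() || fields[index] == "*")
    {
        return std::string();
    }
    return fields[index];
}
```

Choosing the index at run time rather than through a compile-time define is only one possible design. It would let a dictionary setting switch formats without rebuilding.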
This was possibly solved sometime in the past, because setting UniDic as my system dictionary and changing: #define WORD_INDEX 6 to #define WORD_INDEX 7 in
With the current MeCab implementation, Jumandic with
However, for a proper grammar implementation, I suggest basing it on Unidic's format, as it has numerous dictionaries for both modern and classical Japanese. My C++ knowledge is somewhat limited, and I'm almost entirely unfamiliar with how Memento currently works, but I would gladly help with parsing the output.
EDIT:
and Unidic, Memento does indeed crash. Removing
I'm not against it, but there's no way I'm bundling UniDic with Memento due to its size. I doubt I'll be changing away from ipadic either, because every distro I know of bundles ipadic as the default MeCab dictionary. Since output formats can't be handled the same way across dictionaries, the safest option is to use ipadic, as it has already proven itself reliable. That said, I'm not against having an option to use UniDic over ipadic. The implementation details are something I'm more interested in discussing in #210 than here. It would definitely involve having the user download UniDic after the fact in some fashion.
I agree. Even a workaround like the current one for yt-dlp on Mac would be appreciated.
Create a new class called QueryGenerator that defines an interface for generating queries from a string of text. ExactQueryGenerator implements this for searching exact text, and MeCabQueryGenerator implements this for deconjugating text with MeCab. This lets the two methods of generating queries share the same code for filtering duplicates, generating cloze info, and searching the database. It also makes it easier to implement new query generators in the future, such as the deconjugator in #210 or the proposed MeCab with UniDic option in #211. This commit removes all multithreading from searches. I have known for a while that it provides little to no performance improvement because most time is spent querying a synchronized database. It may also harm performance on Windows. I don't see any need to add it back, but it can be if there is a compelling reason to. Queries seem plenty fast with this change.
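The structure this commit message describes could look roughly like the sketch below. It is an illustration under stated assumptions, not the code from the commit: `SearchQuery`, `generate`, `chopLastCodePoint`, and `collectQueries` are hypothetical names and signatures. The behavior shown for `ExactQueryGenerator` (one query per UTF-8 prefix of the text) is a guess at what "searching exact text" means. Only duplicate filtering is shown as the shared step.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical query type: the text to look up and the text it came from.
struct SearchQuery
{
    std::string deconj;   // form to search for in the dictionary database
    std::string surface;  // form as it appeared in the original text
};

// Common interface every query generator implements.
class QueryGenerator
{
public:
    virtual ~QueryGenerator() = default;
    virtual std::vector<SearchQuery> generate(const std::string &text) const = 0;
};

// Removes the last UTF-8 code point: pops continuation bytes (10xxxxxx)
// and then the lead byte that precedes them.
void chopLastCodePoint(std::string &text)
{
    while (!text.empty())
    {
        const unsigned char byte = static_cast<unsigned char>(text.back());
        text.pop_back();
        if ((byte & 0xC0) != 0x80)
        {
            break;
        }
    }
}

// Assumed behavior: query the text itself and each shorter prefix of it.
class ExactQueryGenerator : public QueryGenerator
{
public:
    std::vector<SearchQuery> generate(const std::string &text) const override
    {
        std::vector<SearchQuery> queries;
        for (std::string prefix = text; !prefix.empty(); chopLastCodePoint(prefix))
        {
            queries.push_back({prefix, prefix});
        }
        return queries;
    }
};

// Shared step: run every generator and keep the first copy of each query.
std::vector<SearchQuery> collectQueries(
    const std::vector<const QueryGenerator *> &generators,
    const std::string &text)
{
    std::vector<SearchQuery> unique;
    std::unordered_set<std::string> seen;
    for (const QueryGenerator *generator : generators)
    {
        assert(generator != nullptr);
        for (SearchQuery &query : generator->generate(text))
        {
            if (seen.insert(query.deconj + '\n' + query.surface).second)
            {
                unique.push_back(std::move(query));
            }
        }
    }
    return unique;
}
```

Under this layout, a UniDic-backed generator would be one more subclass passed to `collectQueries`, with no change to the shared filtering code.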
Given the recent discussion in #109 touching on how UniDic may offer better tokenization and possibly morphological analysis (offering what #210 does, but with Japanese school grammar), I wonder if it might be good to consider supporting UniDic in Memento again. For an overview of the different tokenizer dictionaries, see https://www.dampfkraft.com/nlp/japanese-tokenizer-dictionaries.html. tl;dr:
The previous discussions about supporting UniDic were in #35 and #109. The main issues are:
@ripose-jp has said that
If just using UniDic can improve the experience, it may be beneficial to support UniDic specifically, since it seems to be the only thing worth using over IPADic. (edit: to clarify, I don’t mean it should be bundled or made the default, due to its large size).
As a side note on things newer than MeCab (which is not maintained): Sudachi's Apache 2.0 license is incompatible with the GPLv2 license of this project.