Replace mecab with a deconjugation parser and display inflection explanations on definitions #210

spacehamster · 2024-04-06T04:29:48Z

Related to #109

The feature is still in early stages and I'm opening a pull request to discuss it further.

I've created a graph to give a very quick broad overview of how the deconjugator currently functions.

It works by trying to find a path from any node to a terminal node (godan verb, ichidan verb, irregular verb or adjective). This graph only shows the godan verb for simplicity.

The rules were added ad-hoc as I encountered examples of them, so they aren't currently comprehensive.

One option moving forward is that instead of specifying each transition one by one (such as { u8"せる", u8"せられる", WordForm::causative, WordForm::potentialPassive } for causative < passive), a hidden transition can be added ({ u8"せる", u8"せる", WordForm::causative, WordForm::ichidanVerb},), so that any conjugations that can apply to ichidan verbs can be applied to the causative form, but this allows deconjugation of forms you'd never see in real life such as 話されさせる (passive < causative), which is in line with how yomichan seems to work.

It produces redundant queries to the database (eg 食べています will query 食べる multiple times for the derivations "-te < progressive or perfect < polite", "-te" and "masu stem")

This seems like an easy thing to fix by buffering queries and removing duplicates. I already do this when handling MeCab queries, so it wouldn't be a big problem to deal with.

I didn't expect this to be difficult, I haven't given it much consideration because it's low on the priorities and it ties into a few other points mentioned later. it seems to be a matter of determining which part of the code is responsible for filtering duplicates: the deconjugator itself, the dictionary before it queries to the db, or the dictionary after it receives terms from the db.

Order of results is unintuitive, returning 言う "Potential < Negative" before 言える "Negative" for the query 言えない.

As long as it returns both possible results, either in two separate search results or by showing something like "Potential < Negative or Negative", it's fine.

I picked this example because jmdict has two separate entries for 言う and 言える, so they should be presented as two separate search results. My concern was that the longer dictionary form should be shown first because it is more likely to be relevant. It might be more clear with the example 成り立ち in the sentence この学校の成り立ちをお話しましょう, the noun form 成り立ち is more relevant, but the verb form 成り立つ is shown first with the masu stem deconjugation. In this example the dictionary form is the same length so results of the same length from the ExactWorker should be priotized over the DeconjugationWorker.

Deconjugated queries will match nouns, need to be filtered somehow. Are terms guaranteed to have part-of-speech information attached?

I'd need to see an example of what you're talking about. It would be awkward for conjugation information to show up on a search result for 歩 when the query was 歩けない. It would not be strange for 歩 to show up though. Rather it's preferable so long as it appears below 歩く.

If you search しよう (to do, volitional), it will attempt to deconjugate しよう to しる using the volitional rule, and then will return 汁 (juice/sap). This would be a valid result if 汁 was a verb but it's a noun. Another example is けれど to く (imperative) and finds the noun 句. If part of speech tags aren't guaranteed, terms without okurigana can be assumed to be nouns and filtered that way.

Long inflection explanations are clipped by the anki buttons (see image). I don't know enough about QT UI to solve this, is it possible to put them on a new line only when they're too long, or should they be allowed to clip and have a mouse over tooltip so they can still be read?

I'm going to have you put inflection explainations on its own line between the word and the term tags. Using a label with word wrap enabled should suffice. I'm worried about the color of the text since it should theme well while still being more contrasty than gray on gray. We can work on all this in the PR though. Dealing with the UI is the easy part.

Because inflection explanations are usually less relevant to the user then the english translation I wanted to minimize the amount of vertical space it took up, but if this is difficult to do with QT, I don't think it's worth much effort trying.

spacehamster · 2024-04-06T05:54:47Z

@BigBoyBarney I still don't understand the extent of your suggestion. Do you mean to parse the output of mecab to derive a yomichan style inflection explainations (negative < tara for 温かくなかったら, potential < negative < past for 泳げなかった) , or to show the raw mecab tags (基本連用形 < タ系条件形 for 温かくなかったら, 未然形 < タ形 for 泳げなかった) ?

just use an appropriate version of Unidic

Unidic is pretty big at ~500 MB zipped, ~1GB unzipped for the light weight version.

has none of the issues a western written deinflector would have

I'm not clear on what those issues are.

in addition to being roughly 10 billion times faster

I don't believe performance of either mecab or a deinflector is an issue. I did some very quick benchmarks and they were on the same order of performance, and the dictionary queries as a whole were dwarfed by the Qt UI code that took like ~90-99% of the runtime.

Calvin-Xu · 2024-04-06T06:30:26Z

I don't really have any preferences on this feature as I have been fine without it until now. Just some of my thoughts here:

As I mentioned a long time ago in #109 (comment), I don't think it is very productive to try to make MeCab's output match western style explanations (often known collectively as 日本語教育文法, but differs between instructions) as MeCab's output is aligned with the more traditional 学校文法 that's taught to native speakers. Here's a table listing some of the differences:

excerpt taken from 考えて、解いて、学ぶ日本語教育の文法 (https://www.3anet.co.jp/np/books/4692/)

I really do think natural language has too many edge cases and no rule-based parser (MeCab included) has been accurate enough for me to have been able to rely on it during learning. If we really add this feature, and don't want to learn school grammar, I think a custom parser would probably be better than trying to shoehorn MeCab output.

However, to be honest I feel the most helpful thing in this day and age is probably to pass the context to some LLM server you have running and ask it to explain the usage of the word being looked up.

Calvin-Xu · 2024-04-06T07:15:20Z

For a very simple example where MeCab fails:

人に委ねるなってことですか？

MeCab with IPAdic fails to segment the text properly at "人に委ねるな", "ってことですか" and thinks there's a なる. Bad segmentation from MeCab is really the issue that affects Memento, though the addition of the search/lookup feature ameliorates that: if the word we want to look up cannot be selected, just type it with Ctrl-R

人	名詞,一般,*,*,*,*,人,ヒト,ヒト
に	助詞,格助詞,一般,*,*,*,に,ニ,ニ
委ねる	動詞,自立,*,*,一段,基本形,委ねる,ユダネル,ユダネル
なっ	動詞,自立,*,*,五段・ラ行,連用タ接続,なる,ナッ,ナッ
て	助詞,接続助詞,*,*,*,*,て,テ,テ
こと	名詞,非自立,一般,*,*,*,こと,コト,コト
です	助動詞,*,*,*,特殊・デス,基本形,です,デス,デス
か	助詞,副助詞／並立助詞／終助詞,*,*,*,*,か,カ,カ
？	記号,一般,*,*,*,*,？,？,？
EOS

with Unidic it works and is probably the best thing for inflections, but I also understand the reasons this project doesn't want to use it.

人	ヒト	ヒト	人	名詞-普通名詞-一般		

に	ニ	ニ	に	助詞-格助詞		

委ねる	ユダネル	ユダネル	委ねる	動詞-一般	下一段-ナ行	終止形-一般

な	ナ	ナ	な	助詞-終助詞		

って	ッテ	ッテ	って	助詞-副助詞		

こと	コト	コト	事	名詞-普通名詞-一般		

です	デス	デス	です	助動詞	助動詞-デス	終止形-一般

か	カ	カ	か	助詞-終助詞		

？			？	補助記号-句点		

+			＋	補助記号-一般		

EOS

BigBoyBarney · 2024-04-06T08:30:49Z

I don't think it is very productive to try to make MeCab's output match western style explanations
[...]
and don't want to learn school grammar

That was my initial point. Western grammar is not, and will not be applicable to Japanese to a satisfactory extent. I don't like Yomi's deinflection explanations either, so I was not suggesting implementing the exact same functionality with MeCab. This misunderstanding was possibly a mistake on my part, I should have clarified my point more.

no rule-based parser (MeCab included) has been accurate enough for me to have been able to rely on it during learning.
[...]
with Unidic it works

Unidic, given the appropriate dictionary (現代書き言葉, 現代話し言葉 or a version of 古文用 if you're reading classical Japanese), always works. It might not be the first entry that's applicable to the given context, but if you use a -N10 or similar flag to capture more outputs, I guarantee that the answer will be there.

I'm not clear on what those issues are.

My main gripe with it is that it's not how the language works. I would much prefer simply tabulating the MeCab output directly in a dropdown of some sort, so it can be viewed from within Memento. I'm already using it in a separate window while watching or reading, so this would streamline my workflow. However, I understand that a lot of people, allegedly the majority who use Memento, would prefer direct western grammar equivalents.

I only ask that if this ends up being merged, please keep the current MeCab parser as a togglable option as well.

Calvin-Xu · 2024-04-06T09:07:24Z

Ultimately prescriptive grammar are academic constructs and I personally do not believe it is necessary to learn either system to be able to accurately understand contemporary Japanese given one consumes enough content, which after all is the postulate of language acquisition through mass immersion.

I must say the case for 学校文法, besides most academic texts in Japanese use its terms, is that it is just indispensable for analyzing classical Japanese, whereas 教育文法 is only fitted on contemporary Japanese and does not generalize. Though I doubt that is the main use case for Memento by most users. @BigBoyBarney If you have a workflow and materials that you learn classical Japanese in Memento with ([lesson?] videos and accurate transcriptions / subtitles), please send me an email and I'd be very interested.

In that case again Memento could use better segmentation or just custom selection of text to look up, but that is a different matter.

Calvin-Xu · 2024-04-06T09:40:47Z

I only ask that if this ends up being merged, please keep the current MeCab parser as a togglable option as well.

I don't think there's a current MeCab parser option. Unless I've lost track of recent development, Memento uses MeCab to do segmentation only, to get the substring to look up when you mouse over it. I also hope this can still be an option.

This PR adds a "Deconjugator" which Memento never had before. @spacehamster actually how do you mean replacing MeCab with this? I assume you are not removing it. Are you giving progressively longer inputs to your deconjugator?

spacehamster · 2024-04-06T10:48:05Z

This PR adds a "Deconjugator" which Memento never had before. @spacehamster actually how do you mean replacing MeCab with this? I assume you are not removing it. Are you giving progressively longer inputs to your deconjugator?

The initial idea was to remove mecab completely. The code that uses mecab is currently commented out in the PR. Mecab is only used to find any possible deconjugated forms in a sentence, so if you give it 行かれる途中でしたか？, mecab will generate 行く, 行かれる途中です, 行かれる途中でる, 行かれる途中だ, 行かる which are then used to search the dictionary. It's more like partial segmentation because memento relies on the user cursor position to find the start of a word.

The deconjugator does the same thing, but generates an inflection explanations while it's at it.

Are you giving progressively longer inputs to your deconjugator?

Yes, for the sentence 行かれる途中でしたか？ it will generate the queries

行かれる途中でしたかる
行かれる途中でしたる
行かれる途中でしる
行かれる途中でする
行かれる途中です
行かれる途中でる
行かれる途中る
行かれる途る
行かれるる
行かる
行く
行かれる
行る

I've been thinking about it more, and I think it may be a good idea to leave mecab in and have it be an option to choose one method or the other. Classical japanese is a bit of an niche example, but I think a more likely scenario is obscure grammar points and dialects like kansai-ben, yomichan cannot parse stuff like 戦わざるをえない or 行かへん which mecab can manage.

Calvin-Xu · 2024-04-06T21:52:39Z

@spacehamster I think this sounds good. I personally won't be needing it to save some cycles due to how intensive Qt + mpv already is. I do wonder about the performance when starting to parse from the leftmost position. Though the length of a line should be bounded my dictionary db is almost 3GB.

More than anything I think we should sit back and make sense of the terminology in this discussion. First, MeCab is a tokenizer and morphological analyzer. If I understand correctly, your "deconjugator" is also a morphological analyzer that needs the stem/root form of the word.

You still use MeCab's morphological analyzer to get the stem, it's just you want to generate the western style explanation instead of using MeCab's output. Can you also clarify if/how you are using MeCab's tokenization?

I think we should clearly define things and figure out consistent terminology to use, then decide the direction we should take.

spacehamster · 2024-04-07T01:22:14Z

@Calvin-Xu I'm primarily motivated by seeing inflection explanations implemented. I used a deconjugator approach because:

Mecab can fail to find the dictionary form of words if they are missing from the mecab dictionaries, a real world example is failing to find いびる in the sentence パトリックのお嫁さんをイビったりしないから！. A hypothetical case is trying to search 労った, Mecab+Ipadic will only find 労う(ねぎらう) and not 労る(いたわる).
Deconjugating is more straightforward then trying to conjugate mecab's dictionary form and finding a conjugation that fits the sentence or parsing mecab tokens and trying to construct a yomichan style inflection explanation from the tokens.

If I understand correctly, your "deconjugator" is also a morphological analyzer that needs the stem/root form of the word.

I'm interpreting stem/root form to mean the plain/dictionary form such as 食べる and not 食べ. The deconjugator doesn't take the dictionary form as input, but emits it as output, so it replaces mecab's functionality and doesn't use mecab at all.

A pseudocode simplication of how searchTerms in dictionary.cpp currently works.

function searchTerms(query, subtitle)
{
    //query is substring of subtitle, in this example
    //query=行かれる途中でしたか？
    //subtitle=王立学園に行かれる途中でしたか？
    let terms = [];
    for(let i = 1; i < query.length; i++)
    {
        //Search for exact matches
        //行, 行か, 行かれ, etc
        terms += db.search(query.substring(0, i))
    }
    let mecabQueries = generateMecab(query);
    //generateMecab produces a list of surface and deconj forms using mecab
    //only deconj is used for searching the db, surface is used for highlighting the subtitle and clozebody
    //[{ deconj: "行く", surface: "行か" },
    // { deconj: "行かれる途中です", surface:"行かれる途中でし" },
    //...snip
    // { deconj: "行かる", surface:"行かれ" }]
    for(let mecabQuery of mecabQueries)
    {
        terms += db.search(mecabQuery);
    }
    terms = terms.sorted();
    emit termsChanged(terms); //Let popup widget know there are new terms
}

and the proposed change

function searchTerms(query, subtitle)
{
    //query is substring of subtitle, in this example
    //query=行かれる途中でしたか？
    //subtitle=王立学園に行かれる途中でしたか？
    let terms = [];
    for(let i = 1; i < query.length; i++)
    {
        //Search for exact matches
        //行, 行か, 行かれ, etc
        terms += db.search(query.substring(0, i))
    }
    let deconjugatorQueries = deconjugate(query);
    //deconjugate produces a list of deconj forms, surface forms, and inflection explainations
    //without using mecab
    //only deconj is used for searching the db, surface is used for highlighting the subtitle and clozebody
    //[{ deconj: "行かれる途中でしたかる", surface: "行かれる途中でしたか", explaination: "masu stem"　},
    // { deconj: "行かれる途中でしたる", surface: "行かれる途中でした", explaination: "masu stem"　},
    //...snip
    // {deconj: "行く" surface: "行かれる", explaination: "passive" },
    // {deconj: "行る" surface: "行", explaination: "masu stem" },
    // {deconj: "行かる" surface: "行かれ", explaination: "imperative" } ]
    for(let deconjugatorQuery of deconjugatorQueries)
    {
        terms += db.search(deconjugatorQuery);
    }
    terms = terms.sorted();
    emit termsChanged(terms); //Let popup widget know there are new terms
}

I do wonder about the performance when starting to parse from the leftmost position.

Are you imagining the search is done progressively, showing the user results one by one as they are found? The current implementation (both mecab and deconjugator) searches the database in one go and collects all the definitions before showing them to the user.

Though the length of a line should be bounded my dictionary db is almost 3GB.

Searches from mousing over subtitles currently bounded with a max length of 37 characters (for both mecab and deconjugator). I don't believe the search widget has a bound.

I did some quick performance logging for what I believe is a typical case, which was enough to convince me performance was not a major concern as the QT UI was the main bottleneck. I have not looked at worst case pathological cases, my sqlite dictionary is 600 MB so it may get worse if it's 3GB with very large inputs.

mecab_perf_log.txt

deconjugator_perf_log.txt

I think it would be a good idea to have a global settings flag to switch between the original method and the new method, because there are cases where a deconjugator will fail like kansai-ben 行かへん, but as a bonus it would also allow you to compare performance side by side.

Calvin-Xu · 2024-04-07T10:27:23Z

Great. Let's just wait for what @ripose-jp thinks. I think this accomplishes the goal of having the feature Yomichan/Yomitan has. Given the renewed interest in UniDic, I have opened another issue on supporting it #211

ripose-jp · 2024-04-12T04:09:35Z

I'm not interested in discussing the merits of 教育文法 vs 学校文法 because I am unqualified to do so. I read Tae Kim's Grammar Guide and have mostly just learned by feel since. That said, if this works well, it will replace MeCab as the default. This style of deconjugation is what most Memento users will likely want as it is more consistent with the way most people learn Japanese as a second language. I'll probably make enabling MeCab support a compile time option so I can drop the dependency on the binaries I distribute. I promise that MeCab support will remain for those that want it.

To achieve this multiple deconjugation approach, I believe it's probably for the best that I rework how searching currently works. Right now it's a bit of a rats nest that hard codes a start to finish algorithm. I think a pipeline approach would be best. It would look something like

Query Generator → Query Merge → Duplicate Filter → Database Query → Results

The query generation step would be the most relevant to this PR. I'll probably make an interface class that query generators can implement. Then I'll rework the code so existing generators are implemented in something like ExactGenerator, MeCabGenerator, and MultiGenerator. This way query generators are much more modular and don't require so much additional code to implement. This PR would be mostly contained to creating a DeconjugationGenerator save for a few potential changes to the Database Query step to fix some problems with the PR. This also opens the door to other generators such as a UniDic generator in the future. I don't expect you to implement this yourself, so I ask that you hold off sweating the details for how your code fits into Memento at the moment. I'll try to get this done this weekend.

the noun form 成り立ち is more relevant, but the verb form 成り立つ is shown first with the masu stem deconjugation

I agree. There is probably something that can be changed in the sorting algorithm to make this happen. I don't expect it to be too difficult.

If you search しよう (to do, volitional), it will attempt to deconjugate しよう to しる using the volitional rule, and then will return 汁 (juice/sap). This would be a valid result if 汁 was a verb but it's a noun. Another example is けれど to く (imperative) and finds the noun 句. If part of speech tags aren't guaranteed, terms without okurigana can be assumed to be nouns and filtered that way.

Yomichan does a fairly poor job of finding しよう as する considering it doesn't even rank. I don't think this is such a big deal considering that しよう is fairly common word that happens to be an exception to standard rules, so most people will learn it on it's own rather than in the greater grammatical framework. Likewise, most dictionaries will have an entry for けれど, so as long as that ranks above any irrelevant results, I believe it's not a big issue. Of course, try to fix these things if you can, but don't sweat them if you can't.

Because inflection explanations are usually less relevant to the user then the english translation I wanted to minimize the amount of vertical space it took up, but if this is difficult to do with QT, I don't think it's worth much effort trying.

Your deconjugator is providing valuable information, so the vertical space is well warranted.

Overall, great PR and I have no reason to believe this won't be merged eventually.

spacehamster · 2024-04-12T11:10:02Z

I've pushed the current state that I have that has some very minor changes, i've been holding off making any big changes. I've added some more deconjugation forms. I added a quick and easy query filter because the duplicates were annoying me. I added a config flag that allows for choosing between mecab and deconjugator, I don't think this is a great option for normal users because it requires technical understanding of the difference between parsing methods to know what it does, but I found it useful to switch between the two quickly for testing.

Create a new class called QueryGenerator that defines a class that can generate queries from a string of text. ExactQueryGenerator implements this for searching exact text, and MeCabQueryGenerator implements this for deconjuating text with MeCab. This makes it so these two methods of generating queries can share all the same code for filtering duplicates, generating cloze info, and searching the database. This also makes it easier to implement new query generators in the future such as the deconjuator in #210 or the proposed MeCab with UniDic option in #211. This commit removes all multithreading from searches. I have known for a while that it provides little to no performance imporvement due to most time being spent querying a synchronized database. It may also harm performance on Windows. I don't see any need to add it back, but it can be if there is a compelling reason to. Queries seem plenty fast with this change.