Parsing verb forms and helper verbs #109
I'm not 100% sure what you're talking about, but I think you're referring to the Yomichan feature that tells you all the conjugations (if this is incorrect terminology, I apologize; I'm not a linguist) applied to a word. For your example, いけ is classified by Yomichan as imperative. I attempted to implement this briefly, but gave up because the tokenizer Memento uses, MeCab, is very difficult to deal with in terms of how it presents data. I've also noticed no patterns in how it tokenizes conjugations. Yomichan has written its own tokenizer, so it doesn't have to deal with MeCab's shortcomings. Unfortunately, I'm not able to port it to C/C++ for use with Memento due to license incompatibility between the two projects. My options for solving this are:
It's parsing the form just fine, just unlike Yomichan, Memento doesn't highlight conjugations. This is due to how MeCab parses conjugated words.
This is an interesting example. When I put it into MeCab, the word tokenizes like this. This means 泳げる, 泳げなかっ, and 泳げなかった are what's searched when Memento sees this word. There is no entry for any of these in JMdict, so no results are shown. This, again, isn't something I can fix without writing my own tokenizer. We can see that Yomichan is special in that it can correctly parse this; even Jisho doesn't turn up any results: https://jisho.org/search/%E6%B3%B3%E3%81%92%E3%81%AA%E3%81%8B%E3%81%A3%E3%81%9F Edit: I accidentally said 泳げ gets searched when I meant 泳げる.
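The lookup behavior described here can be sketched roughly like this (a toy illustration, not Memento's actual code; the token list mirrors the MeCab segmentation quoted above, and the one-entry dictionary is a stand-in for JMdict):

```python
def candidate_queries(tokens, first_base_form=None):
    """Build dictionary queries from MeCab tokens: each successively
    longer concatenation of token surfaces, with the first token
    replaced by the base form MeCab reports for it, if any."""
    queries = []
    surface = ""
    for i, token in enumerate(tokens):
        surface += token
        # Memento substitutes MeCab's base form for the first token
        # (泳げる rather than the raw surface 泳げ).
        queries.append(first_base_form if i == 0 and first_base_form else surface)
    return queries

# MeCab segments 泳げなかった into these pieces, per the comment above.
tokens = ["泳げ", "なかっ", "た"]
queries = candidate_queries(tokens, first_base_form="泳げる")
print(queries)  # ['泳げる', '泳げなかっ', '泳げなかった']

# None of these is a JMdict headword (only 泳ぐ is), so nothing is found.
jmdict = {"泳ぐ": "to swim"}  # hypothetical stand-in for JMdict
print([q for q in queries if q in jmdict])  # []
```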
My understanding is that Yomichan interprets (explains) inflections with a fairly straightforward lookup table that matches the grammar rules Western instruction uses with kana combinations: https://github.com/FooSoft/yomichan/blob/b40cfe0458f277b1153c3ebc6713305491dbec22/ext/data/deinflect.json, https://github.com/FooSoft/yomichan/blob/89ac85afd03e62818624b507c91569edbec54f3d/test/test-deinflector.js. I honestly don't think there's that much value in this. You can't get the same from MeCab, because MeCab's POS output is aligned with the more traditional 学校文法/橋本文法 taught to native speakers (i.e., you see 未然形 in the screenshot of the output; that's why the segmentation stopped at 怒ら in the second screenshot: 怒られる is interpreted as the 未然形 of the verb 怒る plus the 助動詞 れる). Since MeCab is not 100% correct all the time either, my preference is not to rely on NLP tools for training wheels. As long as Memento's parser figures out the correct headword to look up, I think it's all good. Which brings up one issue I see here that I realize I've also been having: in the case of, e.g., 泳げなかった, Memento tends to fall back to looking up the kanji instead of the verb form. I understand that currently Memento searches 泳げ, 泳げなかっ, and 泳げなかった and finds no results. Since MeCab returns the stem form of 泳げ in its output (泳げる), I think Memento might want to also search for that. Manually looking up 泳げる in Memento returns 泳ぐ, which is what we want. P.S. MeCab's output format should also be configurable: https://stackoverflow.com/questions/5578791/what-is-the-mecab-output-and-the-tagset
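To illustrate the lookup-table idea behind deinflect.json, here is a simplified toy sketch. The three rules below are made up for the example; the real file also carries part-of-speech constraints and rule-chaining metadata:

```python
# Toy rendition of a deinflect.json-style rule table. Each rule maps an
# inflected suffix to the base-form suffixes it might come from.
RULES = [
    ("って", ["う", "つ", "る"]),  # 買って -> 買う / 買つ / 買る
    ("なかった", ["ない"]),        # chains: Xなかった -> Xない
    ("ない", ["る"]),              # 泳げない -> 泳げる (ichidan-style rule)
]

def deinflect(word):
    """Return every candidate produced by one rule application."""
    candidates = []
    for suffix, bases in RULES:
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            candidates += [stem + base for base in bases]
    return candidates

print(deinflect("買って"))  # ['買う', '買つ', '買る']

# Chained application handles 泳げなかった -> 泳げない -> 泳げる:
step1 = deinflect("泳げなかった")  # ['泳げない']
print(deinflect(step1[0]))         # ['泳げる']
```

Each candidate would then be checked against the dictionary, and only hits are shown, which is how 買る and 買つ get discarded.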
@TiredBeeYYB Can you spell out exactly what you want done to resolve this issue? As far as I can tell, the only real problem is the parsing of 泳げなかった.
When I created this issue, I wanted the feature of displaying conjugations and a fix for the issue with missing words. I thought they were two sides of the same coin; looks like they're not.
Thank you for the clarification, I'll see what I can do. |
Please tell me how to adjust the sharpness of the video with the k key. It doesn't work, and I really need it.
@borisorlov21 This isn't related to this issue at all. If you have a question, make a new issue. Every time you post in an unrelated issue, it sends an email to everyone who's commented in the thread. If you want to adjust video sharpness with keys, you will have to bind them in your

Do not reply to this issue if you have questions. Make a new issue instead.
Porting JL's deconjugator (which has a permissive license) is also an option. |
Apache 2.0 is incompatible with GPLv2 due to patent clauses in Apache. Writing my own deconjugator is possible, but implementing it would likely necessitate getting rid of MeCab. If I did that, text matching would likely resemble Yomichan's behavior more than MeCab's, due to the use of look-up tables based on grammar rules. I'm not partial to either approach, but going off Calvin's post, MeCab's results are probably the more "correct" of the two options. Edit: I didn't notice you were the author. You have the option of dual licensing under a compatible license, but that's entirely your choice.
As a stop-gap measure, one can generate dictionaries containing the inflected verb conjugations with the help of this small python script I wrote: https://github.com/precondition/Verb_inflections_JMDict |
This 1 missing feature is the difference between the perfect player and something I can't use at all |
I've found that the reason Memento couldn't deconjugate any verb (e.g., it wouldn't be able to show the results for 怒る if you hover over 怒られる, so all you'd get would be the kanji info for 怒) was that my system MeCab dictionary is JumanDic instead of IPADic or NAIST-jdic. As you can see, it seems to work even better than IPADic. Naturally, I don't have verb form information (e.g., "泳ぐ in the potential, negative, past form"), but that's not as important.
Sorry for the super late response. I'd gladly grant you permission to use the deconjugator in question by dual licensing it, but I don't know if I can legally do that, because JL's deconjugator (i.e., Deconjugator.cs, Form.cs, Rule.cs, VirtualRule.cs) is mostly just a C# rewrite of Nazeka's deconjugator, which is also under the Apache 2.0 license. (https://github.com/wareya/nazeka/blob/master/dict/deconjugator.json is public domain data, though, so https://github.com/rampaa/JL/blob/master/JL.Core/Resources/deconjugation_rules.json can be considered public domain data as well. The difference between those two files is that JL's deconjugation rules additionally support v5aru verbs, zuru verbs, su verbs (partially), v1-s, vs-s, v4r, ざる, and できる -> する deconjugations.)
@precondition I did note in the readme that ipadic is the only supported MeCab dictionary. The only reason I did that is because unidic was crashing Memento (#35), since Memento doesn't do bounds checking on the split feature string. That's not a problem for dictionaries that use ipadic's format, like NAIST-jdic, but it's definitely an issue for unidic. The fact that ipadic isn't the default on Ubuntu is surprising. This is entirely speculation, but I assume the maintainer saw that there were encoding issues with ipadic and made the default a dictionary that doesn't have encoding issues. Not building ipadic as UTF-8 seems to be a common trap for package maintainers to fall into (#101 (comment)). I'd say this is probably an upstream bug for the Debian/Ubuntu MeCab maintainer to handle, though I'd imagine getting the default dictionary changed to ipadic upstream in Ubuntu will be akin to pulling teeth. I will either make a note of it in the troubleshooting section of the readme or maybe make a code change, since you are having to recompile to fix it.

@rampaa I'll look into what it'll take to make it a separate library in C or C++ under LGPLv3. I do have manga-ocr support in Memento now, which is interesting because manga-ocr is an Apache 2.0 licensed library. I think I could legally redistribute binaries since I abstracted the interaction with manga-ocr into a separate library under LGPLv3. I don't actually distribute binaries built with OCR support though, so even if I'm misunderstanding, I haven't violated any licenses. Legal speculation aside, it'll probably be more useful as a separate library anyway. Thanks for the clarification and support.
Bumping because this is also an issue I find to be a very serious barrier to using this player well. Otherwise it's perfect. Has a fix been integrated?
Would be very much appreciated, as making useful verb cards for Anki is almost impossible without it. As mentioned above, for example, -ます is not parsed correctly, which immediately invalidates a significant percentage of conjugated verbs. EDIT: The issue seems to only be present in the flatpak version. Maybe that doesn't have access to MeCab? In any case, after building Memento locally, it correctly scans according to MeCab. Some edge cases remain, such as
which properly finds |
怒られる appears to be parsed normally on my install (not Flatpak), so what you're seeing is definitely a bug. If Memento can't find the MeCab dictionary, it should throw up an error message that the user needs to click through before Memento starts. Since you're not reporting such a thing happening, the next most likely cause is that the dictionary does exist, but there's some sort of encoding issue. This has been the cause of various problems on NixOS (#101). I'll look into it more. Also, thank you for mentioning the
Thank you for taking the time!
I think you slightly misunderstood what I meant. On a non-flatpak install, it finds
So there are 2 completely separate issues at play here:
Thank you again for your time! EDIT:
which lists MeCab can be difficult to get an initial grasp of, but it works leagues better than Yomichan ever could. It could be used to add tags or whatnot to the popup describing the exact verb form, the corresponding inflectional suffix, etc. If you need help making sense of the output, I will gladly help.
Just curious, do you have concrete examples in mind to back up this claim? |
Yomichan guesses what a word is based on how it looks, and then tries to deinflect it and match that with your dictionary. For example, if you scan 買って, Yomi will see て and search the dictionary for 買う, 買る, and 買つ. When it finds 買う as a 五段 verb, it tells you that it's the て form of 買う. 買います is looked up in a similar fashion, and then it tells you that it's the ます stem of 買う. MeCab, on the other hand, has an actual dictionary behind it, audited by Japanese linguists (JUMANDic by Kyoto University, IPADic by the IPA, UniDic by NINJAL). To give you a direct comparison:
For the case I mentioned in my previous comment with
In particular this part: MeCab can be a bit unwieldy at first glance, but it is very powerful, extensive and pretty much guaranteed to be correct. |
@BigBoyBarney The Flatpak version has been updated to fix the ipadic encoding issue. Thank you for reporting it. |
I don't understand what you're suggesting here. Do you mean parsing the output of MeCab to derive Yomichan-style inflection explanations (
I don't see what's wrong with this approach. It will find all possible entries in the dictionary, unless I'm misunderstanding something. Using MeCab is not guaranteed to be correct. If a form is missing from the MeCab dictionary, it will not be able to find it in the external dictionary. For example, 労った is the past tense of both 労う (ねぎらう) and 労る (いたわる); MeCab+JumanDic finds both, but MeCab+IPADic only finds 労う (ねぎらう). 貴んだ is the past tense of both 貴ぶ (とうとぶ) and 貴む (たっとむ); MeCab+IPADic finds both, but MeCab+JumanDic only finds 貴む (たっとむ). Yomichan is able to handle this just fine.
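A deinflector handles this ambiguity naturally because it generates every grammatically possible base form and then keeps only those the dictionary actually contains. A minimal sketch, assuming a hypothetical two-entry dictionary and only the godan ~った rules:

```python
def past_tense_candidates(word):
    """Candidates for a ~った past form (godan u-row / tsu-row / ru-row
    only; a real deinflector would cover every conjugation class)."""
    if not word.endswith("った"):
        return []
    stem = word[:-2]
    return [stem + ending for ending in ("う", "つ", "る")]

# Hypothetical stand-in dictionary; readings from the comment above.
jmdict = {
    "労う": "to thank for one's efforts (ねぎらう)",
    "労る": "to care for, to be kind to (いたわる)",
}

candidates = past_tense_candidates("労った")
print(candidates)                           # ['労う', '労つ', '労る']
print([c for c in candidates if c in jmdict])  # ['労う', '労る'] — both found
```

Because the dictionary itself does the filtering, coverage doesn't depend on which MeCab dictionary happens to list which form.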
Is the issue here with Yomi that it uses Western-style grammar rules for its parsing and inflection explanations rather than traditional native grammar?
Why is this useful if the only application of MeCab is to look up an external dictionary and display it to the user?
@ripose-jp I'm interested in seeing inflection explanations implemented like Yomichan has (the "potential < negative < past" entries in the definition). I've made a fork that replaces MeCab with a deconjugator I wrote from scratch: https://github.com/spacehamster/Memento/tree/Deconjugation. I'm not sure if here is the best place to discuss it or in a new pull request; I wanted some feedback that the general approach was on the right track before I spend too much time ironing out all the issues, and I don't feel it's currently polished enough to merge upstream. I used a deconjugator approach because:
I'm not opposed to working with MeCab if that is preferable. I wrote it without looking at Yomichan's code, though I did use Yomichan itself to verify correctness and for display names; I don't believe that falls under Yomichan's licensing. Some issues with the current implementation I'm aware of that I have delayed dealing with:
|
@spacehamster Looks awesome. This isn't the best place to go back and forth since everyone in the issue gets an email when a new post is made. Make a pull request and we can talk there. It looks really, really good so there's pretty much a 100% chance this is going to get merged in some form.
I think it's a better approach for the reasons you stated and I don't need to ship a 50MB dictionary file with my releases.
I chose MeCab since it was more or less a drop-in solution to the deconjugation problem. I'm more than happy to sideline it if something better comes along.
Great.
This seems like an easy thing to fix by buffering queries and removing duplicates. I already do this when handling MeCab queries, so it wouldn't be a big problem to deal with.
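Order-preserving deduplication of buffered queries is a one-liner in most languages; a Python sketch (the candidate list is hypothetical):

```python
def dedupe_queries(queries):
    """Drop duplicate lookup candidates while keeping first-seen order
    (dicts preserve insertion order in Python 3.7+)."""
    return list(dict.fromkeys(queries))

# Two deconjugation paths can yield the same base form; search it once.
candidates = ["食べる", "食べる", "食べない", "食べる"]
print(dedupe_queries(candidates))  # ['食べる', '食べない']
```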
Take your time, there's no rush.
For nonsense like 食べなくなくなく, I personally don't mind your deconjugator handling it since it's unlikely such a phrase would appear in subtitles. If the behavior exists, but the user isn't going to see it unless they go looking, I don't think it's a problem.
As long as it returns both possible results, either in two separate search results or by showing something like "Potential < Negative or Negative", it's fine.
I'd need to see an example of what you're talking about. It would be awkward for conjugation information to show up on a search result for 歩 when the query was 歩けない. It would not be strange for 歩 to show up though. Rather it's preferable so long as it appears below 歩く.
I'm going to have you put inflection explanations on their own line between the word and the term tags. Using a label with word wrap enabled should suffice. I'm worried about the color of the text, since it should theme well while still being more contrasty than gray on gray. We can work on all this in the PR though. Dealing with the UI is the easy part.
@spacehamster This is my last comment on this issue, in order to not bother everyone who's subscribed to it. With all due respect, writing your own deinflector is a massive waste of time. The Japanese have already solved this problem; there is no need to reinvent the wheel. If you want to make sure you catch everything, just use an appropriate version of UniDic. It has everything under the Sun. The basic spoken Japanese version finds all of the different versions of the verbs you mentioned and has none of the issues a Western-written deinflector would have, in addition to being roughly 10 billion times faster. I highly recommend reconsidering the Western deinflector approach and instead parsing the appropriate MeCab output; not only is it much less effort than writing a comprehensive deinflector, it's also more extensive, given the appropriate dictionary.
Just commenting to say thank you for updating support for this. It's made the program really really useful for me. Thank you again! |
Sometimes it can parse the verb form, but it doesn't show what form it is
Sometimes it can parse the word, but not the form
Sometimes it can't parse the word at all