Incorrect lemmatization with Japanese. #256
My database probably has hundreds of improper lemmas in it by now. I don't really mind, but it would be nice to be able to spin through them once this gets fixed and apply proper lemmas where possible. It would probably require keeping a version of the current incorrect lemmatization algorithm and checking the database against it before issuing a correction. That way, quashing the manually entered lemmas could be avoided.
Hi!
There is a planned import option that would allow you to override or fill in empty readings and lemmas.
That's actually a really smart idea! Sadly, I think this issue lies with the tokenizer itself. But I'll check; maybe they are two words that I combined after the tokenizer step, in which case I can and will fix it. The same issue exists with readings for combined words. If this is due to the post-processing that combines words, it might take a while to address, because the post-processing will have to be moved from PHP to Python, so it can access the lemmatizer and reading generation quickly enough to correct those words after combining them.
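A minimal sketch of the problem described above (the `Token` class and `combine_naive` helper are illustrative, not the project's actual PHP code): if the merge keeps a part's lemma instead of recomputing one for the merged surface form, a stale lemma like 落とせる survives on the combined word.

```python
from dataclasses import dataclass

@dataclass
class Token:
    surface: str  # text as it appears in the sentence
    lemma: str    # dictionary form produced by the tokenizer

def combine_naive(tokens: list) -> Token:
    """Merge tokens by concatenating surfaces while keeping the first
    part's lemma — a guess at how the stale おとせる lemma could arise."""
    return Token(
        surface="".join(t.surface for t in tokens),
        lemma=tokens[0].lemma,
    )

# 落とせません is tokenized as 落とせ | ませ | ん.
parts = [Token("落とせ", "落とせる"), Token("ませ", "ます"), Token("ん", "ぬ")]
merged = combine_naive(parts)
print(merged.surface)  # 落とせません
print(merged.lemma)    # 落とせる — wrong; the lemma should be 落とす
```

The fix implied in the thread is to run lemmatization (and reading generation) again on the merged surface form, rather than reusing the parts' lemmas.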
I may not have time to address them quickly, or it may turn out to be unsolvable, but please do! I want to know about all the problems. If you would like to experiment with it in the meantime, the post processing is in the

Thank you for the detailed bug report again!
OK, thanks for your consideration. I'll keep reporting the issues I find. I appreciate you pointing out where the code is. Sadly, I've got a lot of projects piled up right now, so it'll be 4-5 months before I can do anything.
That's okay. I wasn't expecting you to fix it yourself; I just added it since I remember you knowing a lot about Laravel, in case you wanted to experiment with it yourself.
Good news. I checked the issue, and it is my post processing method, so it is fixable. Text:
Tokenized text:
I combined 落とせ | ませ | ん without recalculating its lemma and reading. It's the same problem as #120. I will probably fix it in v0.13 or v0.14.
Well, I spent my day on it, but unfortunately it was a failure. I was able to generate the correct lemma from 落とせません. It gets split like this: 落とせ | ませ | ん. If I take the first word 落とせ and run lemmatization again, it works and returns 落とす. But there are many cases where this method creates incorrect lemmas. Here are a few examples. The format is
I think I won't be able to fix this. :(
I've done next to no research on this topic, but at a glance I don't think it's possible with a simple algorithm like this. Pretty much every solution out there seems to use a dictionary-based approach. The lowest-effort solution, as so many times before, is again using spaCy in the Python container. Even spaCy can't do Japanese lemmatization without the help of the third-party SudachiPy library.
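One way to picture the dictionary-based approach mentioned above (the table and the `lookup_lemma` helper below are a toy illustration, not Sudachi's actual API): look up the whole conjugated form at once instead of re-lemmatizing fragments.

```python
# Toy lemma table — a real system would consult a full morphological
# dictionary (e.g. SudachiDict) instead of a hand-made mapping.
LEMMA_TABLE = {
    "落とせません": "落とす",
    "落とせます": "落とす",
    "食べません": "食べる",
}

def lookup_lemma(surface: str):
    """Return the dictionary form for a whole conjugated surface form,
    or None when the form is not in the table."""
    return LEMMA_TABLE.get(surface)

print(lookup_lemma("落とせません"))  # 落とす
```

Because the lookup sees the full form 落とせません, it can map it straight to 落とす, sidestepping the fragment-by-fragment re-lemmatization that failed above.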
Hi
Should I continue reporting the Japanese lemmatization errors I find, or do you have enough to go on for now?
In the screenshot below you can see that おとせる is not the proper lemma for おとせません; it should be おとす.