Incorrect lemmatization with Japanese. #256

Open

etherealite opened this issue May 11, 2024 · 7 comments
Labels
bug Something isn't working

Comments

etherealite commented May 11, 2024

Hi

Should I continue reporting the Japanese lemmatization errors I find, or do you have enough to go on for now?

In the screenshot below, you can see that おとせる is not the proper lemma for おとせません; it should be おとす.

[screenshot: おとせません shown with the lemma おとせる]


etherealite commented May 11, 2024

My database probably has hundreds of improper lemmas in it by now. I don't really mind, but it would be nice to be able to spin through them once this gets fixed and apply proper lemmas where possible.

It would probably require keeping a version of the current, incorrect lemmatization algorithm and checking the database against it before issuing a correction. That way, overwriting the manually entered lemmas could be avoided.
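
A minimal sketch of that guard, assuming hypothetical old_lemmatize/new_lemmatize functions and a bare-bones word record; none of these names come from the LinguaCafe codebase:

from dataclasses import dataclass

@dataclass
class Word:
    text: str
    lemma: str

def correct_lemmas(words, old_lemmatize, new_lemmatize):
    # old_lemmatize reproduces the current buggy algorithm,
    # new_lemmatize is the fixed one.
    for word in words:
        # If the stored lemma differs from what the old algorithm
        # produces, it was edited by hand: leave it untouched.
        if word.lemma == old_lemmatize(word.text):
            word.lemma = new_lemmatize(word.text)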


simjanos-dev commented May 11, 2024

Hi!

> but it would be nice to be able to spin through them once this gets fixed and apply proper lemmas where possible.

There is a planned import option that would allow you to overwrite or fill in empty readings and lemmas.

> It would probably require keeping a version of the current, incorrect lemmatization algorithm and checking the database against it before issuing a correction. That way, overwriting the manually entered lemmas could be avoided.

That's actually a really smart idea!

Sadly, I think this issue lies with the tokenizer itself. But I'll check; maybe they are two words that I combined after the tokenizer step, in which case I can and will fix it. The same issue is present with readings for combined words.

If this is due to the post-processing and combining of words, it might take a while to address, because the post-processing will have to be moved from PHP to Python, so it can access the lemmatizer and reading generation quickly to correct those words after combining them.

> Should I continue reporting the Japanese lemmatization errors I find, or do you have enough to go on for now?

I may not have time to address them quickly, or it may turn out to be an unsolvable issue, but please do! I want to know about all the problems.

If you would like to experiment with it in the meantime, the post-processing is in the processTokenizedWords function of app/Models/TextBlock.php.

Thank you for the detailed bug report again!

@simjanos-dev simjanos-dev added the bug Something isn't working label May 11, 2024
@simjanos-dev simjanos-dev changed the title Found another lemmatization error with Japanese - should I continue reporting? Incorrect lemmatization with Japanese. May 11, 2024
@etherealite

OK, thanks for your consideration. I'll keep reporting the issues I find.

I appreciate you pointing out where the code is. Sadly, I've got a lot of projects piled up right now, so it'll be 4-5 months before I can do anything.

@simjanos-dev

> Sadly, I've got a lot of projects piled up right now, so it'll be 4-5 months before I can do anything.

That's okay. I wasn't expecting you to fix it yourself; I just added it since I remember you knowing a lot about Laravel, in case you wanted to experiment with it yourself.

@simjanos-dev

Good news: I checked the issue, and it is my post-processing method, so it is fixable.

Text:

落とす

落とせません

Tokenized text:

array(6) {
  [0]=>
  object(stdClass)#1441 (7) {
    ["w"]=>
    string(9) "落とす"
    ["r"]=>
    string(9) "おとす"
    ["l"]=>
    string(9) "落とす"
    ["lr"]=>
    string(9) "おとす"
    ["pos"]=>
    string(4) "VERB"
    ["si"]=>
    int(0)
    ["g"]=>
    string(0) ""
  }
  [1]=>
  object(stdClass)#1463 (7) {
    ["w"]=>
    string(7) "NEWLINE"
    ["r"]=>
    string(7) "NEWLINE"
    ["l"]=>
    string(7) "newline"
    ["lr"]=>
    string(7) "newline"
    ["pos"]=>
    string(5) "PROPN"
    ["si"]=>
    int(0)
    ["g"]=>
    string(0) ""
  }
  [2]=>
  object(stdClass)#1479 (7) {
    ["w"]=>
    string(7) "NEWLINE"
    ["r"]=>
    string(7) "NEWLINE"
    ["l"]=>
    string(7) "newline"
    ["lr"]=>
    string(7) "newline"
    ["pos"]=>
    string(4) "NOUN"
    ["si"]=>
    int(1)
    ["g"]=>
    string(0) ""
  }
  [3]=>
  object(stdClass)#1474 (7) {
    ["w"]=>
    string(9) "落とせ"
    ["r"]=>
    string(9) "おとせ"
    ["l"]=>
    string(12) "落とせる"
    ["lr"]=>
    string(12) "おとせる"
    ["pos"]=>
    string(4) "VERB"
    ["si"]=>
    int(2)
    ["g"]=>
    string(0) ""
  }
  [4]=>
  object(stdClass)#1457 (7) {
    ["w"]=>
    string(6) "ませ"
    ["r"]=>
    string(6) "ませ"
    ["l"]=>
    string(6) "ます"
    ["lr"]=>
    string(6) "ます"
    ["pos"]=>
    string(3) "AUX"
    ["si"]=>
    int(2)
    ["g"]=>
    string(0) ""
  }
  [5]=>
  object(stdClass)#1456 (7) {
    ["w"]=>
    string(3) "ん"
    ["r"]=>
    string(3) "ん"
    ["l"]=>
    string(3) "ぬ"
    ["lr"]=>
    string(3) "ぬ"
    ["pos"]=>
    string(3) "AUX"
    ["si"]=>
    int(2)
    ["g"]=>
    string(0) ""
  }
}

I combined 落とせ | ませ | ん without recalculating its lemma and reading.

It's the same problem as #120. I will probably fix it in v0.13 or v0.14.
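
For illustration, a hypothetical sketch of that combining step (not the actual processTokenizedWords code): the surface form and reading are concatenated, but the lemma fields are carried over from the first sub-token only.

def combine_tokens(tokens):
    # Surfaces and readings are concatenated across sub-tokens...
    return {
        "w": "".join(t["w"] for t in tokens),  # 落とせ + ませ + ん -> 落とせません
        "r": "".join(t["r"] for t in tokens),  # おとせ + ませ + ん -> おとせません
        # ...but the lemma and lemma reading stay those of the first
        # sub-token, so the combined word keeps 落とせる / おとせる.
        "l": tokens[0]["l"],
        "lr": tokens[0]["lr"],
    }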


simjanos-dev commented May 15, 2024

Well, I spent my day on it, but unfortunately it was a failure overall. I was able to generate the correct lemma for 落とせません itself.

It gets split like this: 落とせ | ませ | ん. If I take the first word, 落とせ, and run lemmatization on it again, it works and returns 落とす. But there are many cases where this method creates incorrect lemmas.
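
A sketch of what that retry presumably looks like (assumed, not the actual implementation; the ja_core_news_sm model is an assumption as well):

import spacy

nlp = spacy.load("ja_core_news_sm")  # Japanese pipeline backed by SudachiPy

def relemmatize_first(first_subtoken: str) -> str:
    # Run the lemmatizer again on just the first sub-token of a
    # combined word, in isolation from the rest of the sentence.
    return nlp(first_subtoken)[0].lemma_

Per the log below, this works for 落とせ -> 落とす, but goes wrong for truncated stems such as い (from います) or なっ (from なって).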

Here are a few examples. The format is Word: old lemma -> new lemma.

2024-05-15 16:15:45 new lemma generated.  なりました :  なる  ->  なり
2024-05-15 16:15:49 new lemma generated.  聞いて :  聞く  ->  聞く
2024-05-15 16:15:53 new lemma generated.  かかって :  かかる  ->  かかる
2024-05-15 16:15:54 new lemma generated.  いました :  いる  ->  い
2024-05-15 16:16:12 new lemma generated.  います :  いる  ->  い
2024-05-15 16:16:16 new lemma generated.  手伝って :  手伝う  ->  手伝う
2024-05-15 16:16:22 new lemma generated.  吸って :  吸う  ->  吸う
2024-05-15 16:16:24 new lemma generated.  言って :  言う  ->  言う
2024-05-15 16:16:25 new lemma generated.  いました :  いる  ->  い
2024-05-15 16:16:34 new lemma generated.  作って :  作る  ->  作る
2024-05-15 16:16:35 new lemma generated.  いて :  いる  ->  い
2024-05-15 16:16:37 new lemma generated.  なって :  なる  ->  なっ

I think I won't be able to fix this. :(


etherealite commented May 20, 2024

I've done next to no research on this topic, but at a glance I don't think it's possible to do this with a simple algorithm like that. Pretty much every solution out there seems to use a dictionary-based approach.

The lowest-effort solution, like so many times before, is again using spaCy in the Python container. Even spaCy can't do the lemmatization without the help of the third-party Sudachi library.
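
For reference, a minimal example of that dictionary-backed route, assuming the ja_core_news_sm model (spaCy's Japanese pipelines tokenize and lemmatize via SudachiPy):

import spacy

# Requires: pip install sudachipy sudachidict_core
#           python -m spacy download ja_core_news_sm
nlp = spacy.load("ja_core_news_sm")

for token in nlp("落とせません"):
    # Print each sub-token with its dictionary-derived lemma and POS;
    # compare with the tokenized dump earlier in the thread.
    print(token.text, token.lemma_, token.pos_)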
