Incorrect lemmatization with Japanese. #256

Open

etherealite opened this issue May 11, 2024 · 7 comments
Labels
bug Something isn't working

Comments

etherealite commented May 11, 2024

Hi

Should I continue reporting the Japanese lemmatization errors I find, or do you have enough to go on for now?

In the screenshot below, you can see that おとせる is not the proper lemma for おとせません; it should be おとす.

[screenshot: おとせません shown with the lemma おとせる]


etherealite commented May 11, 2024

My database probably has hundreds of improper lemmas in it by now. I don't really mind, but it would be nice to be able to spin through them once this gets fixed and apply proper lemmas where possible.

It would probably require keeping a version of the current, incorrect lemmatization algorithm and checking the database against it before issuing a correction. That way, overwriting the manually entered lemmas could be avoided.
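
A minimal sketch of that guard, assuming hypothetical old_lemmatize/new_lemmatize functions and a bare-bones word record; none of these names come from the LinguaCafe codebase:

from dataclasses import dataclass

@dataclass
class Word:
    text: str
    lemma: str

def correct_lemmas(words, old_lemmatize, new_lemmatize):
    # old_lemmatize reproduces the current buggy algorithm,
    # new_lemmatize is the fixed one.
    for word in words:
        # If the stored lemma differs from what the old algorithm
        # produces, it was edited by hand: leave it untouched.
        if word.lemma == old_lemmatize(word.text):
            word.lemma = new_lemmatize(word.text)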


simjanos-dev commented May 11, 2024

Hi!

> but it would be nice to be able to spin through them once this gets fixed and apply proper lemmas where possible.

There is a planned import option that would allow you to overwrite or fill in empty readings and lemmas.

> It would probably require keeping a version of the current, incorrect lemmatization algorithm and checking the database against it before issuing a correction. That way, overwriting the manually entered lemmas could be avoided.

That's actually a really smart idea!

Sadly, I think this issue lies with the tokenizer itself. But I'll check; maybe they are two words that I combined after the tokenizer step, in which case I can and will fix it. The same issue is present with readings for combined words.

If this is due to the post-processing and combining of words, it might take a while to address, because the post-processing will have to be moved from PHP to Python, so it can access the lemmatizer and reading generation quickly to correct those words after combining them.

> Should I continue reporting the Japanese lemmatization errors I find, or do you have enough to go on for now?

I may not have time to address them quickly, or it may turn out to be an unsolvable issue, but please do! I want to know about all the problems.

If you would like to experiment with it in the meantime, the post-processing is in the processTokenizedWords function of app/Models/TextBlock.php.

Thank you for the detailed bug report again!

@simjanos-dev simjanos-dev added the bug Something isn't working label May 11, 2024
@simjanos-dev simjanos-dev changed the title Found another lemmatization error with Japanese - should I continue reporting? Incorrect lemmatization with Japanese. May 11, 2024
@etherealite

OK, thanks for your consideration. I'll keep reporting the issues I find.

I appreciate you pointing out where the code is. Sadly, I've got a lot of projects piled up right now, so it'll be 4-5 months before I can do anything.

@simjanos-dev

> Sadly, I've got a lot of projects piled up right now, so it'll be 4-5 months before I can do anything.

That's okay. I wasn't expecting you to fix it yourself; I just added it since I remember you knowing a lot about Laravel, in case you wanted to experiment with it yourself.

@simjanos-dev

Good news: I checked the issue, and it is my post-processing method, so it is fixable.

Text:

落とす

落とせません

Tokenized text:

array(6) {
  [0]=>
  object(stdClass)#1441 (7) {
    ["w"]=>
    string(9) "落とす"
    ["r"]=>
    string(9) "おとす"
    ["l"]=>
    string(9) "落とす"
    ["lr"]=>
    string(9) "おとす"
    ["pos"]=>
    string(4) "VERB"
    ["si"]=>
    int(0)
    ["g"]=>
    string(0) ""
  }
  [1]=>
  object(stdClass)#1463 (7) {
    ["w"]=>
    string(7) "NEWLINE"
    ["r"]=>
    string(7) "NEWLINE"
    ["l"]=>
    string(7) "newline"
    ["lr"]=>
    string(7) "newline"
    ["pos"]=>
    string(5) "PROPN"
    ["si"]=>
    int(0)
    ["g"]=>
    string(0) ""
  }
  [2]=>
  object(stdClass)#1479 (7) {
    ["w"]=>
    string(7) "NEWLINE"
    ["r"]=>
    string(7) "NEWLINE"
    ["l"]=>
    string(7) "newline"
    ["lr"]=>
    string(7) "newline"
    ["pos"]=>
    string(4) "NOUN"
    ["si"]=>
    int(1)
    ["g"]=>
    string(0) ""
  }
  [3]=>
  object(stdClass)#1474 (7) {
    ["w"]=>
    string(9) "落とせ"
    ["r"]=>
    string(9) "おとせ"
    ["l"]=>
    string(12) "落とせる"
    ["lr"]=>
    string(12) "おとせる"
    ["pos"]=>
    string(4) "VERB"
    ["si"]=>
    int(2)
    ["g"]=>
    string(0) ""
  }
  [4]=>
  object(stdClass)#1457 (7) {
    ["w"]=>
    string(6) "ませ"
    ["r"]=>
    string(6) "ませ"
    ["l"]=>
    string(6) "ます"
    ["lr"]=>
    string(6) "ます"
    ["pos"]=>
    string(3) "AUX"
    ["si"]=>
    int(2)
    ["g"]=>
    string(0) ""
  }
  [5]=>
  object(stdClass)#1456 (7) {
    ["w"]=>
    string(3) "ん"
    ["r"]=>
    string(3) "ん"
    ["l"]=>
    string(3) "ぬ"
    ["lr"]=>
    string(3) "ぬ"
    ["pos"]=>
    string(3) "AUX"
    ["si"]=>
    int(2)
    ["g"]=>
    string(0) ""
  }
}

I combined 落とせ | ませ | ん without recalculating its lemma and reading.

It's the same problem as #120. I will probably fix it in v0.13 or v0.14.
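
For illustration, a hypothetical sketch of that combining step (not the actual processTokenizedWords code): the surface form and reading are concatenated, but the lemma fields are carried over from the first sub-token only.

def combine_tokens(tokens):
    # Surfaces and readings are concatenated across sub-tokens...
    return {
        "w": "".join(t["w"] for t in tokens),  # 落とせ + ませ + ん -> 落とせません
        "r": "".join(t["r"] for t in tokens),  # おとせ + ませ + ん -> おとせません
        # ...but the lemma and lemma reading stay those of the first
        # sub-token, so the combined word keeps 落とせる / おとせる.
        "l": tokens[0]["l"],
        "lr": tokens[0]["lr"],
    }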


simjanos-dev commented May 15, 2024

Well, I spent my day on it, but unfortunately it was a failure overall. I was able to generate the correct lemma for 落とせません itself.

It gets split like this: 落とせ | ませ | ん. If I take the first word, 落とせ, and run lemmatization on it again, it works and returns 落とす. But there are many cases where this method creates incorrect lemmas.
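
A sketch of what that retry presumably looks like (assumed, not the actual implementation; the ja_core_news_sm model is an assumption as well):

import spacy

nlp = spacy.load("ja_core_news_sm")  # Japanese pipeline backed by SudachiPy

def relemmatize_first(first_subtoken: str) -> str:
    # Run the lemmatizer again on just the first sub-token of a
    # combined word, in isolation from the rest of the sentence.
    return nlp(first_subtoken)[0].lemma_

Per the log below, this works for 落とせ -> 落とす, but goes wrong for truncated stems such as い (from います) or なっ (from なって).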

Here are a few examples. The format is Word: old lemma -> new lemma.

2024-05-15 16:15:45 new lemma generated.  なりました :  なる  ->  なり
2024-05-15 16:15:49 new lemma generated.  聞いて :  聞く  ->  聞く
2024-05-15 16:15:53 new lemma generated.  かかって :  かかる  ->  かかる
2024-05-15 16:15:54 new lemma generated.  いました :  いる  ->  い
2024-05-15 16:16:12 new lemma generated.  います :  いる  ->  い
2024-05-15 16:16:16 new lemma generated.  手伝って :  手伝う  ->  手伝う
2024-05-15 16:16:22 new lemma generated.  吸って :  吸う  ->  吸う
2024-05-15 16:16:24 new lemma generated.  言って :  言う  ->  言う
2024-05-15 16:16:25 new lemma generated.  いました :  いる  ->  い
2024-05-15 16:16:34 new lemma generated.  作って :  作る  ->  作る
2024-05-15 16:16:35 new lemma generated.  いて :  いる  ->  い
2024-05-15 16:16:37 new lemma generated.  なって :  なる  ->  なっ

I think I won't be able to fix this. :(


etherealite commented May 20, 2024

I've done next to no research on this topic, but at a glance I don't think it's possible to do this with a simple algorithm like that. Pretty much every solution out there seems to use a dictionary-based approach.

The lowest-effort solution, like so many times before, is again using spaCy in the Python container. Even spaCy can't do the lemmatization without the help of the third-party Sudachi library.
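
For reference, a minimal example of that dictionary-backed route, assuming the ja_core_news_sm model (spaCy's Japanese pipelines tokenize and lemmatize via SudachiPy):

import spacy

# Requires: pip install sudachipy sudachidict_core
#           python -m spacy download ja_core_news_sm
nlp = spacy.load("ja_core_news_sm")

for token in nlp("落とせません"):
    # Print each sub-token with its dictionary-derived lemma and POS;
    # compare with the tokenized dump earlier in the thread.
    print(token.text, token.lemma_, token.pos_)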
