Chinese edition extraction contains entries with no part-of-speech #471

tatuylonen · 2024-01-26T00:42:50Z

The Chinese edition extraction seems to contain 100000+ entries with no part-of-speech.

Also, it seems to contain translations with no language code.
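One rough way to reproduce these counts is sketched below. It assumes the extraction is a JSON Lines file with one entry per line, that the part of speech is stored under "pos", and that each item in a top-level "translations" list carries its language code under "lang_code". These field names are assumptions and may differ between editions or versions; `count_missing_fields` and the sample data are hypothetical, for illustration only.

```python
import json
from collections import Counter


def count_missing_fields(lines):
    """Count entries lacking "pos" and translations lacking a language code.

    Field names ("pos", "translations", "lang_code") are assumptions about
    the extracted JSON, not a confirmed schema.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        counts["entries"] += 1
        if not entry.get("pos"):
            counts["no_pos"] += 1
        for translation in entry.get("translations", []):
            if not translation.get("lang_code"):
                counts["translations_no_lang_code"] += 1
    return counts


# Made-up sample lines standing in for the real extraction file.
sample = [
    '{"word": "100均", "lang_code": "ja"}',
    '{"word": "paraphrase", "pos": "noun", '
    '"translations": [{"word": "x", "lang": "Foo"}]}',
]
print(count_missing_fields(sample))
```

For a real file, the same function can be given an open file object, since iterating over it yields one line at a time.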

xxyzz · 2024-01-26T05:30:46Z

I think most of them are "redirect" pages that only contain a template saying to see another word's page for details. Some examples: 100均, 別个, chāi-tòng-chāi-tōe. I could add a "redirects" field for these pages.

There are also some low-quality pages that only have a gloss sentence under the level 2 language title, for example: paraphrase, парафразировать.

xxyzz · 2024-01-30T04:18:12Z

Except for the "soft redirect" pages, I think most data that lacks a pos or a translation language code comes from low-quality pages that don't have a POS title, use translation language names that can't be converted to codes, or have Lua/wikitext errors.

tatuylonen · 2024-02-24T13:13:49Z

If the redirect pages are semantically the same as the "hard" redirects in the English edition, then I would suggest encoding them in the same way. But I assume there could be multiple redirect templates on these "soft" redirect pages.

Options include, e.g.:

- generate multiple "hard redirect" entries for the same word in the data
- define a new "pos" value for these, e.g., "soft-redirect"
- leave as is.

I would prefer to have a clear "pos" field in each entry. The existing "hard" redirects in English are currently an exception, but it might make sense to add a "hard-redirect" pos for them. I would vote for adding a "soft-redirect" pos value for the soft redirects, and "unknown" for the low-quality pages that have no part-of-speech indication.

Opinions?
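A minimal sketch of the proposal above, assuming each entry is a dict and that soft-redirect pages carry the "redirects" field suggested earlier in this thread. The helper `assign_fallback_pos` is hypothetical and is not the extractor's actual code; it only illustrates how every entry could end up with a "pos" value.

```python
def assign_fallback_pos(entry):
    """Return a copy of entry in which "pos" is always set.

    Assumption: soft-redirect pages carry a non-empty "redirects" list.
    """
    result = dict(entry)  # shallow copy; the input dict is not modified
    if result.get("pos"):
        return result  # already has a part of speech
    if result.get("redirects"):
        result["pos"] = "soft-redirect"
    else:
        result["pos"] = "unknown"
    return result


# Made-up entries for illustration.
print(assign_fallback_pos({"word": "100均", "redirects": ["百均"]})["pos"])
print(assign_fallback_pos({"word": "paraphrase"})["pos"])
```

With this shape, consumers can filter on "pos" alone instead of testing whether the field exists.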

xxyzz · 2024-03-05T09:32:04Z

Sorry, I missed your comment. I have created a draft PR, #531, to change the pos fields.

This was referenced Jan 26, 2024

Handle zh edition soft redirect templates and gloss sentence only pages #474

Merged

Improve zh edition translation code #476

Merged

Improve zh edition translation code #479

Merged
