Chinese edition extraction contains entries with no part-of-speech #471

Open
tatuylonen opened this issue Jan 26, 2024 · 4 comments

Comments

@tatuylonen
Owner

The Chinese edition extraction seems to contain 100000+ entries with no part-of-speech.

Also, it seems to contain translations with no language code.
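A rough way to reproduce these counts, as a minimal sketch: it assumes the extraction is JSON Lines (one entry per line), that the part-of-speech key is "pos", and that each item under "translations" carries its language code under "lang_code". The key names and the sample entries are assumptions for illustration and may not match the actual output.

```python
import json
from typing import Iterable


def count_missing(lines: Iterable[str]) -> tuple[int, int]:
    """Return (entries without "pos", translations without a language code)."""
    no_pos = 0
    no_lang_code = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        # A missing key and an empty string both count as "no part-of-speech".
        if not entry.get("pos"):
            no_pos += 1
        for tr in entry.get("translations", []):
            # "lang_code" is an assumed key name, not confirmed by the thread.
            if not tr.get("lang_code"):
                no_lang_code += 1
    return no_pos, no_lang_code


# Hypothetical sample entries, not real extraction output.
sample = [
    '{"word": "example-a"}',
    '{"word": "example-b", "pos": "noun", "translations": '
    '[{"word": "x"}, {"word": "y", "lang_code": "fr"}]}',
]
print(count_missing(sample))  # (1, 1)
```

For a real dump, the same function can be fed an open file object instead of the sample list.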

@xxyzz
Collaborator

xxyzz commented Jan 26, 2024

I think most of them are "redirect" pages that contain only a template saying to see another word's page for details. Some examples: 100均, 別个, chāi-tòng-chāi-tōe. I could add a "redirects" field for these pages.

Some low-quality pages also have only a gloss sentence under the level-2 language title: paraphrase, парафразировать.

@xxyzz
Collaborator

xxyzz commented Jan 30, 2024

Except for the "soft redirect" pages, I think most data that lacks a POS or a translation language code comes from low-quality pages that have no POS title, use translation language names that can't be converted to codes, or have Lua/wikitext errors.

@tatuylonen
Owner Author

If the redirect pages are semantically the same as the "hard" redirects in the English edition, then I would suggest encoding them in the same way. But I assume there could be multiple redirect templates on these "soft" redirect pages.

Options include, e.g.:

  • generate multiple "hard redirect" entries for the same word in the data
  • define a new "pos" value for these, e.g., "soft-redirect"
  • leave as is.

I would prefer to have a clear "pos" field in each entry. The existing "hard" redirects in English are currently an exception, but it might make sense to add a "hard-redirect" pos for them. I would vote for adding a "soft-redirect" pos value for the soft redirects, and "unknown" for the low-quality pages that have no part-of-speech indication.
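To make the proposal concrete, here is a minimal sketch of how a "pos" value could be filled in for every entry. It assumes soft-redirect pages carry the "redirects" field suggested earlier in this thread and that hard redirects carry a "redirect" key; both key names and the `assign_pos` helper are hypothetical and do not describe the current extractor.

```python
def assign_pos(entry: dict) -> dict:
    """Return a copy of the entry in which "pos" is always set."""
    result = dict(entry)
    if result.get("pos"):
        return result  # already has a part-of-speech, leave it alone
    if result.get("redirects"):
        # Assumed field: a list of target pages from soft-redirect templates.
        result["pos"] = "soft-redirect"
    elif result.get("redirect"):
        # Assumed field: the single target of a hard redirect.
        result["pos"] = "hard-redirect"
    else:
        # Low-quality page with no part-of-speech indication.
        result["pos"] = "unknown"
    return result


# Hypothetical entries for illustration only.
print(assign_pos({"word": "a", "redirects": ["b", "c"]})["pos"])  # soft-redirect
print(assign_pos({"title": "d", "redirect": "e"})["pos"])         # hard-redirect
print(assign_pos({"word": "f"})["pos"])                           # unknown
```

With something like this, a page with several redirect templates stays a single entry, and consumers can filter on "pos" alone.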

Opinions?

@xxyzz
Collaborator

xxyzz commented Mar 5, 2024

Sorry, I missed your comment. I have created a draft PR #531 to change the pos fields.
