Chinese edition extraction contains entries with no part-of-speech #471
Comments
I think most of them are "redirect" pages that only contain a template saying to see another word's page for details. Some examples: 100均, 別个, chāi-tòng-chāi-tōe. I could add a "redirects" field for these pages. Some low-quality pages also have only a gloss sentence under the level-2 language heading: paraphrase, парафразировать.
Except for the "soft redirect" pages, I think most data that lacks a pos or a translation language code comes from low-quality pages that don't have a POS title, use translation language names that can't be converted to codes, or have Lua/wikitext errors.
If the redirect pages are semantically the same as the "hard" redirects in the English edition, then I would suggest encoding them in the same way. But I assume there could be multiple redirect templates with these "soft" redirect pages. Opinions include e.g.:
I would prefer to have a clear "pos" field in each entry. The existing "hard" redirects in English are currently an exception, but it might make sense to add a "hard-redirect" pos for them. I would vote for adding a "soft-redirect" pos value for the soft redirects, and "unknown" for the low-quality pages that have no part-of-speech indication. Opinions?
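To make the proposal concrete, here is a minimal sketch of how a pos value could be chosen so that no entry is left without one. It is not the extractor's actual code: the function name `assign_pos` and the `redirect` / `redirects` keys used to detect hard and soft redirects are assumptions made for illustration only.

```python
from typing import Any


def assign_pos(entry: dict[str, Any]) -> str:
    """Pick a pos value for an entry so the field is never empty.

    Assumed (hypothetical) input keys:
      "pos"       - part of speech found on the page, if any
      "redirect"  - target of a "hard" redirect, if any
      "redirects" - targets of "soft" redirect templates, if any
    """
    if entry.get("pos"):
        return entry["pos"]
    if entry.get("redirect"):
        return "hard-redirect"
    if entry.get("redirects"):
        return "soft-redirect"
    # Low-quality pages with no part-of-speech indication at all.
    return "unknown"


if __name__ == "__main__":
    # "some-target" is a placeholder, not a real redirect target.
    print(assign_pos({"word": "100均", "redirects": ["some-target"]}))
    print(assign_pos({"word": "paraphrase"}))
```

With this shape, consumers can filter on a single field instead of checking for a missing key.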
Sorry, I missed your comment. I have created a draft PR #531 to change the pos fields.
The Chinese edition extraction seems to contain 100000+ entries with no part-of-speech.
Also, it seems to contain translations with no language code.
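For anyone who wants to reproduce these counts, a rough sketch along the following lines could be run over the extracted JSON Lines data. It assumes one JSON object per line, a `pos` key on each entry, and a `translations` list whose items carry a `lang_code` key; those key names and the helper name `audit` are assumptions here and may differ from the actual output format.

```python
import json
from collections import Counter
from typing import Iterable


def audit(lines: Iterable[str]) -> Counter:
    """Count entries without a pos and translations without a language code.

    The key names "pos", "translations" and "lang_code" are assumptions
    about the extracted data, not a confirmed schema.
    """
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        entry = json.loads(line)
        counts["entries"] += 1
        if not entry.get("pos"):
            counts["no_pos"] += 1
        for translation in entry.get("translations", []):
            if not translation.get("lang_code"):
                counts["translations_no_lang_code"] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python audit.py path/to/extract.jsonl  (path is a placeholder)
    with open(sys.argv[1], encoding="utf-8") as f:
        for key, value in sorted(audit(f).items()):
            print(f"{key}: {value}")
```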