Other editions: tags #489
Comments
How about using the "non-en-tags" field? I don't think we have the manpower to translate these tags... And could you give some examples of which fields and pages have too much garbage data?
The issue is that all fields and all pages have unique tags. According to Tatu, doing the translation work is "only the work of a few days". I am also skeptical that doing the translation work would be easy... The issue with the French tags is that there are too many unique tags (a huge, huge number). "tags": [":"] is clear garbage data here. The problem is that tags should be tags, not plain text that is inserted as tag data; there should at least be the equivalent of A first step that would help is 'normalizing' the text, such as removing capitalization: "sens figuré" and "Sens figuré" are two unique tags, for example. Preferably they'd be something like "sens-figuré" or another, more general tag, which is in For now, the French kaikki wiktionary page just needs to have "tag-collection" turned off for the json-mapping portion of the generation, so it's not acute, but it's still an issue. (The English edition does have "generated" tags, for place names and such-like, so not everything is in
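To make the normalization idea above concrete, here is a minimal sketch, not code from wiktextract. The helper names `normalize_tag` and `clean_tags` are hypothetical, and the rule "drop anything without a letter or digit" is only an assumption about what counts as garbage (it catches `":"`).

```python
import unicodedata


def normalize_tag(raw: str) -> str:
    """Hypothetical helper: trim, lowercase, and hyphenate a raw tag string."""
    text = unicodedata.normalize("NFC", raw).strip().lower()
    # Collapse any run of whitespace into a single hyphen.
    return "-".join(text.split())


def clean_tags(raw_tags: list[str]) -> list[str]:
    """Normalize tags, drop punctuation-only entries, and de-duplicate."""
    seen: set[str] = set()
    result: list[str] = []
    for raw in raw_tags:
        tag = normalize_tag(raw)
        # Assumption: a tag with no letters or digits (e.g. ":") is garbage.
        if not any(ch.isalnum() for ch in tag):
            continue
        if tag not in seen:
            seen.add(tag)
            result.append(tag)
    return result


print(clean_tags(["sens figuré", "Sens figuré", ":", "Plus rare"]))
# prints ['sens-figuré', 'plus-rare']
```

This only reduces the number of unique strings; it does not map anything onto the shared English tag vocabulary.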
I seriously doubt the translation or transform code would be maintainable... We already have some long data files, and even we rarely read them. I'm against translating non-English Wiktionary data to English because we can't make a standard that works for all editions. I'm also against normalizing the tags; we shouldn't change the data. For example, there is a "Plus rare" tag on the page https://fr.wiktionary.org/wiki/autrice#Nom_commun, so why change it to a form that doesn't exist in the French Wiktionary? I think most users would compare the extracted data with the original Wiktionary page, not the English Wiktionary. We should at least keep the original form. I'll try to find these garbage tags and remove them.
How about only putting the shared tags created in the English extractor like
These aren't really "tags" as we consider them (pre-determined or formalized strings of data), because currently it's just text and notes. It also mixes in what should be in "topics" (non-linguistic semantic tags referring to the world at large). "raw_tags", "original_notes", or something similar could be a better name.
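One way to picture the "raw_tags" suggestion is the sketch below. It is an illustration only: `SHARED_TAGS` and `split_tags` are hypothetical names, the single mapping entry is an example rather than an agreed translation, and the sense dictionary is simplified.

```python
# Hypothetical lookup from lowercased edition-specific text to a shared tag.
SHARED_TAGS = {"sens figuré": "figuratively"}


def split_tags(sense: dict) -> dict:
    """Keep only shared tags in "tags"; move everything else to "raw_tags"."""
    out = dict(sense)
    tags: list[str] = []
    raw_tags: list[str] = []
    for text in sense.get("tags", []):
        shared = SHARED_TAGS.get(text.strip().lower())
        if shared is not None:
            tags.append(shared)
        else:
            # Unrecognized text is preserved exactly as found on the page.
            raw_tags.append(text)
    out["tags"] = tags
    out["raw_tags"] = raw_tags
    return out


print(split_tags({"glosses": ["..."], "tags": ["Sens figuré", "Plus rare"]}))
# prints {'glosses': ['...'], 'tags': ['figuratively'], 'raw_tags': ['Plus rare']}
```

With this split, "Plus rare" keeps its original French Wiktionary form while "tags" stays limited to a finite vocabulary.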
All non-English extractors are using
We've been working on getting websites for each of the current extractors working on kaikki.org, and came across an issue with the French edition: when generating the json mapping (see https://kaikki.org/dictionary/errors/mapping/index.html ) the whole process seemed to stall, get in a loop or just craaawl slowly. Thankfully, it was the last one, and the culprit was that the extracted French json tag data was... Prolific.
By which I mean, there was so much garbage tag data in the French .json that it stretched what would otherwise have taken minutes into many, many hours.
"tags" in wiktextract terms should really conform to the English tags used in the English wiktextract output. Otherwise there's really no point to them: they should be a finite, shared set of linguistic-terminology (+ other) tags, plus some generated on the fly based on heuristics, that is consistent and can be shared between projects. They're not really supposed to be just text found next to something, but should go through a formalization (and translation, if needed) process.
As it is currently, almost anything that is put into the 'tags' fields should be disabled for now.