Other editions: tags #489

Open
kristian-clausal opened this issue Feb 2, 2024 · 6 comments

Comments

@kristian-clausal
Collaborator

We've been working on getting websites for each of the current extractors working on kaikki.org, and came across an issue with the French edition: when generating the JSON mapping (see https://kaikki.org/dictionary/errors/mapping/index.html ), the whole process seemed to stall, get stuck in a loop, or just craaawl slowly. Thankfully, it was the last one, and the culprit was that the extracted French JSON tag data was... prolific.

By which I mean, there was so much garbage tag data in the French .json that a step which would otherwise have taken minutes took many, many hours instead.

"tags" in wiktextract terms should really conform to the tags used in the English wiktextract output. Otherwise there's really no point to them: the idea is a finite, shared set of linguistic terminology (+ other) tags, plus some generated on the fly based on heuristics, that is consistent and can be shared between projects. They're not supposed to be just text found next to something; they should go through a formalization (and translation, if needed) process.

As it is currently, almost anything that is put into the 'tags' fields should be disabled for now.

@xxyzz
Collaborator

xxyzz commented Feb 5, 2024

How about using a "non-en-tags" field? I don't think we have the manpower to translate these tags...

And could you give some examples of which fields and pages have too much garbage data?

@kristian-clausal
Collaborator Author

kristian-clausal commented Feb 5, 2024

The issue is that all fields and all pages have unique tags.

According to Tatu, doing the translation work is "only the work of a few days". I am also skeptical that doing the translation work would be easy...

The issue with the French tags is that there are too many unique tags (a huge, huge number).

https://kaikki.org/frwiktionary/Fran%C3%A7ais/meaning/b/bi/bille%20en%20t%C3%AAte.html#fr-bille_en_t%C3%AAte-fr-adv-wfEz~p2s

"tags": [":"] is clear garbage data here. The problem is that tags should be tags, not plain text that is inserted as tag data; there should at least be an equivalent of the valid_tags set from the English edition that verifies that the 'tag' being used is white-listed. Currently, everything is slipping through.
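A minimal sketch of that kind of white-list check, for illustration only. `VALID_TAGS` here is a hypothetical three-entry stand-in, not the real valid_tags set from the English extractor, and `split_tags` is a made-up helper name:

```python
# Hypothetical stand-in for the English edition's valid_tags set;
# the real set is far larger and lives in the English extractor.
VALID_TAGS = {"figuratively", "rare", "archaic"}


def split_tags(candidates):
    """Partition candidate strings into white-listed tags and rejected leftovers."""
    accepted, rejected = [], []
    for candidate in candidates:
        if candidate in VALID_TAGS:
            accepted.append(candidate)
        else:
            rejected.append(candidate)
    return accepted, rejected


accepted, rejected = split_tags([":", "rare"])
```

With a check like this in front of the 'tags' field, a stray ":" would end up in the rejected list instead of slipping through as a tag.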

A first step that would help is 'normalizing' the text, for example by removing capitalization: "sens figuré" and "Sens figuré" are currently two unique tags. Preferably they'd be something like "sens-figuré", or some other, more general tag that is in valid_tags.
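As a rough illustration of that first step, assuming normalization means nothing more than Unicode composition, lower-casing, and joining words with hyphens (`normalize_tag` is a hypothetical helper, not something in wiktextract):

```python
import unicodedata


def normalize_tag(text):
    """Collapse case and whitespace variants of a label into one form."""
    # NFC makes composed and decomposed accents (e.g. "é") compare equal.
    text = unicodedata.normalize("NFC", text).strip().lower()
    # Split on any run of whitespace and rejoin with hyphens.
    return "-".join(text.split())


variants = {normalize_tag(t) for t in ["sens figuré", "Sens figuré", " Sens  figuré "]}
```

All three spellings collapse into a single entry, so the number of unique values shrinks even before any white-listing is applied.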

For now, the French kaikki wiktionary page just needs to have "tag-collection" turned off for the json-mapping portion of the generation, so it's not acute, but it's still an issue.

(The English edition does have "generated" tags, for place names and suchlike, so not everything is in valid_tags, but there are rules for the generation that keep things in check.)

@xxyzz
Collaborator

xxyzz commented Feb 5, 2024

I seriously doubt the translation or transform code would be maintainable... We already have some long data files, and even we rarely read them. I'm against translating non-English Wiktionary data to English because we can't make a standard that works for all editions. I'm also against normalizing the tags; we shouldn't change the data.

For example, there is a "Plus rare" tag on the page https://fr.wiktionary.org/wiki/autrice#Nom_commun; why change it to a form that doesn't exist in the French Wiktionary? I think most users would compare the extracted data with the original Wiktionary page, not the English Wiktionary. We should at least keep the original form.

I'll try to find these garbage tags and remove them.

@xxyzz
Collaborator

xxyzz commented Feb 26, 2024

How about putting only the shared tags created in the English extractor, like form-of tags, in the tags field, and putting the original untranslated tags in a tags_np (nonportable) field?

@kristian-clausal
Collaborator Author

These aren't really "tags" as we consider them (pre-determined or formalized strings of data), because currently it's just text and notes. It also mixes in what should be in "topics" (non-linguistic semantic tags referring to the world at large). "raw_tags", or "original_notes" or something could be better.

@xxyzz
Collaborator

xxyzz commented Feb 27, 2024

All non-English extractors are using raw_tags now; values in the tags fields should be the same values used in the en extractor, unless I missed some lines.
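To make the tags / raw_tags split concrete, here is a small sketch of how a label could be routed into one field or the other. It is not the actual extractor code: `SHARED_TAGS` is a hypothetical placeholder subset of the English tag set, and `add_label` is an invented helper.

```python
# Hypothetical placeholder subset of the tags shared with the English extractor.
SHARED_TAGS = {"form-of", "plural", "feminine"}


def add_label(entry, label):
    """Store a shared tag in "tags" and anything else, unchanged, in "raw_tags"."""
    field = "tags" if label in SHARED_TAGS else "raw_tags"
    values = entry.setdefault(field, [])
    if label not in values:  # avoid duplicate entries
        values.append(label)
    return entry


entry = {}
add_label(entry, "form-of")
add_label(entry, "Plus rare")
add_label(entry, "Plus rare")
```

The original French label is kept verbatim in raw_tags, so it can still be compared against the source Wiktionary page, while the tags field only ever holds values from the shared set.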
