Other editions: tags #489

kristian-clausal · 2024-02-02T10:37:41Z

We've been working on getting websites for each of the current extractors working on kaikki.org, and came across an issue with the French edition: when generating the json mapping (see https://kaikki.org/dictionary/errors/mapping/index.html ) the whole process seemed to stall, get stuck in a loop, or just craaawl slowly. Thankfully, it was the last one, and the culprit was that the extracted French json tag data was... Prolific.

By which I mean, there was so much garbage tag data in the French .json that it turned what would otherwise have taken minutes into a job of many, many hours.

"tags" in wiktextract terms should really conform to the English tags used in the English wiktextract output. Otherwise there's really no point to them: they are meant to be a finite, shared set of linguistic-terminology (and other) tags, plus some generated on the fly based on heuristics, that is consistent and can be shared between projects. They're not really supposed to be just text found next to something, but should go through a formalization (and translation, if needed) process.

As it is currently, almost anything that is put into the 'tags' fields should be disabled for now.

xxyzz · 2024-02-05T03:57:32Z

How about using a "non-en-tags" field? I don't think we have the manpower to translate these tags...

And could you give some examples of which fields and pages have too much garbage data?

kristian-clausal · 2024-02-05T07:16:07Z

The issue is that all fields and all pages have unique tags.

According to Tatu, doing the translation work is "only the work of a few days". I am also skeptical that doing the translation work would be easy...

The issue with the French tags is that there are too many unique tags (a huge, huge number).

https://kaikki.org/frwiktionary/Fran%C3%A7ais/meaning/b/bi/bille%20en%20t%C3%AAte.html#fr-bille_en_t%C3%AAte-fr-adv-wfEz~p2s

"tags": [":"] is clear garbage data here. The problem is that tags should be tags, not plain-text that is inserted as tag data; there should at least be the equivalent of valid_tags set from the English edition that verifies that the 'tag' that is used is white-listed. Currently, everything is slipping through.

A first step that would help is 'normalizing' the text, for example by removing capitalization: "sens figuré" and "Sens figuré" are currently two unique tags. Preferably they'd be something like "sens-figuré", or another, more general tag that is in valid_tags.
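
The normalization step could look roughly like the sketch below. `normalize_tag` is a hypothetical helper, not a function from wiktextract, and it only handles case and whitespace; mapping the result to a more general whitelisted tag is not shown.

```python
import unicodedata
from collections import Counter


def normalize_tag(text):
    """Return a case- and whitespace-normalized form of a tag string."""
    # Use one Unicode form so visually identical accents compare equal.
    text = unicodedata.normalize("NFC", text)
    # casefold() removes case differences; split/join collapses whitespace.
    return " ".join(text.casefold().split())


# Three spellings that currently count as three unique tags.
raw = ["sens figuré", "Sens figuré", "  Sens  figuré "]
counts = Counter(normalize_tag(t) for t in raw)

# After normalization all three spellings collapse into a single key.
print(len(counts))
```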

For now, the French kaikki wiktionary page just needs to have "tag-collection" turned off for the json-mapping portion of the generation, so it's not acute, but it's still an issue.

(The English edition does have "generated" tags, for place names and such-like, so not everything is in valid_tags, but there are rules for the generation which keeps things in check.)
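
A minimal sketch of the whitelist check described in this comment, assuming a small hypothetical `VALID_TAGS` set. The real valid_tags data in the English extractor is much larger and is not reproduced here, and the rules for generated tags are left out.

```python
# Hypothetical whitelist, for illustration only.
VALID_TAGS = {"figuratively", "rare", "feminine", "masculine"}


def filter_tags(tags):
    """Split tags into (accepted, rejected) according to the whitelist."""
    accepted = [t for t in tags if t in VALID_TAGS]
    rejected = [t for t in tags if t not in VALID_TAGS]
    return accepted, rejected


accepted, rejected = filter_tags([":", "rare", "Sens figuré"])
print(accepted)  # only the whitelisted tag survives
print(rejected)  # punctuation and free text are caught here
```

Logging the rejected list instead of silently dropping it would make it easier to see which strings are slipping through.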

xxyzz · 2024-02-05T08:16:14Z

I seriously doubt the translation or transformation code would be maintainable... We already have some long data files, and even we rarely read them. I'm against translating non-English Wiktionary data to English because we can't make a standard that works for all editions. I'm also against normalizing the tags; we shouldn't change the data.

For example, there is a "Plus rare" tag on the page https://fr.wiktionary.org/wiki/autrice#Nom_commun, so why change it to a form that doesn't exist in the French Wiktionary? I think most users would compare the extracted data with the original Wiktionary page, not with the English Wiktionary. We should at least keep the original form.

I'll try to find these garbage tags and remove them.

xxyzz · 2024-02-26T02:04:38Z

How about putting only the shared tags created in the English extractor, like form-of tags, in the tags field, and putting the original untranslated tags in a tags_np (nonportable) field?

kristian-clausal · 2024-02-26T10:23:29Z

These aren't really "tags" as we consider them (pre-determined or formalized strings of data), because currently it's just text and notes. It also mixes in what should be in "topics" (non-linguistic semantic tags referring to the world at large). "raw_tags", or "original_notes" or something could be better.

xxyzz · 2024-02-27T08:43:06Z

All non-English extractors are using raw_tags now. Values in the tags fields should be the same values used in the en extractor, unless I missed some lines.
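
The split between the two fields can be sketched as follows. The field names `tags` and `raw_tags` come from this thread; the `FR_TO_EN_TAGS` mapping, its entries, and the `split_tags` function are hypothetical and only illustrate the idea, they are not the extractor's actual code.

```python
# Hypothetical mapping from French labels to shared English tags.
FR_TO_EN_TAGS = {
    "sens figuré": "figuratively",
    "plus rare": "rare",
}


def split_tags(labels):
    """Put known labels in 'tags' (English) and the rest in 'raw_tags'."""
    data = {"tags": [], "raw_tags": []}
    for label in labels:
        english = FR_TO_EN_TAGS.get(label.lower())
        if english is not None:
            data["tags"].append(english)
        else:
            # Unknown labels are kept verbatim, in their original form.
            data["raw_tags"].append(label)
    return data


print(split_tags(["Plus rare", "Québec"]))
```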

This was referenced Feb 5, 2024

Update fr edition linkage and pronounciation tags code #493

Merged

Reduce incorrect linkage tags in fr edition #495

Merged

Update fr edition extractor #496

Merged

Remove some fr edition tags #498

Merged

xxyzz mentioned this issue Feb 26, 2024

French edition "tags" fields are lists - should be space-separated strings #515

Closed
