Post-processing the inflections #161

Open
Vuizur opened this issue Sep 16, 2022 · 3 comments

Comments

@Vuizur (Contributor) commented Sep 16, 2022

For many applications, for example the creation of lemmatization lists or dictionaries, it would be super useful to have a post-processed version of the inflections, or something like a function that can be called to do the post-processing. I was thinking of the following features:

  • Deleting the "inflections" that are really word information and not actual inflections. This can probably easily be done by dropping everything with the tags auxiliary, table-tags, class, ... (here it is only important not to miss anything, and end users might not realize that they have to do this; I, for example, didn't at first). See the sketch after this list.
  • Cleaning out empty inflections ("" or "-").
  • Getting a word list cleaned of multiword constructions. For example, German has many entries like "einen gönnerhaften", "den gönnerhaften", "die gönnerhaften", where the only difference is the article. If one does not need the grammatical information (like someone creating a lemmatization list or dictionary inflections), one could simply reduce all of these to "gönnerhaften". Possibly one could also (optionally) delete compound tense forms like "will eat" in cases where they are not needed.
  • Advanced, but very useful: adding inflections that have their own Wiktionary entries (with form_of linking to their base word) but do not appear in the inflection table. The primary use case I can think of is the compound forms in Romance languages, for example here. That way an end user could simply drop all form-of senses without losing anything. The only open question is a reasonably fast implementation; I have never used it, but one could maybe load the JSON into MongoDB and index the relevant data. The algorithm probably also has to be recursive to handle transitive inflections.
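
A minimal sketch of the first two steps, assuming the kaikki.org JSONL layout in which each entry carries a `forms` list of objects with `form` and `tags` keys (the tag set and file name below are illustrative, and the tag set is almost certainly incomplete):

```python
import json

# Tags marking rows that carry table metadata rather than real inflections;
# based on the tags named above, this set probably needs extending.
NON_INFLECTION_TAGS = {"auxiliary", "table-tags", "class"}

def clean_forms(entry):
    """Yield only the plausible inflected forms of one JSONL entry."""
    for form in entry.get("forms", []):
        word = form.get("form", "")
        if word in ("", "-"):  # empty or unattested placeholders
            continue
        if NON_INFLECTION_TAGS & set(form.get("tags", [])):
            continue  # metadata row, not an inflection
        yield word

# Hypothetical file name; kaikki.org serves one JSON object per line.
with open("kaikki.org-dictionary-German.json", encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)
        for word in clean_forms(entry):
            print(entry["word"], "->", word)
```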

The program might be out of scope for Wiktextract, but I think this is at least something that should be solved "centrally" rather than by everyone reimplementing something similar separately. So maybe someone has an idea of how/where best to solve this.

@fpw commented Sep 22, 2022

Hi,

I did a lot of work in that direction with the Latin portion of en.wiktionary, using the data on kaikki.org. I created a dictionary from it which also knows all inflected forms of a word and their relation to the lemma: for "gönnerhaften", for example, it would show that the base lemma is "gönnerhaft" and that the form could be the accusative singular or the nominative plural.

In the end, I'm creating a basic lemmatizer for Latin texts with this: Input a Latin text and it'll show you the possible base words and forms of each word in the text.
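
For illustration only (this is a Python sketch, not fpw's actual TypeScript implementation), the core data structure of such a lemmatizer can be as simple as a reverse index from inflected form to possible (lemma, tags) readings:

```python
from collections import defaultdict

# Reverse index: inflected form -> possible (lemma, grammatical tags) readings.
form_index = defaultdict(list)

def add_entry(lemma, forms):
    for form in forms:
        form_index[form["form"]].append((lemma, tuple(form.get("tags", []))))

def lemmatize(token):
    """Return all known readings of a token, or [] if it is unknown."""
    return form_index.get(token, [])

add_entry("gönnerhaft", [{"form": "gönnerhaften", "tags": ["accusative", "singular"]}])
print(lemmatize("gönnerhaften"))  # [('gönnerhaft', ('accusative', 'singular'))]
```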

My approach to processing the data from kaikki.org works like this:

  • drop all entries that use a headword template ending in "-form" or that use a "head-...-forms" template. This removes practically all inflected-form Wiktionary pages, because they always use these templates to link to their base word; as an example, it would drop the comérselo page that you linked above. (A sketch of this filter follows the list.)
  • drop a few special templates for new words derived from base words, e.g. participles
  • map the different part-of-speech variants onto a few basic ones
  • match the inflection templates with the head templates. This is the hard part: it processes pages that have multiple headwords and inflection tables and figures out which headword belongs to which table.
  • write out tuples containing (headword_template, inflection_template, senses)
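
A hedged sketch of that first filtering step, assuming the kaikki.org JSON exposes headword templates under a `head_templates` key with a `name` field; the exact matching rules here are guesses, not fpw's actual code:

```python
import json
import re

def is_form_entry(entry):
    """True if the entry looks like an inflected-form page, i.e. one of its
    headword templates ends in "-form" or matches "head-...-forms"."""
    for tmpl in entry.get("head_templates", []):
        name = tmpl.get("name", "")
        if name.endswith("-form") or re.fullmatch(r"head-.*-forms", name):
            return True
    return False

# Hypothetical file name; keep only the lemma entries.
with open("kaikki.org-dictionary-Latin.json", encoding="utf-8") as f:
    lemmas = [e for e in map(json.loads, f) if not is_form_entry(e)]
```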

This preprocessing leaves me with a short entry for each lemma, for example:

```json
{
  "id": "fabula/1",
  "lemma": "fabula",
  "partOfSpeech": "Noun",
  "heads": ["{{la-noun|fābula<1>}}"],
  "inflections": ["{{la-ndecl|fābula<1>}}"],
  "senses": [
    "discourse, narrative",
    "a fable, tale, story",
    "a poem, play",
    "concern, matter",
    "romance"
  ]
}
```

Initially, I used the original Lua scripts that power Wiktionary's inflection engine for Latin and ran them in a little sandbox so that they don't depend on all the other Wiktionary scripts. I built that sandbox with Fengari so that I can run the whole thing both in a browser and in a backend. That worked really well: I was able to derive all the inflections with those scripts, based only on the inflection template string.

But that turned out to be too slow: processing the entire Latin part of Wiktionary (10 MB of JSON output from the preprocessor, ~60k words) takes about 45 minutes that way.

So in the end, I re-implemented the inflection engine's Lua scripts in TypeScript here, using the output of the original scripts as test vectors to make sure my engine produces exactly the same output. That brought the processing time down to a minute, leaving me with a system that lets you read a Latin text in the browser while it supplies possible base words, form information and translations for every word.

I explained the process in detail to support the following claim: I don't think a general solution can work. The languages are very different, and the entire Wiktionary infrastructure is completely different for every language: each has its own inflection scripts with different parameters, and even their output doesn't follow a common format.

I think the most viable approach would be to create preprocessors for each language that convert the Wiktionary dump into a format that abstracts away the differences, and then to go on from there using the original Lua scripts as a base.

Hope that helps!

@Vuizur (Contributor, Author) commented Sep 24, 2022

I think the wiktextract maintainers have put a lot of work into the inflection table extraction, and the current state of the data is really good, especially considering that it is the first project to properly extract inflections for all languages. But running it probably takes quite a lot of resources.

My main idea for fixing entries like "die gönnerhaften" was a function that looks at all inflections consisting of more words than the base word and heuristically drops the extra words based on length and Levenshtein distance to the original word (and then to check how this approach works or doesn't work for each language). A rough sketch follows.
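
Here is one way that heuristic could look in Python (the textbook Levenshtein implementation and the tie-breaking rule are my own assumptions):

```python
def levenshtein(a, b):
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def reduce_multiword(form, headword):
    """Among the words of a multiword inflection, keep the one closest to
    the headword by edit distance (ties broken in favor of longer words)."""
    words = form.split()
    if len(words) <= 1:
        return form
    return min(words, key=lambda w: (levenshtein(w, headword), -len(w)))

print(reduce_multiword("die gönnerhaften", "gönnerhaft"))  # gönnerhaften
```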

@tatuylonen (Owner)

I would like to further improve the quality of the extracted inflections. However, the extraction of certain inflections as "-" is intentional - it gives potentially useful information indicating that the form does not exist/is not attested for the word or in the language. As for the multiword constructions, including them is certainly controversial. The goal is to mark them with the "multiword-construction" tag, so that they can easily be removed in applications that don't need them. However, they could be useful in, for example, applications where we try to automatically learn aspects of the grammar of the language. Thus I've chosen to include them.

Constructions that differ in just the article are more of a question mark. They should now be marked with "includes-article", and applications can easily skip such forms. However, having so many fairly trivial forms for thousands or tens of thousands of words increases the data size, which makes overall use of it more cumbersome.
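
With those tags in place, skipping such forms is cheap; a minimal sketch assuming the `forms`/`tags` layout of the kaikki.org JSON:

```python
SKIP_TAGS = {"multiword-construction", "includes-article"}

def wanted_forms(entry):
    """Yield the forms left after dropping multiword constructions and
    article-bearing variants."""
    for form in entry.get("forms", []):
        if SKIP_TAGS & set(form.get("tags", [])):
            continue
        yield form
```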
