de-conf plugin request #567

Open
hgiesel opened this issue Nov 14, 2023 · 4 comments

Comments

@hgiesel

hgiesel commented Nov 14, 2023

I've been trying to parse Wiktionary pages like this one.
However, wtf does not parse the template string correctly: it fails to read {{de-conj|ab.tun<irreg>}} and skips the <irreg> part.

Another template it fails to parse correctly is on this page.

This:

{{de-conj|[[sich]]<accpron>_[[auf]]_[[sein]]en<pron>_[[Lorbeer]]en_[[aus.ruhen]]}}

is turned by wtf into:

sich _auf_seinen _Lorbeeren_aus.ruhen
@spencermountain
Owner

spencermountain commented Nov 15, 2023

hey Henrik, this is a toughie.
de-conj results are actually generated by a script somewhere inside Wiktionary. It auto-creates the easier conjugations and lets users set exceptions.
Conjugating German verbs is beyond the scope of wtf_wikipedia, but could be a candidate for a plugin.
You can see we're generating conjugations at de-compromise, if that's what you're looking for.
cheers

@spencermountain changed the title from "Issues with wiktionary templates" to "de-conf plugin request" on Nov 15, 2023
@hgiesel
Author

hgiesel commented Nov 15, 2023

I don't mean that it should generate the actual results. I actually intend to do that myself.
I mean that if I parse the Wiktionary page with wtf, it seems like it drops some parts from the document.

After parsing the page, I want to have this text: ab.tun<irreg>. However, it mutilates it to this: ab.tun, and skips the <irreg> part.
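
One possible workaround in the meantime is to pull the raw template text out of the wikitext before any other processing touches it. The sketch below does that with a simple brace-depth scan. `extractTemplates` is a hypothetical helper, not part of wtf_wikipedia, and the naive split on `|` would misread arguments that contain piped links such as `[[a|b]]`.

```javascript
// Hypothetical helper (NOT part of wtf_wikipedia): collect the raw bodies of
// {{name|...}} templates by tracking "{{" / "}}" nesting, so that any <...>
// parts inside the template are left untouched.
function extractTemplates(wikitext, name) {
  const found = [];
  const stack = []; // start indexes of the currently open "{{"
  let i = 0;
  while (i < wikitext.length - 1) {
    const pair = wikitext.slice(i, i + 2);
    if (pair === "{{") {
      stack.push(i);
      i += 2;
    } else if (pair === "}}" && stack.length > 0) {
      const start = stack.pop();
      const body = wikitext.slice(start + 2, i);
      // The template name is everything before the first "|"
      const templateName = body.split("|")[0].trim();
      if (templateName === name) found.push(body);
      i += 2;
    } else {
      i += 1;
    }
  }
  return found;
}

const page = "==Conjugation==\n{{de-conj|ab.tun<irreg>}}\n";
const raw = extractTemplates(page, "de-conj");
// First positional argument of the first match (naive split, see note above)
const firstArg = raw[0].split("|")[1];
console.log(firstArg);
```

This only recovers the template source; everything else on the page would still go through the normal parser.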

[Image: Screenshot 2023-11-15 at 18 03 41]

@spencermountain
Owner

spencermountain commented Nov 15, 2023

ahh, ya. I see what you mean.
First, it sounds cool that you're reproducing the results. Please share back what you can.

Yeah, as you suspected, it's the angle brackets. The library deals with a lot of XML tags, which, by default, pass through.
This step also runs before the template parser.
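
To illustrate why the ordering matters, here is a toy sketch. `stripTags` is a stand-in for a tag-handling pass and is not the library's real code; `maskTemplateBrackets` and `unmask` are hypothetical helpers showing one way the angle brackets inside `{{...}}` might be shielded until the template parser runs.

```javascript
// Toy stand-in for an XML-handling pass. This is NOT wtf_wikipedia's real code.
function stripTags(text) {
  return text.replace(/<[^<>]*>/g, "");
}

// Placeholder characters that are assumed never to occur in real wikitext.
const LT = "\u0001";
const GT = "\u0002";

// Hypothetical workaround: swap "<" and ">" for placeholders, but only while
// we are inside at least one {{...}} template.
function maskTemplateBrackets(text) {
  let depth = 0;
  let out = "";
  for (let i = 0; i < text.length; i++) {
    if (text.startsWith("{{", i)) {
      depth++;
      out += "{{";
      i++; // skip the second brace
    } else if (text.startsWith("}}", i) && depth > 0) {
      depth--;
      out += "}}";
      i++;
    } else if (depth > 0 && text[i] === "<") {
      out += LT;
    } else if (depth > 0 && text[i] === ">") {
      out += GT;
    } else {
      out += text[i];
    }
  }
  return out;
}

// Put the real angle brackets back once tag stripping is done.
function unmask(text) {
  return text.split(LT).join("<").split(GT).join(">");
}

const input = "<small>note</small> {{de-conj|ab.tun<irreg>}}";
const naive = stripTags(input);
const guarded = unmask(stripTags(maskTemplateBrackets(input)));
console.log(naive);   // note {{de-conj|ab.tun}}
console.log(guarded); // note {{de-conj|ab.tun<irreg>}}
```

The trade-off is that genuine tags inside a template (a `<ref>` or `<br>`, say) would be shielded as well, so a real fix would likely need to be more selective.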

It would be easy to support <irreg>, but I'm just looking at the template doc, and I see things like this:

{{de-conj|schwimmen<schwamm:schwomm[archaic; used up through the 19th century],geschwommen,schwämme:schwömme[rare]>}}

So yikes, I didn't know about this syntax. I'm not sure how to do it, to be honest.
You may find there is a solution somewhere in the kill_xml file - but I can't think of one right now.
cheers
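
For reference, once the angle-bracket part survives, splitting it apart might look roughly like the sketch below. The grammar (comma-separated slots, colon-separated variants, an optional `[qualifier]` after a variant) is only inferred from the single `schwimmen` example above and is an assumption, not the template's documented syntax. `splitOutsideBrackets` and `parseConjArg` are hypothetical names, and multi-word arguments such as the `[[sich]]<accpron>_...` one are out of scope here.

```javascript
// Split `text` on the single character `sep`, ignoring separators that sit
// inside a [...] qualifier.
function splitOutsideBrackets(text, sep) {
  const parts = [];
  let depth = 0;
  let current = "";
  for (const ch of text) {
    if (ch === "[") depth++;
    if (ch === "]") depth = Math.max(0, depth - 1);
    if (ch === sep && depth === 0) {
      parts.push(current);
      current = "";
    } else {
      current += ch;
    }
  }
  parts.push(current);
  return parts;
}

// Parse "lemma<slot,slot,...>" where each slot is "form:form[qualifier]".
// ASSUMPTION: this grammar is guessed from one example in this thread.
// Returns null for inputs that do not fit the simple "lemma<...>" shape.
function parseConjArg(arg) {
  const match = /^([^<>]*)(?:<(.*)>)?$/.exec(arg);
  if (!match) return null;
  const lemma = match[1];
  const spec = match[2];
  if (spec === undefined) return { lemma, slots: [] };
  const slots = splitOutsideBrackets(spec, ",").map((slot) =>
    splitOutsideBrackets(slot, ":").map((variant) => {
      const m = /^([^\[]*)(?:\[(.*)\])?$/.exec(variant);
      if (!m) return { form: variant.trim(), qualifier: null };
      return { form: m[1].trim(), qualifier: m[2] === undefined ? null : m[2] };
    })
  );
  return { lemma, slots };
}

const parsed = parseConjArg(
  "schwimmen<schwamm:schwomm[archaic; used up through the 19th century],geschwommen,schwämme:schwömme[rare]>"
);
console.log(JSON.stringify(parsed, null, 2));
```

What each slot actually means (past tense, participle, subjunctive, and so on) is left to the caller, since that is not stated in this thread.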

@MarketingPip
Contributor

@hgiesel - keep in mind that the wiki template parsing for Wiktionary pages needs some love (not just for <irreg>). There are lots of issues with proper parsing, though I do believe @spencermountain is already aware of this.

If you happen to play with the English Wiktionary, you will run into a lot of issues with improper parsing. That said, you are more than welcome to contribute any fixes you find, as I know @spencermountain is a very busy guy.
