[arm] finding IPA transcriptions outside of the Pronunciation block #470

Open
jhdeov opened this issue Nov 7, 2022 · 5 comments

Comments

@jhdeov
Contributor

jhdeov commented Nov 7, 2022

For the word կարկանդակ, WikiPron finds the correct pronunciation [kɑɾkɑndɑk], but it also picks up IPA transcriptions of other words in the Usage notes section, such as [pɛrɑʃˈki]. I'm not sure whether this is an unavoidable glitch on WikiPron's side or something that could be fixed on the Wiktionary side.

It seems that WikiPron is picking up any IPA transcription inside the Armenian entry, even if it's not associated with a dialect. E.g., if you run `wikipron arm --dialect='ladygaga' --no-skip-parens --narrow > randos.tsv`, you get a handful of IPA transcriptions that aren't associated with the pre-defined dialects. These are either a) IPA transcriptions in the Usage notes or Etymology sections, or b) IPA transcriptions for non-standard dialects. This isn't a problem when using WikiPron on a specific language (because the person can just filter those out manually). But I wonder if this glitch causes any other funny business for the other languages.
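A minimal Python sketch of the manual filtering step, in case it helps anyone. It assumes the scraped output is a two-column TSV (word, then pronunciation) and that the fake-dialect output has already been reviewed by hand so that every row in it really is a stray. The helper names `parse_tsv` and `drop_strays` are hypothetical, and the sample rows are illustrative rather than real WikiPron output.

```python
def parse_tsv(text):
    """Parse 'word<TAB>pronunciation' lines into (word, pron) tuples."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        word, pron = line.split("\t", 1)
        rows.append((word, pron))
    return rows


def drop_strays(main_rows, stray_rows):
    """Return main_rows minus any (word, pron) pair listed in stray_rows."""
    strays = set(stray_rows)
    return [row for row in main_rows if row not in strays]


# Illustrative data only; the real files may segment pronunciations differently.
main_rows = parse_tsv("կարկանդակ\tkɑɾkɑndɑk\nկարկանդակ\tpɛrɑʃˈki\n")
stray_rows = parse_tsv("կարկանդակ\tpɛrɑʃˈki\n")
cleaned = drop_strays(main_rows, stray_rows)
```

Matching on the full (word, pronunciation) pair rather than on the word alone keeps the legitimate pronunciation of a word that also has a stray transcription.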

Side note: I wonder if there have been enough situations where people had to fix Wiktionary entries in order to help WikiPron's scraper (as in the various closed issues). If so, perhaps a tips and tricks page would be helpful down the line?

@kylebgorman
Collaborator

It basically finds anything in the pronunciation section in // or []. TBF it is bizarre to be giving the pronunciation of an unrelated Russian word here. I'm going to edit the entry.
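To illustrate why delimiter-based matching over-collects, here is a rough Python sketch. The regular expression is an assumption made for illustration and is not taken from WikiPron's source; the function name `find_transcriptions` is hypothetical.

```python
import re

# Assumed, simplified pattern: a /.../ span or a [...] span.
# This is NOT WikiPron's actual extraction logic.
TRANSCRIPTION = re.compile(r"/[^/\s][^/]*/|\[[^\[\]]+\]")


def find_transcriptions(text):
    """Return every slash- or bracket-delimited span found in text."""
    return TRANSCRIPTION.findall(text)


sample = "Pronunciation: [kɑɾkɑndɑk]. Usage notes: compare [pɛrɑʃˈki]."
found = find_transcriptions(sample)
```

Both bracketed spans come back from `sample`, because delimiters alone carry no information about which word or section a transcription belongs to.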

The Wiktionary people have taken absolutely zero interest in our project so I don't think there's a demand outside of WikiPron developers for this information.

@jhdeov
Contributor Author

jhdeov commented Nov 7, 2022

Heh, the admin ended up agreeing.

@jhdeov
Contributor Author

jhdeov commented Nov 7, 2022

> It basically finds anything in the pronunciation section in // or [].

But then this is a glitch, because the Russian word was not under the Pronunciation section but under a separate heading. The original example is gone now, but another example is գրաբար. Its usage notes explain a pronunciation tidbit. That's in a separate section, but it's getting scraped too.

@kylebgorman
Collaborator

kylebgorman commented Nov 7, 2022 via email

@jhdeov
Contributor Author

jhdeov commented Nov 7, 2022

WikiPron also found IPA transcriptions that were in the Etymology section, before the Pronunciation section. This word had a transcription there until I found and removed it (via the above 'fake dialect' trick).

This makes me think that WikiPron is looking for IPA anywhere in the entry, and not just in the pronunciation box. I'm not sure if that's an error (because the code isn't designed to go outside the pronunciation box) or a missing feature (because the code is designed to go outside the pronunciation box).
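For what it's worth, here is a small Python sketch of what "only the pronunciation box" could mean. It groups IPA spans by the heading that precedes them and keeps only those under Pronunciation. The toy markup, the `IPA` class name as used here, the `[placeholder]` transcription, and the helper names are all assumptions for illustration; real Wiktionary HTML is nested differently, and this is not WikiPron's code.

```python
import xml.etree.ElementTree as ET

# Toy, well-formed stand-in for one language entry (assumed layout).
TOY_ENTRY = """
<div>
  <h3>Etymology</h3>
  <p>Compare <span class="IPA">[placeholder]</span>.</p>
  <h3>Pronunciation</h3>
  <ul><li><span class="IPA">[kɑɾkɑndɑk]</span></li></ul>
  <h3>Usage notes</h3>
  <p>Not to be confused with <span class="IPA">[pɛrɑʃˈki]</span>.</p>
</div>
"""

HEADINGS = {"h2", "h3", "h4", "h5"}


def ipa_by_section(entry_xml):
    """Map each heading title to the IPA spans that follow it."""
    root = ET.fromstring(entry_xml.strip())
    sections = {}
    current = None  # title of the most recent heading seen
    for child in root:
        if child.tag in HEADINGS:
            current = "".join(child.itertext()).strip()
            continue
        for span in child.iter("span"):
            if span.get("class") == "IPA":
                sections.setdefault(current, []).append("".join(span.itertext()))
    return sections


def pronunciation_only(entry_xml):
    """Keep only the transcriptions listed under the Pronunciation heading."""
    return ipa_by_section(entry_xml).get("Pronunciation", [])
```

On the toy entry, `pronunciation_only(TOY_ENTRY)` keeps the one transcription under Pronunciation and skips the ones under Etymology and Usage notes.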
