Nested lists for subdialect info #329

jhdeov · 2021-01-23T05:03:01Z

I noticed this problem for Armenian (https://en.wiktionary.org/wiki/%D5%A5%D6%80%D5%AF%D6%80%D5%B8%D6%80%D5%A4) and a colleague told me it's also found in Portuguese (https://en.wiktionary.org/wiki/afetar). For some languages, the pronunciation entry can use a nested list, such that

- The first level contains the main dialect name
- The second level contains a subdialect name

For example, Portuguese afetar has a level-1 entry for (standard) Brazilian Portuguese. But this entry has 3 level-2 entries for different regions of Brazil. As of now, WikiPron scrapes all 4 pronunciations (https://github.com/kylebgorman/wikipron/blob/master/data/tsv/por_bz_phonemic_filtered.tsv) as part of "Brazilian Portuguese". But that obscures the fact that the 4 entries correspond to separate subdialects.

It would be nice if the script could 'fix' this somehow. Maybe you could add an extra column to the scraped content that records the (sub)dialect name for each entry. For example, for afetar, maybe you could return something like

afetar | a f e t a ɹ | Brazil
afetar | a f e t a ɻ | Paulista
afetar | a f e t a ʁ | South Brazil
afetar | a f e t a χ | Carioca

kylebgorman · 2021-01-23T14:42:15Z

I'd rather just have them in separate files and if people want to combine them they're welcome to. I can't think how I'd use the three-column file for actual processing without just filtering it into separate dialects, so we should do that from the jump. That said the pt issue you point out is important. We should fix that upstream (probably).
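To make the filtering step concrete, here is a minimal sketch, assuming three-column rows of word, pronunciation, and dialect. The function name `split_by_dialect` and the in-memory row format are hypothetical and not part of WikiPron; the rows are the afetar example from the original report.

```python
from collections import defaultdict


def split_by_dialect(rows):
    """Groups (word, pron, dialect) rows into two-column lists per dialect."""
    by_dialect = defaultdict(list)
    for word, pron, dialect in rows:
        by_dialect[dialect].append((word, pron))
    return dict(by_dialect)


# Hypothetical three-column input, taken from the afetar example.
rows = [
    ("afetar", "a f e t a ɹ", "Brazil"),
    ("afetar", "a f e t a ɻ", "Paulista"),
    ("afetar", "a f e t a ʁ", "South Brazil"),
    ("afetar", "a f e t a χ", "Carioca"),
]
per_dialect = split_by_dialect(rows)
```

Each value in `per_dialect` already has the two-column shape, which is why writing one file per (sub)dialect up front would save consumers this step.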


kylebgorman · 2021-01-23T19:46:54Z

@jacksonllee would love your thoughts on this. Keeps coming up...

jacksonllee · 2021-01-24T00:03:38Z

> I'd rather just have them in separate files and if people want to combine them they're welcome to. I can't think how I'd use the three-column file for actual processing without just filtering it into separate dialects, so we should do that from the jump.

I share the same intuition. In general, augmenting the scraped data beyond the current two-column format is something we should avoid within WikiPron. In this particular case, I agree that just having separate files for individual subdialectal varieties is the way to go.

As for implementation -- if the Portuguese afetar example is representative (the Armenian example shows that it's not -- I'll come back to this below), it might be possible to tighten the IPA extraction so that it more narrowly targets each individual <li> element, while ignoring the distinction between the top-level <ul> (for "Portugal" and "Brazil") and the embedded <ul> (for "Paulista", "South Brazil", and "Carioca"). If we do this, a scraping run will produce as many files as there are dialects and subdialects specified. The IPA extraction update could be done as a custom extractor (as has been done for other languages, for other reasons), if we're concerned about changing the default extraction behavior and its potential unknown effects on all other languages.
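A rough sketch of the per-<li> idea, under several assumptions: the markup below is a simplified, hypothetical stand-in for the Wiktionary HTML (the class names `qualifier` and `IPA` and the exact nesting are illustrative), the transcriptions are the afetar strings quoted earlier in this thread, and the standard-library `xml.etree.ElementTree` is used instead of WikiPron's actual extraction code.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical markup: one top-level <li> with an embedded <ul>.
HTML = """
<ul>
  <li><span class="qualifier">Brazil</span> <span class="IPA">a f e t a ɹ</span>
    <ul>
      <li><span class="qualifier">Paulista</span> <span class="IPA">a f e t a ɻ</span></li>
      <li><span class="qualifier">South Brazil</span> <span class="IPA">a f e t a ʁ</span></li>
      <li><span class="qualifier">Carioca</span> <span class="IPA">a f e t a χ</span></li>
    </ul>
  </li>
</ul>
"""


def own_entries(root):
    """Yields (dialect, ipa) for every <li>, regardless of nesting depth."""
    for li in root.iter("li"):
        label = ipa = None
        # findall("span") matches direct children only, so pronunciations in
        # an embedded <ul> are attributed to their own <li>, not the parent.
        for span in li.findall("span"):
            if span.get("class") == "qualifier":
                label = span.text
            elif span.get("class") == "IPA":
                ipa = span.text
        if label and ipa:
            yield label, ipa


entries = list(own_entries(ET.fromstring(HTML.strip())))
```

Here `entries` holds four separate (dialect, pronunciation) pairs instead of four pronunciations all filed under "Brazil"; whether the real pages can be handled this simply would need checking against the actual HTML.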

The Armenian երկրորդ example shows the embedded <ul> with just the word "colloquial" as the dialect/variety indicator, which is a challenge. For instance, if someone specifies "Eastern Armenian, colloquial" for the --dialect flag in a scraping run, there will currently be no result, because WikiPron has no way of knowing that it should target the "colloquial" part under "Eastern Armenian, standard" in the given Wiktionary HTML. For cases like this, it would be ideal if the upstream Wiktionary data could be updated so that the dialect/variety indicator is more verbose (i.e., showing "Eastern Armenian, standard", "Eastern Armenian, colloquial", "Western Armenian, standard", and "Western Armenian, colloquial" all in full). We could then use a strategy within WikiPron similar to the one described above for Portuguese.
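The mismatch can be illustrated with a small sketch. It assumes plain exact matching on the label text, which may differ from how the --dialect flag is actually implemented; `select` is a hypothetical helper, and the pronunciations are placeholders because the thread does not quote the Armenian transcriptions.

```python
def select(entries, dialect):
    """Returns pronunciations whose label equals the requested dialect."""
    return [pron for label, pron in entries if label == dialect]


# Labels as they appear with the terse embedded indicator.
terse = [
    ("Eastern Armenian, standard", "PRON_STANDARD"),   # placeholder value
    ("colloquial", "PRON_COLLOQUIAL"),                 # placeholder value
]

# Labels after the upstream indicator is spelled out in full.
verbose = [
    ("Eastern Armenian, standard", "PRON_STANDARD"),
    ("Eastern Armenian, colloquial", "PRON_COLLOQUIAL"),
]

missed = select(terse, "Eastern Armenian, colloquial")
found = select(verbose, "Eastern Armenian, colloquial")
```

With the terse labels the request matches nothing, while the fully spelled-out labels make the colloquial entry selectable without any knowledge of the list nesting.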

jhdeov · 2021-01-24T07:52:25Z

I managed to get the Armenian Wiktionary editors to upgrade their script so that the colloquial entries now have a more informative text, like Portuguese.

kylebgorman · 2021-01-24T14:47:28Z

Nice!


kylebgorman · 2021-01-27T18:55:25Z

Did the upstream editors fix this as far as our code is concerned?

jhdeov · 2021-01-27T19:24:28Z

Well for Armenian, most of the entries now look like Portuguese. There are still probably some stragglers because of some older entries. But I can find+fix those once I find out how the WikiPron code is handling the canonical cases like Portuguese.

kylebgorman added the bug label · Jan 23, 2021

jacksonllee mentioned this issue · Jan 30, 2021: Finishes scrape, adds restart command #340 (Merged)

lfashby mentioned this issue · Mar 27, 2021: Experimental Min Nan extraction function #397 (Merged)

agutkin mentioned this issue · Jun 24, 2021: Unassigned/non-standard (compound) language and dialect codes #432 (Closed)
