IPA Dictionary for Italian #7

Open
loretoparisi opened this issue Jul 13, 2018 · 12 comments

@loretoparisi

Any plans to add an Italian IPA dict? Thanks.

@dohliam
Member

dohliam commented Jul 14, 2018

It would be great to add Italian -- do you know of any sources for such a dictionary?

If there isn't something already available under an open license, it might be possible to generate a dictionary using a script. In that case, we would need to have both a word list (something like Aspell would probably be fine) and a list of rules for representing Italian orthography in IPA. The script would then apply these rules on the word list to generate the dictionary.

The script option described above is only really practical if there is a reasonably consistent correspondence between the orthography and pronunciation. My impression is that this is the case with Standard Italian, so it might be worth a try if nothing else is available.
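A minimal sketch of the rule-based approach described above, assuming a small, illustrative subset of Italian spelling rules. The rule table, `to_ipa`, and `build_dictionary` are hypothetical names, not part of this repository, and the tab-separated `word<TAB>/ipa/` output is an assumption about the target format. Stress, open versus closed vowels, and s/z voicing are deliberately left out.

```python
import re

FRONT = "eèéiì"   # vowels that soften c, g, sc
BACK = "aàoòuù"   # vowels that need a silent "i" for the soft sound

# Ordered (pattern, IPA) rules; the first rule matching at a position wins.
# Illustrative subset only, not a complete description of Italian.
RULES = [
    (rf"gli(?=[{BACK}eèé])", "ʎ"),
    (r"gn", "ɲ"),
    (rf"sci(?=[{BACK}])", "ʃ"),
    (rf"sc(?=[{FRONT}])", "ʃ"),
    (r"ch", "k"),
    (r"gh", "\u0261"),            # IPA ɡ
    (rf"ci(?=[{BACK}])", "tʃ"),
    (rf"c(?=[{FRONT}])", "tʃ"),
    (rf"gi(?=[{BACK}])", "dʒ"),
    (rf"g(?=[{FRONT}])", "dʒ"),
    (r"qu", "kw"),
    (r"c", "k"),
    (r"g", "\u0261"),
    (r"h", ""),                   # silent
    (r"z", "ts"),                 # assumption: ts/dz cannot be told apart by rule
]
COMPILED = [(re.compile(p), ipa) for p, ipa in RULES]

# Accented vowels, used when no consonant rule applies.
ACCENTS = {"à": "a", "è": "ɛ", "é": "e", "ì": "i", "ò": "ɔ", "ù": "u"}


def to_ipa(word):
    """Apply the ordered rules left to right over one word."""
    word = word.lower()
    out, i = [], 0
    while i < len(word):
        for pattern, ipa in COMPILED:
            m = pattern.match(word, i)
            if m:
                out.append(ipa)
                i = m.end()
                break
        else:
            out.append(ACCENTS.get(word[i], word[i]))
            i += 1
    return "".join(out)


def build_dictionary(words):
    """Return one 'word<TAB>/ipa/' line per input word."""
    return [f"{w}\t/{to_ipa(w)}/" for w in words]


print("\n".join(build_dictionary(["cena", "ciao", "gnomo", "città"])))
```

Ordering the rules from most to least specific keeps the table readable, but the output would still need manual checking, for example for doubled "cc" and "gg" before front vowels.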

@loretoparisi
Author

loretoparisi commented Jul 15, 2018

@dohliam thanks! I will have a look to find a good dictionary for that.

For the spelling part in Italian, there are Hunspell dictionaries here: https://github.com/loretoparisi/dictionaries
They are adapted for Hunspell from the LibreOffice dictionaries:
https://github.com/LibreOffice/dictionaries

These repositories have dictionaries for:

  • hunspell - basic spell check using the Hunspell engine
  • hyphen - word hyphenation
  • thesaurus - synonyms and acronyms
  • grammar - grammar check using different frameworks, e.g. LanguageTool, Lightproof

CMUSphinx is a good source for phonetic dictionaries: https://cmusphinx.github.io/ They have an Italian phonetic dictionary (also used to build a grapheme-to-phoneme predictor) in the downloads: https://cmusphinx.github.io/wiki/download/ and here: https://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/Italian/

Other languages are available there as well. One problem I can see is that the encoding of the phonetic symbols is not clear to me:

celebrare tSS e l e b r a1 r e
celebrato tSS e l e b r a1 t o
celebravano tSS e l e b r a1 v a n o
celeste tSS e l EE s t e
celestiale tSS e l e s t j a1 l e
celestiali tSS e l e s t j a1 l i
celesti tSS e l EE s t i
celio tSS EE l j o
celi tSS EE l i
cella tSS EE l l a
cenare tSS e n a1 r e
cenarono tSS e n a1 r o n o
cenato tSS e n a1 t o

@dohliam
Member

dohliam commented Jul 16, 2018

@loretoparisi Fantastic! Thanks for finding all this info. I wasn't aware that there were CMU dictionaries for other languages. The transcription format is indeed a little odd, but luckily it's also fairly familiar since I already converted the en_US dictionary from CMU format before.

I've written a quick script (here) to convert the Italian CMU dictionary into IPA. You can see the result of this conversion here.
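For readers without access to the linked script, here is a rough sketch of what such a conversion could look like. It is not the script mentioned above. The phone table only covers symbols visible in the sample entries earlier in this thread (`tSS`, `EE`, and a trailing `1` for primary stress), and their IPA values are assumptions inferred from those entries; `convert_entry` is a hypothetical name.

```python
# Assumed phone-to-IPA table, inferred from the sample entries in this thread.
# Any phone not listed here is passed through unchanged.
PHONE_TO_IPA = {
    "tSS": "tʃ",
    "EE": "ɛ",
}
STRESS = "\u02c8"  # IPA primary stress mark


def convert_entry(line):
    """Convert one 'word phone phone ...' line to (word, '/ipa/')."""
    word, *phones = line.split()
    ipa = []
    for phone in phones:
        stressed = phone.endswith("1")        # e.g. "a1" is a stressed "a"
        base = phone[:-1] if stressed else phone
        symbol = PHONE_TO_IPA.get(base, base)
        # The stress mark is placed directly before the vowel here,
        # not at the start of the syllable.
        ipa.append((STRESS if stressed else "") + symbol)
    return word, "/" + "".join(ipa) + "/"


print(convert_entry("cenato tSS e n a1 t o"))
print(convert_entry("cella tSS EE l l a"))
```

A real converter would need the full phone inventory of the Italian CMU dictionary, which is not shown in this thread.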

There are still some remaining issues -- notably the sound they transcribe as nf is highly questionable since it sometimes seems to correspond to ŋf (e.g., trionfo), sometimes to nv (e.g., circonvicini) and sometimes to nf (e.g., conferma). These may need to be manually fixed.

I've managed to extract the primary stress markers out of the data, which is useful, but because they place stress on the vowel and don't indicate syllable boundaries, it's very difficult to position these correctly at the beginning of the syllable in the resulting IPA. So for example, città is converted to /tʃittˈa/ rather than /tʃitˈta/ because we would need some way for the script to know that the syllable should be split between the two consonants. These will have to be adjusted by adding syllable parsing rules to the script (or manually).
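One possible shape for such syllable parsing rules, as a sketch under simplifying assumptions: geminates are split across syllables, a glide (j, w) joins the preceding consonant, a liquid (l, r) joins a preceding obstruent, "s" joins a following consonant, and a word-initial cluster belongs entirely to the onset. The names `segment`, `joins_onset`, and `move_stress` are hypothetical, and the rules are not checked against a full description of Italian syllabification.

```python
VOWELS = set("aeiouɛɔ")
GLIDES = {"j", "w"}
LIQUIDS = {"l", "r"}
OBSTRUENTS = {"p", "b", "t", "d", "k", "g", "\u0261", "f", "v"}
AFFRICATES = ("tʃ", "dʒ", "ts", "dz")   # treated as single segments
STRESS = "\u02c8"                        # IPA primary stress mark


def segment(ipa):
    """Split an IPA string into segments, keeping affricates together."""
    segs, i = [], 0
    while i < len(ipa):
        step = 2 if ipa[i:i + 2] in AFFRICATES else 1
        segs.append(ipa[i:i + step])
        i += step
    return segs


def joins_onset(prev, nxt):
    """Can consonant `prev` share a syllable onset with the following `nxt`?"""
    if prev == nxt:
        return False                     # geminates split: t.t
    if nxt in GLIDES:
        return True                      # tj, kw, ...
    if nxt in LIQUIDS and prev in OBSTRUENTS:
        return True                      # br, tr, pl, ...
    return prev == "s"                   # st, sp, ...


def move_stress(ipa):
    """Move a stress mark from before the vowel to the syllable onset."""
    segs = segment(ipa)
    if STRESS not in segs:
        return ipa
    pos = segs.index(STRESS)
    del segs[pos]                        # segs[pos] is now the stressed vowel
    start = pos                          # start of the consonant cluster
    while start > 0 and segs[start - 1] not in VOWELS:
        start -= 1
    if start == 0 or start == pos:
        onset = start                    # word-initial cluster, or no cluster
    else:
        onset = pos - 1                  # last consonant always joins the onset
        while onset > start and joins_onset(segs[onset - 1], segs[onset]):
            onset -= 1
    segs.insert(onset, STRESS)
    return "".join(segs)


print(move_stress("tʃitt" + STRESS + "a"))
print(move_stress("tʃelebr" + STRESS + "are"))
```

With these rules the città example above comes out with the stress mark between the two consonants, which is the placement asked for.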

The provided CMU dictionary is a little small unfortunately -- only 7109 entries. It's a good start, but it would be much better if we could parse the Hunspell / Aspell word list instead. Do you have any experience with using CMU Sphinx to generate phonetic output? If so, we could use my script to convert the result to IPA.
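As a starting point for parsing the Hunspell word list, here is a small sketch that pulls the base words out of a `.dic` file. It assumes the usual layout: an entry count on the first line, then one `word/FLAGS` entry per line, optionally followed by morphological fields. The function name `read_hunspell_words` is hypothetical. Note that this yields stems only; inflected forms are generated by the affix rules in the matching `.aff` file and are not expanded here.

```python
def read_hunspell_words(lines):
    """Yield the base word of each entry in a Hunspell .dic file."""
    for number, line in enumerate(lines):
        entry = line.strip()
        if not entry:
            continue
        if number == 0 and entry.isdigit():
            continue                      # first line is the entry count
        # Drop morphological fields (after whitespace), then affix flags
        # (after the slash).
        word = entry.split()[0].split("/", 1)[0]
        if word:
            yield word


sample = ["3", "casa/S", "città", "cenare/A po:verb"]
print(list(read_hunspell_words(sample)))
```

To read a real file, something like `open(path, encoding="utf-8")` can be passed in directly, assuming the dictionary is UTF-8 encoded (the `.aff` file declares the actual encoding).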

@loretoparisi
Author

loretoparisi commented Jul 17, 2018

@dohliam You are welcome, as you said it's a good start! I think it's a good idea to use CMU Sphinx directly to generate phonetic output using the model provided for Italian (the file it.fst); this should handle the problem of out-of-vocabulary words. Let me have a look at the model. Of course, since the training was done on a small dictionary (the 7109 entries), we could also get false positives in the output, but this is something we should check later on.
By the way, according to the README for the it model, we have:

                          EVALUATION RESULTS                          
----------------------------------------------------------------------
(T)otal tokens in reference: 2528
(M)atches: 2404  (S)ubstitutions: 122  (I)nsertions: 0  (D)eletions: 2
% Correct (M/T)           -- %95.09
% Token ER ((S+I+D)/T)    -- %4.91
% Accuracy 1.0-ER         -- %95.09
       --------------------------------------------------------       
(S)equences: 357  (C)orrect sequences: 257  (E)rror sequences: 100
% Sequence ER (E/S)       -- %28.01
% Sequence Acc (1.0-E/S)  -- %71.99
######################################################################
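The percentages in this report follow directly from the raw counts, which can be checked with a few lines. This is only an arithmetic sketch; the function name `g2p_error_rates` is made up for illustration.

```python
def g2p_error_rates(total, substitutions, insertions, deletions,
                    sequences, error_sequences):
    """Return (token error rate, sequence error rate) as fractions."""
    token_er = (substitutions + insertions + deletions) / total
    sequence_er = error_sequences / sequences
    return token_er, sequence_er


# Counts copied from the report above.
token_er, sequence_er = g2p_error_rates(
    total=2528, substitutions=122, insertions=0, deletions=2,
    sequences=357, error_sequences=100,
)
print(f"Token ER:     {token_er:.2%}")
print(f"Token Acc:    {1 - token_er:.2%}")
print(f"Sequence ER:  {sequence_er:.2%}")
print(f"Sequence Acc: {1 - sequence_er:.2%}")
```

So about 95% of phonemes are correct, but only about 72% of whole words are transcribed without any error, which is the figure that matters more for a dictionary.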

I will try to run the model over the Hunspell dict and we will see how the accuracy holds up on the test set.
I have put the stuff here as well: https://github.com/loretoparisi/ipa-phonetics-dict/blob/master/it/README

Starting from the new work of the CMU team, I have also worked on a TensorFlow G2P model, so that a neural network model can take out-of-vocabulary words into account. This is the Docker setup I'm using for that:

https://github.com/loretoparisi/docker/tree/master/g2p-seq2seq

This is a work in progress, and it should eventually replace the current CMU models, so it will work for Italian too.

@dohliam
Member

dohliam commented Jul 17, 2018

@loretoparisi That's amazing! Sounds like it could be a much better approach, and it will be interesting to see how accurate the results are on the Hunspell list. In the meantime I'll see what I can do about the syllabification issue -- hopefully there are enough clear rules about what constitutes a syllable that we can automate the conversion of stress markers in the final result.

@dohliam
Member

dohliam commented Dec 28, 2018

@loretoparisi Just checking in... Have you had any progress with this so far? It would be great to add Italian to the database once it's ready! 😄

@loretoparisi
Author

@dohliam I have basically used this one: https://github.com/loretoparisi/ipa-phonetics-dict/tree/master/it For the spelling accuracy, I have to go back to it since I did that some time ago. I will post an update.

@dohliam
Member

dohliam commented Dec 29, 2018

@loretoparisi Excellent, thanks! 👍 I have this version from before but will wait for the update to convert it and add to the database.

@doolio

doolio commented Feb 3, 2023

I've just discovered your project. Any further progress on adding Italian to the database?

@dohliam
Member

dohliam commented Feb 10, 2023

@doolio The links above are the latest progress I am aware of with the Italian IPA list. In case you would like to try working with something in the meantime, there are two options: this list which is not very large and has been auto-generated based on the Italian CMU dictionary, and this one which attempts to use a G2P approach to handle out-of-vocabulary words. Neither of these has been manually checked for errors, though, which is why there is currently nothing for Italian yet in the main repo here. All contributions welcome! 😄

@loretoparisi Have you had the chance to take a look at this recently? It would be great to add Italian to the project if possible.

@doolio

doolio commented Feb 11, 2023

Thanks. Yes, I had a look at the first list already as it was linked earlier in this discussion. You seem to have forgotten the link to the second list. I'm trying to learn Italian, and if I improve these lists in the process I will of course contribute them back here.

@dohliam
Member

dohliam commented Feb 11, 2023

@doolio Fixed the link, but that page is also linked earlier in this discussion, so you may have seen it already. In both cases, the output needs to be checked by someone to make sure there are no glaring errors in the transcription. My sense is that Italian orthography might be regular enough that there would likely not be more than a few outliers or exceptions for a rule-based transcription, but it would be nice if someone could confirm that and correct the output if needed.
