dohliam/ipa-transcriber

IPA Transcriber - Auto-transcribe arbitrary languages into phonemic IPA

This Ruby script leverages the IPA dictionary databases from the ipa-dict project to automatically convert orthographic text in a variety of languages into phonemic transcription in the International Phonetic Alphabet (IPA).

Take the following text as an example:

Goat, Dog, and Cow were great friends. One day they went on a journey in a taxi.

Using the English (US) dictionary, this would be converted to:

ˈɡoʊt ˈdɔɡ ˈænd ˈkaʊ ˈwɝ ˈɡɹeɪt ˈfɹɛndz ˈwən ˈdeɪ ˈðeɪ ˈwɛnt ˈɑn ˈeɪ ˈdʒɝni ˈɪn ˈeɪ ˈtæksi

Results are adjustable using optional custom vocabulary lists.
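
The word-by-word lookup described above can be sketched roughly as follows. This is an illustration only, not the script's actual code: `parse_dict` and `transcribe` are hypothetical helper names, and the sketch assumes each dictionary line has the form `word<TAB>/ipa/`, that variant pronunciations are separated by commas, and that a custom vocabulary list uses the same format and takes precedence over the main dictionary.

```ruby
# Build a lookup table from dictionary lines.
# Assumed line format: word<TAB>/ipa/ (variants separated by commas).
def parse_dict(lines)
  lines.each_with_object({}) do |line, dict|
    word, ipa = line.chomp.split("\t", 2)
    next if word.nil? || ipa.nil?
    # Keep only the first listed pronunciation and strip the slashes.
    dict[word.downcase] = ipa.split(",").first.strip.delete("/")
  end
end

# Replace every word with its IPA form; unknown words are left unchanged.
def transcribe(text, dict)
  text.scan(/[\p{L}\p{M}']+/).map { |w| dict.fetch(w.downcase, w) }.join(" ")
end

dict   = parse_dict(["goat\t/ˈɡoʊt/", "dog\t/ˈdɔɡ/", "and\t/ˈænd/, /ənd/"])
custom = parse_dict(["cow\t/ˈkaʊ/"]) # hypothetical custom vocabulary entry

# Entries from the custom list override the main dictionary (an assumption).
puts transcribe("Goat, Dog, and Cow", dict.merge(custom))
# prints: ˈɡoʊt ˈdɔɡ ˈænd ˈkaʊ
```

Words missing from both tables pass through unchanged in this sketch, which makes them easy to spot in the output.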

Requirements

Before Ruby 2.4, the built-in .upcase and .downcase methods did not handle non-ASCII text, so this script relies on the alternative versions provided by the unicode_utils gem (UnicodeUtils):

gem install unicode_utils
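
As a small illustration of why Unicode-aware case conversion matters for dictionary lookups, the sketch below lowercases a non-ASCII word before it would be looked up. `fold_case` is a hypothetical helper name, not part of the script. The fallback branch assumes Ruby 2.4 or later, where `String#downcase` handles non-ASCII letters itself.

```ruby
begin
  require "unicode_utils"

  # Use the gem's Unicode-aware lowercasing when it is installed.
  def fold_case(str)
    UnicodeUtils.downcase(str)
  end
rescue LoadError
  # Fallback for this sketch only: rely on the built-in method.
  def fold_case(str)
    str.downcase
  end
end

puts fold_case("ÉCOLE")
# prints: école
```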

Usage

To convert some text in a file, just execute ipa_transcriber.rb and provide an input file and IPA dictionary to use as a basis for conversion:

./ipa_transcriber.rb -f [TEXTFILE] -i [DICTIONARY]

See below for details on command-line options and example invocations.

Options

The following options are available. The -f and -i options are mandatory, but -w is optional:

  • -f, --filename FILE: Source file (specify a source text file to convert)
  • -i, --ipa-dict DICT: IPA dictionary file (specify the location of the IPA dictionary file to use for the language to convert from)
  • -w, --wordlist LIST: Optional custom word list (an additional list of words and IPA pronunciations to use for words that don't match the provided dictionary file -- e.g., proper names, nonce words, loanwords, etc.)
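
For readers extending the script, here is a minimal sketch of how these three options could be handled with Ruby's standard OptionParser library. It is not taken from ipa_transcriber.rb: the `parse_options` name, the hash keys, and the error message are all assumptions made for illustration.

```ruby
require "optparse"

# Parse -f, -i and -w from an argument array and return them as a hash.
def parse_options(argv)
  options = {}
  parser = OptionParser.new do |opts|
    opts.banner = "Usage: ipa_transcriber.rb -f FILE -i DICT [-w LIST]"
    opts.on("-f", "--filename FILE", "Source file") { |v| options[:filename] = v }
    opts.on("-i", "--ipa-dict DICT", "IPA dictionary file") { |v| options[:ipa_dict] = v }
    opts.on("-w", "--wordlist LIST", "Optional custom word list") { |v| options[:wordlist] = v }
  end
  parser.parse(argv)

  # -f and -i are mandatory; -w may be omitted.
  missing = [:filename, :ipa_dict].reject { |key| options.key?(key) }
  raise ArgumentError, "missing required option(s): #{missing.join(', ')}" unless missing.empty?

  options
end

p parse_options(%w[-f story.txt -i en_US.txt])
```

Calling `parse_options(ARGV)` at the top of a script would apply the same checks to the real command line.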

Examples

The following examples assume that you have cloned, or downloaded and extracted, the ipa-dict repository to your home folder.

Transcribe some English (US) text into IPA:

./ipa_transcriber.rb -f ~/english.txt -i ~/ipa-dict/data/en_US.txt

Transcribe some French (Standard) text into IPA:

./ipa_transcriber.rb -f ~/french.txt -i ~/ipa-dict/data/fr_FR.txt

Transcribe some Japanese text into IPA:

./ipa_transcriber.rb -f ~/japanese.txt -i ~/ipa-dict/data/ja.txt

Notes

  • The automated IPA transcription will generally need manual correction, both to disambiguate homographs (e.g., "read" or "bow") and to supply words not found in the IPA dictionary. The -w option helps with some of this work: supply a custom list of the special words used in a particular text.
  • Languages whose orthographies do not use spaces to separate words (such as Chinese and Japanese) will need to be manually spaced before converting to IPA. There are tools available that can automate this process to some extent, but their results will need to be carefully reviewed as parsing errors are common.
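
To decide which words belong in a custom list, it can help to scan a text before transcribing it. The sketch below reports words that are missing from the dictionary and words that have more than one listed pronunciation (possible homographs). It is a hypothetical helper, not part of the script, and it assumes dictionary lines of the form `word<TAB>/ipa/` with variant pronunciations separated by commas.

```ruby
# Return the words in `text` that need manual attention.
def words_to_review(text, dict_lines)
  variants = {}
  dict_lines.each do |line|
    word, ipa = line.chomp.split("\t", 2)
    next if word.nil? || ipa.nil?
    variants[word.downcase] = ipa.split(",").map(&:strip)
  end

  words = text.scan(/[\p{L}\p{M}']+/).map(&:downcase).uniq
  {
    missing:   words.reject { |w| variants.key?(w) },            # not in the dictionary
    ambiguous: words.select { |w| variants.fetch(w, []).size > 1 } # several pronunciations
  }
end

sample = ["read\t/ˈɹid/, /ˈɹɛd/", "the\t/ˈðə/", "book\t/ˈbʊk/"]
report = words_to_review("Nkosi read the book", sample)
puts "Missing:   #{report[:missing].join(', ')}"
puts "Ambiguous: #{report[:ambiguous].join(', ')}"
```

With the sample data, "nkosi" is reported as missing and "read" as ambiguous. The check only flags homographs that the dictionary itself lists with several pronunciations.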

Contributing

This project was developed to support the creation of Storybooks Speech and Hearing, and has been used to convert a corpus of stories in more than a dozen languages. PRs and other contributions to expand functionality for other use cases are more than welcome!

To do

  • Allow for one-off conversion of text on the command-line
  • Handle conversion of text from STDIN
  • Add config file to allow setting default language

License

MIT.
