IPA Transcriber - Auto-transcribe arbitrary languages into phonemic IPA

This Ruby script leverages the IPA dictionary databases from the ipa-dict project to automatically convert orthographic text in a variety of languages into phonemic transcription in the International Phonetic Alphabet (IPA).

Take the following text (from here) for example:

Goat, Dog, and Cow were great friends. One day they went on a journey in a taxi.

Using the English (US) dictionary, this would be converted to:

ˈɡoʊt ˈdɔɡ ˈænd ˈkaʊ ˈwɝ ˈɡɹeɪt ˈfɹɛndz ˈwən ˈdeɪ ˈðeɪ ˈwɛnt ˈɑn ˈeɪ ˈdʒɝni ˈɪn ˈeɪ ˈtæksi

Results are adjustable using optional custom vocabulary lists.
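
As a rough illustration of the idea, the sketch below looks up each word in a dictionary and lets custom entries adjust the result. It is not the script's actual code: the helper names parse_entries and transcribe are hypothetical, the input format (one "word<TAB>/ipa/" entry per line, with alternatives separated by commas) is an assumption about the ipa-dict data files, and giving custom entries precedence is only one plausible reading of how the word list is applied.

```ruby
# Parse ipa-dict style lines ("word<TAB>/ipa/, /ipa2/") into a Hash.
# Assumption: one entry per line, tab-separated, pronunciations in slashes.
def parse_entries(lines)
  lines.each_with_object({}) do |line, dict|
    word, ipa = line.chomp.split("\t", 2)
    next if word.nil? || ipa.nil?
    # Keep only the first listed pronunciation and drop the slashes.
    dict[word.downcase] = ipa.split(",").first.strip.delete("/")
  end
end

# Transcribe text word by word. In this sketch custom entries win over
# dictionary entries, and unknown words are passed through unchanged.
def transcribe(text, dict, custom = {})
  text.scan(/[[:alpha:]']+/).map do |word|
    key = word.downcase
    custom[key] || dict[key] || word
  end.join(" ")
end

dict = parse_entries(["goat\t/ˈɡoʊt/", "dog\t/ˈdɔɡ/", "a\t/ˈeɪ/, /ə/"])
custom = parse_entries(["a\t/ə/"])
puts transcribe("Goat, Dog, and a taxi", dict, custom)
# prints: ˈɡoʊt ˈdɔɡ and ə taxi
```

Words missing from both sources ("and" and "taxi" here) are left in their original spelling, which makes them easy to spot during manual review.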

Requirements

Since the built-in .upcase and .downcase methods in Ruby versions before 2.4 don't support non-ASCII text, this script requires the alternative versions provided by the UnicodeUtils gem:

gem install unicode_utils
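
Case folding matters because dictionary keys are typically lowercase while words at the start of a sentence are not. The following sketch shows one way to lowercase non-ASCII words: it uses UnicodeUtils.downcase when the gem is installed and otherwise falls back to String#downcase, which handles non-ASCII text on Ruby 2.4 and later. The helper name normalize_case is hypothetical and does not come from the script.

```ruby
# Use UnicodeUtils when the gem is installed; otherwise fall back to
# String#downcase, which handles non-ASCII text on Ruby 2.4 and later.
begin
  require "unicode_utils"
  HAVE_UNICODE_UTILS = true
rescue LoadError
  HAVE_UNICODE_UTILS = false
end

# normalize_case is a hypothetical helper name, not taken from the script.
def normalize_case(word)
  HAVE_UNICODE_UTILS ? UnicodeUtils.downcase(word) : word.downcase
end

puts normalize_case("ÉCOLE")
puts normalize_case("ÜBER")
```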

Usage

To convert some text in a file, just execute ipa_transcriber.rb and provide an input file and IPA dictionary to use as a basis for conversion:

./ipa_transcriber.rb -f [TEXTFILE] -i [DICTIONARY]

See below for details on command-line options and example invocations.

Options

The following options are available. The -f and -i options are mandatory, but -w is optional:

-f, --filename FILE: Source file (specify a source text file to convert)
-i, --ipa-dict DICT: IPA dictionary file (specify the location of the IPA dictionary file to use for the language to convert from)
-w, --wordlist LIST: Optional custom word list (an additional list of words and IPA pronunciations to use for words that don't match the provided dictionary file -- e.g., proper names, nonce words, loanwords, etc.)
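
For readers who want to see how such options are commonly declared in Ruby, here is a minimal sketch using OptionParser from the standard library. It mirrors the three documented flags, but the helper name parse_cli, the option keys, and the error handling are assumptions; the script itself may be organized differently.

```ruby
require "optparse"

# Parse the three documented options from an argument array.
# parse_cli is a hypothetical helper; the real script may differ.
def parse_cli(argv)
  options = {}
  parser = OptionParser.new do |opts|
    opts.banner = "Usage: ipa_transcriber.rb -f FILE -i DICT [-w LIST]"
    opts.on("-f", "--filename FILE", "Source file") { |v| options[:filename] = v }
    opts.on("-i", "--ipa-dict DICT", "IPA dictionary file") { |v| options[:ipa_dict] = v }
    opts.on("-w", "--wordlist LIST", "Optional custom word list") { |v| options[:wordlist] = v }
  end
  parser.parse(argv)
  # -f and -i are mandatory; -w may be omitted.
  missing = [:filename, :ipa_dict].reject { |key| options.key?(key) }
  raise OptionParser::MissingArgument, missing.join(", ") unless missing.empty?
  options
end

p parse_cli(%w[-f story.txt -i en_US.txt])
```

Calling parse_cli without -f or -i raises OptionParser::MissingArgument in this sketch, which a caller could rescue to print the usage banner.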

Examples

The following examples assume that you have cloned (or downloaded and extracted) the ipa-dict repository to your home folder.

Transcribe some English (US) text into IPA:

./ipa_transcriber.rb -f ~/english.txt -i ~/ipa-dict/data/en_US.txt

Transcribe some French (Standard) text into IPA:

./ipa_transcriber.rb -f ~/french.txt -i ~/ipa-dict/data/fr_FR.txt

Transcribe some Japanese text into IPA:

./ipa_transcriber.rb -f ~/japanese.txt -i ~/ipa-dict/data/ja.txt

Notes

The automated IPA transcription will generally need to be manually tweaked in order to disambiguate homographs (e.g., "read" or "bow") and to fill in words not found in the IPA dictionary. Some of this work can be reduced by using the -w option to supply a custom list of special words used in a particular text.
Languages whose orthographies do not use spaces to separate words (such as Chinese and Japanese) will need to be manually spaced before converting to IPA. There are tools available that can automate this process to some extent, but their results will need to be carefully reviewed as parsing errors are common.
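
To give a sense of what automated spacing involves, the sketch below implements greedy longest-match segmentation, a simple baseline technique: at each position it takes the longest run of characters found in a vocabulary, or a single character if nothing matches. The function name segment, the max_len parameter, and the tiny vocabulary are all illustrative assumptions. Dedicated segmentation tools are considerably more sophisticated, and even their output needs review.

```ruby
require "set"

# Greedy longest-match segmentation: at each position, take the longest
# vocabulary word that matches, or a single character if none does.
def segment(text, vocabulary, max_len = 8)
  words = []
  chars = text.chars
  i = 0
  while i < chars.length
    len = [max_len, chars.length - i].min
    # Shrink the candidate until it is a known word or one character long.
    len -= 1 while len > 1 && !vocabulary.include?(chars[i, len].join)
    words << chars[i, len].join
    i += len
  end
  words.join(" ")
end

# Illustrative vocabulary only; a real run would use the dictionary's keys.
vocab = Set.new(%w[今日 は 天気 いい です])
puts segment("今日は天気がいいです", vocab)
# prints: 今日 は 天気 が いい です
```

Greedy matching fails when a longer vocabulary entry overlaps the correct word boundary, which is one reason the spaced text should be checked by hand before transcription.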

Contributing

This project was developed to support the creation of Storybooks Speech and Hearing, and has been used to convert a corpus of stories in more than a dozen languages. PRs and other contributions to expand functionality for other use cases are more than welcome!

To do

Allow for one-off conversion of text on the command-line
Handle conversion of text from STDIN
Add config file to allow setting default language

License

MIT.
