
Lexikos - λεξικός /lek.si.kós/

A collection of pronunciation dictionaries and neural grapheme-to-phoneme models.

Install Lexikos

Install from PyPI

pip install lexikos

Editable Install from Source

git clone https://github.com/bookbot-hive/lexikos.git
pip install -e lexikos

Usage

Lexicon

>>> from lexikos import Lexicon
>>> lexicon = Lexicon()
>>> print(lexicon["added"])
{'ˈæ d ɪ d', 'ˈæ ɾ ə d', 'æ ɾ ɪ d', 'a d ɪ d', 'ˈa d ɪ d', 'æ ɾ ə d', 'ˈa d ə d', 'a d ə d', 'ˈæ d ə d', 'æ d ə d', 'æ d ɪ d', 'ˈæ ɾ ɪ d'}
>>> print(lexicon["runner"])
{'ɹ ʌ n ɚ', 'ɹ ʌ n ə', 'ɹ ʌ n ɝ', 'ˈr ʌ n ɝ'}
>>> print(lexicon["water"])
{'ˈʋ aː ʈ ə r ɯ', 'ˈw oː t ə', 'w ɑ t ə ɹ', 'ˈw aː ʈ ə r ɯ', 'ˈw ɔ t ɝ', 'w ɔ t ə ɹ', 'ˈw ɑ t ə ɹ', 'w ɔ t ɝ', 'w ɑ ɾ ɚ', 'ˈw ɑ ɾ ɚ', 'ˈʋ ɔ ʈ ə r', 'w ɔ ɾ ɚ', 'w ɔː t ə', 'ˈw oː ɾ ə', 'ˈw ɔ ʈ ə r'}

To get a lexicon where phonemes are normalized (diacritics removed, digraphs split):

>>> from lexikos import Lexicon
>>> lexicon = Lexicon(normalize_phonemes=True)
>>> print(lexicon["added"])
{'æ ɾ ɪ d', 'a d ɪ d', 'a d ə d', 'æ ɾ ə d', 'æ d ə d', 'æ d ɪ d'}
>>> print(lexicon["runner"])
{'ɹ ʌ n ɚ', 'ɹ ʌ n ə', 'r ʌ n ɝ', 'ɹ ʌ n ɝ'}
>>> print(lexicon["water"])
{'w o ɾ ə', 'w ɔ t ə', 'ʋ ɔ ʈ ə r', 'w a ʈ ə r ɯ', 'w ɔ t ə ɹ', 'ʋ a ʈ ə r ɯ', 'w ɑ ɾ ɚ', 'w o t ə', 'w ɔ t ɝ', 'w ɔ ʈ ə r', 'w ɔ ɾ ɚ', 'w ɑ t ə ɹ'}
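
Under the hood, normalization amounts to stripping stress and length marks and splitting multi-character symbols into individual phones. A minimal sketch of the idea (illustrative only; the library's actual normalization rules may differ):

# Rough sketch of phoneme normalization; not Lexikos' actual implementation.
import unicodedata

def normalize_phoneme(phoneme: str) -> str:
    # Decompose characters so combining diacritics become separable.
    decomposed = unicodedata.normalize("NFD", phoneme)
    # Drop combining marks plus standalone stress/length marks.
    stripped = "".join(
        ch for ch in decomposed
        if not unicodedata.combining(ch) and ch not in "ˈˌː"
    )
    # Split any remaining multi-character symbol (digraph) into phones.
    return " ".join(stripped)

print(normalize_phoneme("ˈoʊ"))  # -> "o ʊ"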

To include synthetic (non-dictionary-based) pronunciations:

>>> from lexikos import Lexicon
>>> lexicon = Lexicon(include_synthetic=True)
>>> print(lexicon["athletic"])
{'æ t l ɛ t ɪ k', 'æ θ ˈl ɛ t ɪ k', 'æ θ l ɛ t ɪ k'}

Phonemization

>>> from lexikos import G2p
>>> g2p = G2p(lang="en-us")
>>> g2p("Hello there! $100 is not a lot of money in 2023.")
['h ɛ l o ʊ', 'ð ɛ ə ɹ', 'w ʌ n', 'h ʌ n d ɹ ɪ d', 'd ɑ l ɚ z', 'ɪ z', 'n ɒ t', 'ə', 'l ɑ t', 'ʌ v', 'm ʌ n i', 'ɪ n', 't w ɛ n t i', 't w ɛ n t i', 'θ ɹ iː']
>>> g2p = G2p(lang="en-au")
>>> g2p("Hi there mate! Have a g'day!")
['h a ɪ', 'θ ɛ ə ɹ', 'm e ɪ t', 'h e ɪ v', 'ə', 'ɡ ə ˈd æ ɪ']
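
The two APIs compose naturally: look words up in the Lexicon and fall back to the neural model for out-of-vocabulary words. A hedged sketch, assuming Lexicon supports a mapping-style membership test:

from lexikos import G2p, Lexicon

lexicon = Lexicon()
g2p = G2p(lang="en-us")

def phonemize_word(word: str) -> str:
    key = word.lower()
    if key in lexicon:  # assumes dict-like membership; adjust if the API differs
        # Dictionary entries are sets of variants; pick one arbitrarily.
        return next(iter(lexicon[key]))
    # Fall back to the neural model; G2p returns one string per word.
    return g2p(word)[0]

print(phonemize_word("water"))
print(phonemize_word("zorbling"))  # made-up word, forces the G2p fallback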

Dictionaries & Models

English (en)

| Language | Dictionary | Phone Set | Corpus | G2P Model |
| --- | --- | --- | --- | --- |
| en | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn |

English (en-US)

| Language | Dictionary | Phone Set | Corpus | G2P Model |
| --- | --- | --- | --- | --- |
| en-US | CMU Dict | ARPA | External Link | bookbot/byt5-small-cmudict |
| en-US | CMU Dict IPA | IPA | External Link | |
| en-US | CharsiuG2P | IPA | External Link | charsiu/g2p_multilingual_byT5_small_100 |
| en-US (Broad) | Wikipron | IPA | External Link | bookbot/byt5-small-wikipron-eng-latn-us-broad |
| en-US (Narrow) | Wikipron | IPA | External Link | |
| en-US | LibriSpeech | IPA | Link | |

English (en-UK)

| Language | Dictionary | Phone Set | Corpus | G2P Model |
| --- | --- | --- | --- | --- |
| en-UK | CharsiuG2P | IPA | External Link | charsiu/g2p_multilingual_byT5_small_100 |
| en-UK (Broad) | Wikipron | IPA | External Link | bookbot/byt5-small-wikipron-eng-latn-uk-broad |
| en-UK (Narrow) | Wikipron | IPA | External Link | |

English (en-AU)

| Language | Dictionary | Phone Set | Corpus | G2P Model |
| --- | --- | --- | --- | --- |
| en-AU (Broad) | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn-au-broad |
| en-AU (Narrow) | Wikipron | IPA | Link | |
| en-AU | AusTalk | IPA | Link | |
| en-AU | SC-CW | IPA | Link | |

English (en-CA)

| Language | Dictionary | Phone Set | Corpus | G2P Model |
| --- | --- | --- | --- | --- |
| en-CA (Broad) | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn-ca-broad |
| en-CA (Narrow) | Wikipron | IPA | Link | |

English (en-NZ)

| Language | Dictionary | Phone Set | Corpus | G2P Model |
| --- | --- | --- | --- | --- |
| en-NZ (Broad) | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn-nz-broad |
| en-NZ (Narrow) | Wikipron | IPA | Link | |

English (en-IN)

| Language | Dictionary | Phone Set | Corpus | G2P Model |
| --- | --- | --- | --- | --- |
| en-IN (Broad) | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn-in-broad |
| en-IN (Narrow) | Wikipron | IPA | Link | |

Training a G2P Model

We train G2P models with a modified version of 🤗 HuggingFace's sequence-to-sequence translation script, run_translation.py. Refer to their installation requirements for more details.

Training a new G2P model generally follows this recipe; fill in the (+) highlighted arguments with your pretrained model, dataset, and Hub model ID:

python run_translation.py \
+   --model_name_or_path $PRETRAINED_MODEL \
+   --dataset_name $DATASET_NAME \
    --output_dir $OUTPUT_DIR \
    --per_device_train_batch_size 128 \
    --per_device_eval_batch_size 32 \
    --learning_rate 2e-4 \
    --lr_scheduler_type linear \
    --warmup_ratio 0.1 \
    --num_train_epochs 10 \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --logging_strategy epoch \
    --max_source_length 64 \
    --max_target_length 64 \
    --val_max_target_length 64 \
    --pad_to_max_length True \
    --overwrite_output_dir \
    --do_train --do_eval \
    --bf16 \
    --predict_with_generate \
    --report_to tensorboard \
    --push_to_hub \
+   --hub_model_id $HUB_MODEL_ID \
    --use_auth_token
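
The script expects a seq2seq dataset of grapheme/phoneme pairs. A quick way to sanity-check a dataset before training, assuming the source/target column names used by eval.py below (a given dataset may name its columns differently):

from datasets import load_dataset

dataset = load_dataset("bookbot/cmudict-0.7b", split="train")
print(dataset.column_names)  # expected to include "source" and "target"
print(dataset[0])            # e.g. a spelling and its phoneme sequence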

Example: Fine-tune ByT5 on CMU Dict

python run_translation.py \
    --model_name_or_path google/byt5-small \
    --dataset_name bookbot/cmudict-0.7b \
    --output_dir ./byt5-small-cmudict \
    --per_device_train_batch_size 128 \
    --per_device_eval_batch_size 32 \
    --learning_rate 2e-4 \
    --lr_scheduler_type linear \
    --warmup_ratio 0.1 \
    --num_train_epochs 10 \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --logging_strategy epoch \
    --max_source_length 64 \
    --max_target_length 64 \
    --val_max_target_length 64 \
    --pad_to_max_length True \
    --overwrite_output_dir \
    --do_train --do_eval \
    --bf16 \
    --predict_with_generate \
    --report_to tensorboard \
    --push_to_hub \
    --hub_model_id bookbot/byt5-small-cmudict \
    --use_auth_token
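
Once pushed to the Hub, the result is a standard ByT5 checkpoint that loads with 🤗 Transformers. A minimal inference sketch (assuming the model takes a bare word as input; the exact input format may vary by checkpoint):

from transformers import AutoTokenizer, T5ForConditionalGeneration

model_id = "bookbot/byt5-small-cmudict"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# ByT5 operates on raw bytes, so no special preprocessing is needed.
inputs = tokenizer(["hello", "world"], return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_length=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))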

Evaluating a G2P Model

To evaluate a trained model, fill in the (+) highlighted arguments:

python eval.py \
+   --model $PRETRAINED_MODEL \
+   --dataset_name $DATASET_NAME \
    --source_text_column_name source \
    --target_text_column_name target \
    --max_length 64 \
    --batch_size 64

Example: Evaluate ByT5 on CMU Dict

python eval.py \
    --model bookbot/byt5-small-cmudict \
    --dataset_name bookbot/cmudict-0.7b \
    --source_text_column_name source \
    --target_text_column_name target \
    --max_length 64 \
    --batch_size 64
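
For a comparable back-of-the-envelope metric, note that with space-separated phonemes, word error rate computed over phoneme strings is effectively phoneme error rate. A sketch with the 🤗 Evaluate library (eval.py's own metrics may be computed differently):

import evaluate

wer = evaluate.load("wer")
predictions = ["HH AH L OW", "W ER L D"]
references = ["HH EH L OW", "W ER L D"]
# One substitution out of eight reference tokens -> PER = 0.125
print(wer.compute(predictions=predictions, references=references))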

Corpus Roadmap

Wikipron

| Language Family | Code | Region | Corpus | G2P Model |
| --- | --- | --- | --- | --- |
| African English | en-ZA | South Africa | | |
| Australian English | en-AU | Australia | | |
| East Asian English | en-CN, en-HK, en-JP, en-KR, en-TW | China, Hong Kong, Japan, South Korea, Taiwan | | |
| European English | en-UK, en-HU, en-IE | United Kingdom, Hungary, Ireland | 🚧 | 🚧 |
| Mexican English | en-MX | Mexico | | |
| New Zealand English | en-NZ | New Zealand | | |
| North American English | en-CA, en-US | Canada, United States | | |
| Middle Eastern English | en-EG, en-IL | Egypt, Israel | | |
| Southeast Asian English | en-TH, en-ID, en-MY, en-PH, en-SG | Thailand, Indonesia, Malaysia, Philippines, Singapore | | |
| South Asian English | en-IN | India | | |

Resources

References

@inproceedings{lee-etal-2020-massively,
    title = "Massively Multilingual Pronunciation Modeling with {W}iki{P}ron",
    author = "Lee, Jackson L.  and
      Ashby, Lucas F.E.  and
      Garza, M. Elizabeth  and
      Lee-Sikka, Yeonju  and
      Miller, Sean  and
      Wong, Alan  and
      McCarthy, Arya D.  and
      Gorman, Kyle",
    booktitle = "Proceedings of LREC",
    year = "2020",
    publisher = "European Language Resources Association",
    pages = "4223--4228",
}

@misc{zhu2022byt5,
    title={ByT5 model for massively multilingual grapheme-to-phoneme conversion}, 
    author={Jian Zhu and Cong Zhang and David Jurgens},
    year={2022},
    eprint={2204.03067},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}