
Lexikos - λεξικός /lek.si.kós/

A collection of pronunciation dictionaries and neural grapheme-to-phoneme models.

Install Lexikos

Install from PyPI

pip install lexikos

Editable Install from Source

git clone https://github.com/bookbot-hive/lexikos.git
pip install -e lexikos

Usage

Lexicon

>>> from lexikos import Lexicon
>>> lexicon = Lexicon()
>>> print(lexicon["added"])
{'ˈæ d ɪ d', 'ˈæ ɾ ə d', 'æ ɾ ɪ d', 'a d ɪ d', 'ˈa d ɪ d', 'æ ɾ ə d', 'ˈa d ə d', 'a d ə d', 'ˈæ d ə d', 'æ d ə d', 'æ d ɪ d', 'ˈæ ɾ ɪ d'}
>>> print(lexicon["runner"])
{'ɹ ʌ n ɚ', 'ɹ ʌ n ə', 'ɹ ʌ n ɝ', 'ˈr ʌ n ɝ'}
>>> print(lexicon["water"])
{'ˈʋ aː ʈ ə r ɯ', 'ˈw oː t ə', 'w ɑ t ə ɹ', 'ˈw aː ʈ ə r ɯ', 'ˈw ɔ t ɝ', 'w ɔ t ə ɹ', 'ˈw ɑ t ə ɹ', 'w ɔ t ɝ', 'w ɑ ɾ ɚ', 'ˈw ɑ ɾ ɚ', 'ˈʋ ɔ ʈ ə r', 'w ɔ ɾ ɚ', 'w ɔː t ə', 'ˈw oː ɾ ə', 'ˈw ɔ ʈ ə r'}

To get a lexicon where phonemes are normalized (diacritics removed, digraphs split):

>>> from lexikos import Lexicon
>>> lexicon = Lexicon(normalize_phonemes=True)
>>> print(lexicon["added"])
{'æ ɾ ɪ d', 'a d ɪ d', 'a d ə d', 'æ ɾ ə d', 'æ d ə d', 'æ d ɪ d'}
>>> print(lexicon["runner"])
{'ɹ ʌ n ɚ', 'ɹ ʌ n ə', 'r ʌ n ɝ', 'ɹ ʌ n ɝ'}
>>> print(lexicon["water"])
{'w o ɾ ə', 'w ɔ t ə', 'ʋ ɔ ʈ ə r', 'w a ʈ ə r ɯ', 'w ɔ t ə ɹ', 'ʋ a ʈ ə r ɯ', 'w ɑ ɾ ɚ', 'w o t ə', 'w ɔ t ɝ', 'w ɔ ʈ ə r', 'w ɔ ɾ ɚ', 'w ɑ t ə ɹ'}
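
Under the hood, normalization amounts to stripping stress and length marks and splitting multi-character symbols into individual phones. A minimal sketch of the idea (illustrative only; the library's actual normalization rules may differ):

# Rough sketch of phoneme normalization; not Lexikos' actual implementation.
import unicodedata

def normalize_phoneme(phoneme: str) -> str:
    # Decompose characters so combining diacritics become separable.
    decomposed = unicodedata.normalize("NFD", phoneme)
    # Drop combining marks plus standalone stress/length marks.
    stripped = "".join(
        ch for ch in decomposed
        if not unicodedata.combining(ch) and ch not in "ˈˌː"
    )
    # Split any remaining multi-character symbol (digraph) into phones.
    return " ".join(stripped)

print(normalize_phoneme("ˈoʊ"))  # -> "o ʊ"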

To include synthetic (non-dictionary-based) pronunciations:

>>> from lexikos import Lexicon
>>> lexicon = Lexicon(include_synthetic=True)
>>> print(lexicon["athletic"])
{'æ t l ɛ t ɪ k', 'æ θ ˈl ɛ t ɪ k', 'æ θ l ɛ t ɪ k'}

Phonemization

>>> from lexikos import G2p
>>> g2p = G2p(lang="en-us")
>>> g2p("Hello there! $100 is not a lot of money in 2023.")
['h ɛ l o ʊ', 'ð ɛ ə ɹ', 'w ʌ n', 'h ʌ n d ɹ ɪ d', 'd ɑ l ɚ z', 'ɪ z', 'n ɒ t', 'ə', 'l ɑ t', 'ʌ v', 'm ʌ n i', 'ɪ n', 't w ɛ n t i', 't w ɛ n t i', 'θ ɹ iː']
>>> g2p = G2p(lang="en-au")
>>> g2p("Hi there mate! Have a g'day!")
['h a ɪ', 'θ ɛ ə ɹ', 'm e ɪ t', 'h e ɪ v', 'ə', 'ɡ ə ˈd æ ɪ']
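
The two APIs compose naturally: look words up in the Lexicon and fall back to the neural model for out-of-vocabulary words. A hedged sketch, assuming Lexicon supports a mapping-style membership test:

from lexikos import G2p, Lexicon

lexicon = Lexicon()
g2p = G2p(lang="en-us")

def phonemize_word(word: str) -> str:
    key = word.lower()
    if key in lexicon:  # assumes dict-like membership; adjust if the API differs
        # Dictionary entries are sets of variants; pick one arbitrarily.
        return next(iter(lexicon[key]))
    # Fall back to the neural model; G2p returns one string per word.
    return g2p(word)[0]

print(phonemize_word("water"))
print(phonemize_word("zorbling"))  # made-up word, forces the G2p fallback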

Dictionaries & Models

English (en)

| Language | Dictionary | Phone Set | Corpus | G2P Model |
| --- | --- | --- | --- | --- |
| en | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn |

English (en-US)

| Language | Dictionary | Phone Set | Corpus | G2P Model |
| --- | --- | --- | --- | --- |
| en-US | CMU Dict | ARPA | External Link | bookbot/byt5-small-cmudict |
| en-US | CMU Dict IPA | IPA | External Link | |
| en-US | CharsiuG2P | IPA | External Link | charsiu/g2p_multilingual_byT5_small_100 |
| en-US (Broad) | Wikipron | IPA | External Link | bookbot/byt5-small-wikipron-eng-latn-us-broad |
| en-US (Narrow) | Wikipron | IPA | External Link | |
| en-US | LibriSpeech | IPA | Link | |

English (en-UK)

| Language | Dictionary | Phone Set | Corpus | G2P Model |
| --- | --- | --- | --- | --- |
| en-UK | CharsiuG2P | IPA | External Link | charsiu/g2p_multilingual_byT5_small_100 |
| en-UK (Broad) | Wikipron | IPA | External Link | bookbot/byt5-small-wikipron-eng-latn-uk-broad |
| en-UK (Narrow) | Wikipron | IPA | External Link | |

English (en-AU)

| Language | Dictionary | Phone Set | Corpus | G2P Model |
| --- | --- | --- | --- | --- |
| en-AU (Broad) | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn-au-broad |
| en-AU (Narrow) | Wikipron | IPA | Link | |
| en-AU | AusTalk | IPA | Link | |
| en-AU | SC-CW | IPA | Link | |

English (en-CA)

| Language | Dictionary | Phone Set | Corpus | G2P Model |
| --- | --- | --- | --- | --- |
| en-CA (Broad) | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn-ca-broad |
| en-CA (Narrow) | Wikipron | IPA | Link | |

English (en-NZ)

| Language | Dictionary | Phone Set | Corpus | G2P Model |
| --- | --- | --- | --- | --- |
| en-NZ (Broad) | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn-nz-broad |
| en-NZ (Narrow) | Wikipron | IPA | Link | |

English (en-IN)

| Language | Dictionary | Phone Set | Corpus | G2P Model |
| --- | --- | --- | --- | --- |
| en-IN (Broad) | Wikipron | IPA | Link | bookbot/byt5-small-wikipron-eng-latn-in-broad |
| en-IN (Narrow) | Wikipron | IPA | Link | |

Training a G2P Model

We train G2P models with a modified version of 🤗 HuggingFace's sequence-to-sequence translation script, run_translation.py. Refer to their installation requirements for more details.

Training a new G2P model generally follows this recipe; fill in the (+) highlighted arguments with your pretrained model, dataset, and Hub model ID:

python run_translation.py \
+   --model_name_or_path $PRETRAINED_MODEL \
+   --dataset_name $DATASET_NAME \
    --output_dir $OUTPUT_DIR \
    --per_device_train_batch_size 128 \
    --per_device_eval_batch_size 32 \
    --learning_rate 2e-4 \
    --lr_scheduler_type linear \
    --warmup_ratio 0.1 \
    --num_train_epochs 10 \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --logging_strategy epoch \
    --max_source_length 64 \
    --max_target_length 64 \
    --val_max_target_length 64 \
    --pad_to_max_length True \
    --overwrite_output_dir \
    --do_train --do_eval \
    --bf16 \
    --predict_with_generate \
    --report_to tensorboard \
    --push_to_hub \
+   --hub_model_id $HUB_MODEL_ID \
    --use_auth_token
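
The script expects a seq2seq dataset of grapheme/phoneme pairs. A quick way to sanity-check a dataset before training, assuming the source/target column names used by eval.py below (a given dataset may name its columns differently):

from datasets import load_dataset

dataset = load_dataset("bookbot/cmudict-0.7b", split="train")
print(dataset.column_names)  # expected to include "source" and "target"
print(dataset[0])            # e.g. a spelling and its phoneme sequence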

Example: Fine-tune ByT5 on CMU Dict

python run_translation.py \
    --model_name_or_path google/byt5-small \
    --dataset_name bookbot/cmudict-0.7b \
    --output_dir ./byt5-small-cmudict \
    --per_device_train_batch_size 128 \
    --per_device_eval_batch_size 32 \
    --learning_rate 2e-4 \
    --lr_scheduler_type linear \
    --warmup_ratio 0.1 \
    --num_train_epochs 10 \
    --evaluation_strategy epoch \
    --save_strategy epoch \
    --logging_strategy epoch \
    --max_source_length 64 \
    --max_target_length 64 \
    --val_max_target_length 64 \
    --pad_to_max_length True \
    --overwrite_output_dir \
    --do_train --do_eval \
    --bf16 \
    --predict_with_generate \
    --report_to tensorboard \
    --push_to_hub \
    --hub_model_id bookbot/byt5-small-cmudict \
    --use_auth_token
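
Once pushed to the Hub, the result is a standard ByT5 checkpoint that loads with 🤗 Transformers. A minimal inference sketch (assuming the model takes a bare word as input; the exact input format may vary by checkpoint):

from transformers import AutoTokenizer, T5ForConditionalGeneration

model_id = "bookbot/byt5-small-cmudict"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = T5ForConditionalGeneration.from_pretrained(model_id)

# ByT5 operates on raw bytes, so no special preprocessing is needed.
inputs = tokenizer(["hello", "world"], return_tensors="pt", padding=True)
outputs = model.generate(**inputs, max_length=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))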

Evaluating a G2P Model

To evaluate a trained model, fill in the (+) highlighted arguments:

python eval.py \
+   --model $PRETRAINED_MODEL \
+   --dataset_name $DATASET_NAME \
    --source_text_column_name source \
    --target_text_column_name target \
    --max_length 64 \
    --batch_size 64

Example: Evaluate ByT5 on CMU Dict

python eval.py \
    --model bookbot/byt5-small-cmudict \
    --dataset_name bookbot/cmudict-0.7b \
    --source_text_column_name source \
    --target_text_column_name target \
    --max_length 64 \
    --batch_size 64
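
For a comparable back-of-the-envelope metric, note that with space-separated phonemes, word error rate computed over phoneme strings is effectively phoneme error rate. A sketch with the 🤗 Evaluate library (eval.py's own metrics may be computed differently):

import evaluate

wer = evaluate.load("wer")
predictions = ["HH AH L OW", "W ER L D"]
references = ["HH EH L OW", "W ER L D"]
# One substitution out of eight reference tokens -> PER = 0.125
print(wer.compute(predictions=predictions, references=references))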

Corpus Roadmap

Wikipron

| Language Family | Code | Region | Corpus | G2P Model |
| --- | --- | --- | --- | --- |
| African English | en-ZA | South Africa | | |
| Australian English | en-AU | Australia | | |
| East Asian English | en-CN, en-HK, en-JP, en-KR, en-TW | China, Hong Kong, Japan, South Korea, Taiwan | | |
| European English | en-UK, en-HU, en-IE | United Kingdom, Hungary, Ireland | 🚧 | 🚧 |
| Mexican English | en-MX | Mexico | | |
| New Zealand English | en-NZ | New Zealand | | |
| North American English | en-CA, en-US | Canada, United States | | |
| Middle Eastern English | en-EG, en-IL | Egypt, Israel | | |
| Southeast Asian English | en-TH, en-ID, en-MY, en-PH, en-SG | Thailand, Indonesia, Malaysia, Philippines, Singapore | | |
| South Asian English | en-IN | India | | |

Resources

References

@inproceedings{lee-etal-2020-massively,
    title = "Massively Multilingual Pronunciation Modeling with {W}iki{P}ron",
    author = "Lee, Jackson L.  and
      Ashby, Lucas F.E.  and
      Garza, M. Elizabeth  and
      Lee-Sikka, Yeonju  and
      Miller, Sean  and
      Wong, Alan  and
      McCarthy, Arya D.  and
      Gorman, Kyle",
    booktitle = "Proceedings of LREC",
    year = "2020",
    publisher = "European Language Resources Association",
    pages = "4223--4228",
}

@misc{zhu2022byt5,
    title={ByT5 model for massively multilingual grapheme-to-phoneme conversion}, 
    author={Jian Zhu and Cong Zhang and David Jurgens},
    year={2022},
    eprint={2204.03067},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}