alecokas/subword-embedding

subword-embedding

A tool for generating sub-word (phone or grapheme) level embeddings from an HTK-style MLF ASR corpus, as used in https://github.com/alecokas/BiLatticeRNN-Confidence

The ground truth transcription of the audio recording (*.mlf) is required to build the corpus. Additionally, a summary file mapping subword units to their respective phonetic pronunciation or Latin representation is required. An example snippet for the Georgian language is provided below:

ა G1;D1 GEORGIAN LETTER AN
ვ G2;D1 GEORGIAN LETTER VIN
ს G3;D1 GEORGIAN LETTER SAN
...
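
For illustration only, a summary file with this layout could be read into a lookup table as sketched below. This is not code from this repository: the names `SubwordEntry` and `parse_subword_summary` are hypothetical, and the sketch assumes whitespace-separated fields, with the subword unit first, a semicolon-separated code second, and the remainder of the line as a free-text description. The codes (for example `G1;D1`) are kept as opaque strings because their meaning is not specified here.

```python
from typing import Dict, NamedTuple, Tuple


class SubwordEntry(NamedTuple):
    """One row of the summary file (hypothetical container, not from the repo)."""
    codes: Tuple[str, ...]  # e.g. ("G1", "D1"), kept as opaque strings
    description: str        # e.g. "GEORGIAN LETTER AN"


def parse_subword_summary(text: str) -> Dict[str, SubwordEntry]:
    """Map each subword unit to its codes and description.

    Assumes one entry per line: <unit> <code;code;...> <description>.
    """
    entries = {}
    for line in text.splitlines():
        # Split into at most three fields so the description keeps its spaces.
        fields = line.split(maxsplit=2)
        if len(fields) < 2:
            continue  # skip blank or incomplete lines such as "..."
        unit, codes = fields[0], fields[1]
        description = fields[2] if len(fields) == 3 else ""
        entries[unit] = SubwordEntry(tuple(codes.split(";")), description)
    return entries


example = """\
ა G1;D1 GEORGIAN LETTER AN
ვ G2;D1 GEORGIAN LETTER VIN
ს G3;D1 GEORGIAN LETTER SAN
"""
summary = parse_subword_summary(example)
print(summary["ა"].codes)
```

When reading the summary from disk, open the file with `encoding="utf-8"` so that non-Latin subword units are decoded correctly.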

Usage

Run the following command:

python embed_subwords.py [arguments]

Dependencies

  • python 3.6.3
  • numpy 1.14.0
  • matplotlib 2.1.2
  • scikit-learn 0.19.1
