LAAC-LSCP/TsimaneForcedAligner

 
 


TsimaneForcedAligner

A forced aligner for the Tsimane language. This repository also contains other resources for Tsimane, such as a phonemizer and a phonetic dictionary, which can be used for purposes other than alignment.

Working environment

Clone this GitHub repository:

git clone https://github.com/yaya-sy/TsimaneForcedAligner.git

and move to it:

cd TsimaneForcedAligner

The conda environment is only needed if you want to download the bible corpus. To create it:

conda env create -f environment.yml

and activate it:

conda activate tsimane-scraper

Aligning the bible corpus

We release the file data/timemarks.txt containing audio timemarks for each verse of the bible corpus. It's a tab-separated file:

filename    verse_line_id   onset   offset

Lines with onset = offset = 0.0 correspond to unaligned verses; you can ignore them.
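As an illustration of how this file could be consumed, here is a minimal sketch that reads the four tab-separated columns and drops the unaligned verses. The `Timemark` type, the `read_timemarks` function, and the sample rows are hypothetical names and data introduced for this example. Whether the real file starts with a header row is not stated here, so the sketch simply skips any row whose onset and offset are not numbers.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class Timemark(NamedTuple):
    """One aligned verse (hypothetical container for this example)."""
    filename: str
    verse_line_id: str
    onset: float
    offset: float


def read_timemarks(handle: TextIO) -> Iterator[Timemark]:
    """Yield aligned verses from a tab-separated timemarks file."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        filename, verse_line_id, onset, offset = row
        try:
            onset_s, offset_s = float(onset), float(offset)
        except ValueError:
            continue  # skip a header row, if the file has one
        if onset_s == 0.0 and offset_s == 0.0:
            continue  # onset = offset = 0.0 marks an unaligned verse
        yield Timemark(filename, verse_line_id, onset_s, offset_s)


# Hypothetical sample rows, for illustration only (not taken from the real file).
sample = io.StringIO(
    "filename\tverse_line_id\tonset\toffset\n"
    "MRK_1\t1\t0.0\t0.0\n"
    "MRK_1\t2\t3.5\t7.25\n"
)
aligned = list(read_timemarks(sample))
```

To read the released file instead of the sample, the same function can be called on `open("data/timemarks.txt", newline="", encoding="utf-8")`.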

You can download the bible corpus with the script scripts/download_bible.py:

python scripts/download_bible.py --page live.bible.is/bible/CASNTM/MRK/1 --output-directory data

Note that the source code of the web page or the links may change, so this scraper may become obsolete.

Align your own corpus

To align a corpus you need:

  • a speech corpus: a folder containing your audio files and their corresponding texts (each audio file and its text must have the same filename).
  • an acoustic model: we release a pretrained acoustic model for aligning a new corpus. This model is pretrained on the bible corpus and is located in models/all_non_merged_glottal.zip
  • a phonetic dictionary: a vocabulary of the language that maps each word to its phonetic realization. A phonetic dictionary built from the Tsimane bible corpus is available in data/vocabularies/bible_vocabulary.dict. You can also phonemize your own vocabulary with the script scripts/phonemizer.py
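Before running the aligner, it can help to check that every audio file has a matching text file. The sketch below is one possible way to do that, not part of this repository. The function name `unpaired_files` and the default extensions `.wav` and `.txt` are assumptions; adjust them to your corpus. It reads "same filenames" as "same name once the extension is removed" and only looks at files directly inside the folder.

```python
from pathlib import Path
from typing import List, Tuple


def unpaired_files(corpus_dir: str,
                   audio_ext: str = ".wav",   # assumed audio extension
                   text_ext: str = ".txt"     # assumed text extension
                   ) -> Tuple[List[str], List[str]]:
    """Return (audio names without a text, text names without an audio)."""
    root = Path(corpus_dir)
    # Compare files by their name without extension (the "stem").
    audio = {p.stem for p in root.glob(f"*{audio_ext}")}
    text = {p.stem for p in root.glob(f"*{text_ext}")}
    return sorted(audio - text), sorted(text - audio)
```

For example, `unpaired_files("my_corpus")` returns two empty lists when every recording in the hypothetical folder `my_corpus` has its text and vice versa.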

To align your speech corpus, you will need to install the Montreal Forced Aligner.

After installation, you can align your corpus:

mfa align <your-speech-corpus> <your-phonetic-dictionary> models/tsimane_acoustic_model.zip  <output-folder> --clean --overwrite --temp_directory aligners/wnh_tsimane --num_jobs 1
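If you prefer to launch the alignment from Python, for instance to loop over several corpora, the command above can be assembled as an argument list and passed to `subprocess`. This is a sketch under stated assumptions: `build_mfa_align_command` and `run_alignment` are hypothetical helpers, `my_corpus` and `alignments` are placeholder folder names, and the flags are copied as-is from the command above without checking them against any particular Montreal Forced Aligner version.

```python
import shlex
import subprocess
from typing import List


def build_mfa_align_command(corpus: str, dictionary: str, acoustic_model: str,
                            output: str,
                            temp_directory: str = "aligners/wnh_tsimane",
                            num_jobs: int = 1) -> List[str]:
    """Build the `mfa align` argument list with the flags used in this README."""
    return ["mfa", "align", corpus, dictionary, acoustic_model, output,
            "--clean", "--overwrite",
            "--temp_directory", temp_directory,
            "--num_jobs", str(num_jobs)]


def run_alignment(command: List[str]) -> None:
    """Run the command; raises CalledProcessError if mfa exits with an error."""
    subprocess.run(command, check=True)


command = build_mfa_align_command(
    "my_corpus",                                # placeholder corpus folder
    "data/vocabularies/bible_vocabulary.dict",  # dictionary path from this README
    "models/tsimane_acoustic_model.zip",        # model path from the command above
    "alignments",                               # placeholder output folder
)
print(shlex.join(command))  # show the shell command before running it
```

Call `run_alignment(command)` once the printed command looks right and the Montreal Forced Aligner is installed.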
