getalp/mass-dataset

MaSS - Multilingual corpus of Sentence-aligned Spoken utterances

This is the repository for the CMU multilingual speech extension dataset presented in the paper "MaSS: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the Bible".

Data

For copyright reasons, we are not allowed to share the audio files; however, we provide the extraction pipeline below. This pipeline can also be applied to new languages of interest. Inside the dataset folder, for each language we provide:

Pipeline

1) Downloading audio chapters from bible.is.

1.1. The audio files used in our work are available at the following links:

1.2. The audio files were converted from multi-channel to single-channel and force-aligned using this script.

1.3. The raw chapter text files are no longer available for download from the website, so we provide them at dataset/LANGUAGE/raw_txt/. For new languages, chapter text files can be extracted from this webpage. These chapter-level .txt files should be placed in the same folder as the audio files.
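The channel conversion mentioned in step 1.2 can be illustrated with a minimal sketch. This is not the script linked above: it assumes the audio has already been decoded to 16-bit PCM WAV, and the function names `downmix_to_mono` and `convert_wav_to_mono` are hypothetical.

```python
import array
import wave


def downmix_to_mono(samples, n_channels):
    """Average interleaved 16-bit samples across channels, frame by frame."""
    mono = array.array("h")
    for start in range(0, len(samples) - n_channels + 1, n_channels):
        frame = samples[start:start + n_channels]
        mono.append(sum(frame) // n_channels)
    return mono


def convert_wav_to_mono(src_path, dst_path):
    """Read a 16-bit PCM WAV file and write a single-channel copy.

    Assumption: a little-endian host, since WAV data is little-endian and
    array.array uses the native byte order.
    """
    with wave.open(src_path, "rb") as src:
        if src.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        n_channels = src.getnchannels()
        frame_rate = src.getframerate()
        samples = array.array("h")
        samples.frombytes(src.readframes(src.getnframes()))

    mono = downmix_to_mono(samples, n_channels)

    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(frame_rate)
        dst.writeframes(mono.tobytes())
```

Averaging the channels keeps the result within the 16-bit range without clipping; a dedicated audio tool may be preferable for other sample formats.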

2) Aligning the data with Maus forced aligner

For the covered languages, we make the output of the Maus forced aligner available in LANGUAGE/maus_textgrid/. For new languages, please check the website.

3) Obtaining speech alignment on a verse level

For each language, the audio files were sliced into verses using the chapter texts from step 1.3 and the TextGrids generated in step 2. More details are available here.
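As a rough illustration of how verse boundaries can be derived from a word-level alignment, the sketch below assumes that the aligned word intervals appear in the same order as the chapter text and that each verse contributes exactly as many intervals as it has whitespace-separated tokens. Both assumptions, the function name `verse_spans`, and the verse IDs are hypothetical; the repository's own procedure may differ.

```python
def verse_spans(word_intervals, verses):
    """Return (verse_id, start_sec, end_sec) for each verse.

    word_intervals: list of (start_sec, end_sec), one per aligned word, in order.
    verses: list of (verse_id, text) in chapter order.
    """
    spans = []
    cursor = 0
    for verse_id, text in verses:
        n_words = len(text.split())
        if n_words == 0:
            continue  # skip empty verses
        chunk = word_intervals[cursor:cursor + n_words]
        if len(chunk) < n_words:
            raise ValueError(f"not enough aligned words for verse {verse_id}")
        # A verse spans from the start of its first word to the end of its last.
        spans.append((verse_id, chunk[0][0], chunk[-1][1]))
        cursor += n_words
    return spans
```

The resulting time spans can then be used to cut each chapter recording into one audio segment per verse.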

4) ID equivalence across languages

To translate the IDs into English, we provide the simple Python script below:

python3 scripts/fetch_data.py <language folder> <output folder> <language code>

5) Generate a CSV file listing the verses available for each language

Use this script to generate a CSV file listing the verses available for each language. Since not every verse available in one language exists in another, this CSV file can be used to obtain the list of verses common to all languages.
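The idea behind this step can be sketched as a set intersection. The example below is not the repository's script: the CSV layout (one row per verse, one 0/1 column per language), the function names, and the verse IDs are assumptions made for illustration.

```python
import csv
import io


def common_verses(verses_by_language):
    """Return the sorted verse IDs present in every language."""
    sets = [set(ids) for ids in verses_by_language.values()]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


def write_availability_csv(verses_by_language, out_file):
    """Write one row per verse with a 0/1 availability flag per language."""
    languages = sorted(verses_by_language)
    all_verses = sorted(set().union(*verses_by_language.values()))
    writer = csv.writer(out_file)
    writer.writerow(["verse_id"] + languages)
    for verse in all_verses:
        flags = [int(verse in verses_by_language[lang]) for lang in languages]
        writer.writerow([verse] + flags)


# Hypothetical verse IDs for two languages.
example = {
    "en": ["MAT_1_1", "MAT_1_2"],
    "fr": ["MAT_1_2", "MAT_1_3"],
}
buffer = io.StringIO()
write_availability_csv(example, buffer)
shared = common_verses(example)
```

Here `shared` contains only the verse found in both languages, which is the subset needed to build parallel data across all languages.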

Paper Experiments

The speech-to-speech retrieval baseline model proposed in the paper is available here.

Citation

If you use this corpus in your experiments, please cite it using the following BibTeX entry:

@inproceedings{zanon-boito-etal-2020-mass,
    title = {{M}a{SS}: {A} {L}arge and {C}lean {M}ultilingual {C}orpus of {S}entence-aligned {S}poken {U}tterances {E}xtracted from the {B}ible},
    author = {Zanon Boito*, Marcely and Havard*, William and Garnerin, Mahault and Le Ferrand, Éric and Besacier, Laurent},
    booktitle = {Proceedings of the 12th Language Resources and Evaluation Conference},
    month = may,
    year = {2020},
    address = {Marseille, France},
    publisher = {European Language Resources Association},
    url = {https://aclanthology.org/2020.lrec-1.799},
    pages = {6486--6493},
    language = {English},
    isbn = {979-10-95546-34-4},
}

Team and Contact

The people behind the project:

You can contact them at first.last-name@univ-grenoble-alpes.fr
