tusom2021

A phonetically transcribed speech dataset from East Tusom (an endangered Tibeto-Burman language of Northeast India).

Motivation

There is growing interest in universal phone recognition: creating ASR systems that can recognize speech from an arbitrary language as a sequence of phones, just as a trained field linguist can. However, few datasets are suitable for evaluating such systems. Tusom2021 is one step towards filling that gap. We hope that many scores of similar datasets will become available in the future.

Description

The data consists of a set of brief recordings (the WAV files in data/wav) and a YAML file (data/mapping.yml) that provides transcriptions and glosses for the WAV files. The YAML file consists of an object whose keys are the names of WAV files in data/wav and whose values are objects with the following fields:

"gloss": the associated gloss/translation
"no_tones": the transcriptions with no tones indicated
"tone_dias": the transcriptions with tones as combining diacritics
"tone_letters": the transcriptions with tones represented as Chao tone letters
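
The layout described above can be read with a few lines of Python. The following is a minimal sketch, not part of the dataset itself. It assumes the third-party PyYAML package is installed, that the keys of data/mapping.yml are file names that can be joined directly onto data/wav, and that every entry carries all four fields. The helper names (load_mapping, iter_entries, missing_wavs) are hypothetical and chosen only for illustration.

```python
from pathlib import Path

# Transcription fields described in this README (the fourth field is "gloss").
TRANSCRIPTION_FIELDS = ("no_tones", "tone_dias", "tone_letters")


def load_mapping(path="data/mapping.yml"):
    """Parse the YAML mapping into a dict keyed by WAV file name."""
    import yaml  # PyYAML (assumed installed); imported here so the other helpers work without it

    with open(path, encoding="utf-8") as f:
        return yaml.safe_load(f)


def iter_entries(mapping, wav_dir="data/wav", transcription="tone_letters"):
    """Yield (wav_path, transcription, gloss) for every entry in the mapping."""
    if transcription not in TRANSCRIPTION_FIELDS:
        raise ValueError(f"unknown transcription field: {transcription!r}")
    for wav_name, fields in mapping.items():
        # Assumption: each key is a file name relative to wav_dir.
        yield Path(wav_dir) / wav_name, fields[transcription], fields["gloss"]


def missing_wavs(mapping, wav_dir="data/wav"):
    """Return the mapping keys that have no corresponding file in wav_dir."""
    return [name for name in mapping if not (Path(wav_dir) / name).is_file()]
```

Typical use would be `mapping = load_mapping()` followed by `for wav_path, phones, gloss in iter_entries(mapping, transcription="no_tones"): ...`. Choosing the transcription field per call makes it easy to evaluate a recognizer with or without tone marking, and `missing_wavs(mapping)` offers a quick sanity check that the recordings and the mapping agree.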
