HLTCHKUST/cantonese-asr

[Update][1 Feb, 2024] We have released the .wav data of the dataset: Google Drive Link. Please use it for research purposes only.

MDCC Dataset

This repository contains code and metadata to download the MDCC dataset, as described in the following paper:

Tiezheng Yu and Rita Frieske and Peng Xu and Samuel Cahyawijaya and Cheuk Tung Shadow Yiu and Holy Lovenia and Wenliang Dai and Elham J. Barezi and Qifeng Chen and Xiaojuan Ma and Bertram E. Shi and Pascale Fung. "Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset" Link: https://arxiv.org/pdf/2201.02419.pdf

@misc{yu2022automatic,
      title={Automatic Speech Recognition Datasets in Cantonese: A Survey and New Dataset}, 
      author={Tiezheng Yu and Rita Frieske and Peng Xu and Samuel Cahyawijaya and Cheuk Tung Shadow Yiu and Holy Lovenia and Wenliang Dai and Elham J. Barezi and Qifeng Chen and Xiaojuan Ma and Bertram E. Shi and Pascale Fung},
      year={2022},
      eprint={2201.02419},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Download

  1. The license file, named "MDCC_LICENSE", is in this folder. Please sign the license and send it to chinatysonyu@gmail.com.
  2. Then you can download the data from this Google Drive Link.
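
Once the data is downloaded and extracted, a quick sanity check of the audio files can be useful. The following is a minimal sketch, not part of this repository: it assumes the extracted folder contains PCM .wav files, and the function name `summarize_wavs` and the folder name `MDCC` are hypothetical placeholders. It uses only the Python standard library.

```python
import wave
from pathlib import Path


def summarize_wavs(root):
    """Return (relative path, sample rate in Hz, duration in seconds)
    for every .wav file found under root, sorted by path."""
    rows = []
    for path in sorted(Path(root).rglob("*.wav")):
        # wave.open reads only the header here; it raises wave.Error
        # if the file is not an uncompressed PCM WAV.
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
            duration = wav.getnframes() / rate
        rows.append((str(path.relative_to(root)), rate, duration))
    return rows


# Example call (assumes the data was extracted to a folder named "MDCC"):
# for name, rate, duration in summarize_wavs("MDCC"):
#     print(f"{name}\t{rate} Hz\t{duration:.2f} s")
```

Replace `MDCC` with the actual path of the extracted data. The real folder layout may differ from what this sketch assumes.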

Download checkpoints

Google Drive Link

How to run the code?

[TODO]