yc9701/pansori-tedxkr-corpus

License: CC BY-NC-ND 4.0

Pansori TEDxKR Corpus

The Pansori TEDxKR Corpus is a Korean automatic speech recognition (ASR) corpus generated from Korean-language TEDx talks given in Korea from 2010 to 2014. It contains about 3 hours of paired speech audio and transcripts from 41 speakers.

This corpus was generated by using a new corpus data ingestion and processing system called Pansori. Please refer to this code repository and the following paper for further information on the Pansori ASR corpus generation system:

Extra care was taken to maintain the quality of the generated corpus:

  • Only TEDx talks hand-transcribed by community translators were included.
  • Corpus fragments were segmented at subtitle boundaries.
  • Segmentation was fine-tuned by manual (tool-assisted) speech-text alignment.
  • Final validation was performed with a state-of-the-art speech recognizer (Google Cloud Speech-to-Text).

The speech audio files included in the corpus are 16-bit FLAC files with a sampling rate of 16 kHz. Further information on the included speech content is summarized in the following table:

| Title | Speaker | Gender | Location | Year | Fragments | Duration (m:ss) |
| --- | --- | --- | --- | --- | --- | --- |
| Appropriate technology | 이성범 | M | Seoul | 2010 | 87 | 5:58 |
| Making a village worth living in | 김혜정 | F | Busan | 2012 | 191 | 9:14 |
| The true owner of land | 남기업 | M | Busan | 2012 | 155 | 6:43 |
| Starting from where I am | 황두진 | M | Seoul | 2010 | 117 | 6:41 |
| Telling the new story in the old form | 이자람 | F | Seoul | 2010 | 92 | 7:50 |
| Dreaming a way to future aerial vehicle from unmanned aircraft | 구삼옥 | M | Daedeok | 2011 | 121 | 7:34 |
| Misconception about evaluations | 유정식 | M | Busan | 2012 | 158 | 6:43 |
| Be an artist, right now! | 김영하 | M | Seoul | 2013 | 131 | 5:47 |
| Communication is recovery | 박임순 | F | Busan | 2012 | 161 | 6:24 |
| Jeju Olleh | 서명숙 | F | Seoul | 2010 | 135 | 9:16 |
| DIY OOOSSSZZZ band | 유상준 | M | Seoul | 2010 | 44 | 2:22 |
| Dynamic biology | 이선희 | F | Daedeok | 2011 | 68 | 4:44 |
| Active immersion in thinking | 황농문 | M | Daejeon | 2012 | 84 | 5:01 |
| Becoming a good-earthling | 이현정 | F | Busan | 2011 | 95 | 3:53 |
| More humane medical experience | 김승범, 정혜진 | M, F | Seoul | 2010 | 80 | 4:36 |
| Finding new energy to overcome resource limits | 이경수 | M | Daejeon | 2010 | 53 | 4:43 |
| Which do you love, pictures or camera? | 박희진 | M | Busan | 2014 | 38 | 2:42 |
| Every citizen is a journalist | 오연호 | M | Seoul | 2010 | 61 | 4:10 |
| Take time to imagine the world to rights | 윤한결 | M | Busan | 2013 | 126 | 5:01 |
| With feeling the aesthetics of slowness | 이상은 | F | Daejeon | 2011 | 29 | 3:45 |
| Beating disabilities to pioneer grassroots journalism | 조주현 | M | Daejeon | 2010 | 37 | 3:56 |
| Statistics 3.0 | 이인실 | F | Busan | 2011 | 94 | 3:42 |
| Why Analytical Science? | 정광화 | F | Daedeok | 2011 | 58 | 3:56 |
| Redefinition of soil and its possibilities | 신근식 | M | Busan | 2011 | 76 | 3:51 |
| Predict disease with face | 김종열 | M | Daedeok | 2011 | 72 | 4:08 |
| Sustainable DoReMi | 고건혁 | M | Seoul | 2010 | 78 | 3:10 |
| ITER, towards the dream of a fusion energy era | 정기정 | M | Daedeok | 2010 | 45 | 3:35 |
| Winning the world with the 'DID' mindset | 송수용 | M | Daejeon | 2010 | 66 | 3:19 |
| Social venture is blue ocean | 김정현 | M | Busan | 2011 | 60 | 2:56 |
| No prerequisite learning, no worry | 신현승 | M | Busan | 2012 | 49 | 2:44 |
| Passion and challenge | 신창연 | M | Busan | 2011 | 88 | 2:46 |
| Are science and liberal arts equal? | 김상욱 | M | Busan | 2013 | 67 | 2:36 |
| Perspective, music and life | 다이나믹듀오 | M | Seoul | 2012 | 48 | 2:51 |
| The value of food discovered at the Haiti relief site (아이티 구호현장에서 발견한 음식의 가치) | 김재학 | M | Seoul | 2010 | 8 | 0:25 |
| A spirit of sharing information and culture 'CC' | 최진권 | M | Daejeon | 2010 | 18 | 1:42 |
| Gibbons, long-armed apes | 김산하 | M | Seoul | 2010 | 73 | 2:22 |
| Never let go of your passion, just keep working on it | 김대식 | M | Daejeon | 2010 | 23 | 1:50 |
| Inconvenient truth of Korean Web | 김기창 | M | Busan | 2012 | 37 | 1:52 |
| Statecraft, the art of conducting public affairs | 윤여준 | M | Seoul | 2010 | 46 | 1:59 |
| Korean traditional hawk hunting | 박용순 | M | Daejeon | 2011 | 21 | 1:09 |
| Multiple identity diaspora | 김경묵 | M | Seoul | 2010 | 1 | 0:12 |
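
The audio format described above (16-bit FLAC at 16 kHz) can be checked on any downloaded fragment without third-party libraries. The following is a minimal sketch, not part of the Pansori tooling: it reads the STREAMINFO block that the FLAC format places directly after the `fLaC` marker. The function names `read_flac_streaminfo` and `matches_corpus_format` are hypothetical helpers introduced here for illustration, and the sketch assumes the file starts with the `fLaC` marker (no prepended ID3 tag).

```python
def read_flac_streaminfo(data: bytes) -> dict:
    """Parse the STREAMINFO metadata block from the first 42 bytes of a FLAC file."""
    if data[:4] != b"fLaC":
        raise ValueError("not a FLAC stream (missing 'fLaC' marker)")
    block_type = data[4] & 0x7F               # low 7 bits: metadata block type
    length = int.from_bytes(data[5:8], "big")  # 24-bit block length
    body = data[8:8 + 34]
    if block_type != 0 or length < 34 or len(body) < 34:
        raise ValueError("STREAMINFO block missing or truncated")
    # Bytes 10..17 of STREAMINFO pack four fields into 64 bits:
    # 20-bit sample rate, 3-bit (channels - 1), 5-bit (bits - 1), 36-bit total samples.
    packed = int.from_bytes(body[10:18], "big")
    sample_rate = packed >> 44
    total_samples = packed & ((1 << 36) - 1)
    return {
        "sample_rate": sample_rate,
        "channels": ((packed >> 41) & 0x7) + 1,
        "bits_per_sample": ((packed >> 36) & 0x1F) + 1,
        "total_samples": total_samples,
        "duration_seconds": total_samples / sample_rate if sample_rate else 0.0,
    }


def matches_corpus_format(info: dict) -> bool:
    """True if the stream is 16-bit audio sampled at 16 kHz, as stated for this corpus."""
    return info["sample_rate"] == 16000 and info["bits_per_sample"] == 16


# Example usage (the file name is a placeholder, not an actual corpus path):
# with open("fragment.flac", "rb") as f:
#     info = read_flac_streaminfo(f.read(42))
# print(info, matches_corpus_format(info))
```

Reading only the header keeps the check fast when run over every fragment; decoding the audio itself would require a FLAC decoder.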

Corpus files can be downloaded individually or as a whole from the GitHub repository. Alternatively, the entire corpus is available as a single archive file at the following link: https://storage.googleapis.com/pansori/corpus/pansori-tedxkr-corpus-1.0.tar.gz [170MB].
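
For scripted setups, the archive download can be sketched with the Python standard library alone. This is an illustrative sketch rather than an official download tool: the helper names `download_archive` and `extract_flac_paths` are hypothetical, and the internal layout of the archive is not documented here, so the code only assumes that the audio fragments carry a `.flac` extension.

```python
import shutil
import tarfile
import urllib.request
from pathlib import Path

# URL taken from this README.
CORPUS_URL = "https://storage.googleapis.com/pansori/corpus/pansori-tedxkr-corpus-1.0.tar.gz"


def download_archive(url: str, dest: Path) -> Path:
    """Stream the archive to disk without loading it fully into memory."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


def extract_flac_paths(archive: Path, out_dir: Path) -> list:
    """Extract a .tar.gz archive and return the sorted paths of all .flac files found."""
    out_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: reject unsafe members (absolute paths, links outside out_dir).
            tar.extractall(out_dir, filter="data")
        else:
            tar.extractall(out_dir)
    return sorted(out_dir.rglob("*.flac"))


# Example usage (downloads about 170 MB; the local paths are placeholders):
# archive = download_archive(CORPUS_URL, Path("pansori-tedxkr-corpus-1.0.tar.gz"))
# flac_files = extract_flac_paths(archive, Path("pansori-tedxkr-corpus"))
# print(len(flac_files), "FLAC fragments extracted")
```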

We are currently preparing a large-scale Korean ASR corpus by further automating the data processing pipeline used to generate this TEDxKR corpus. The new corpus will also be released under a permissive license once we confirm the license terms with the license holder.
