GitHub - ofdn/Kathabhidhana: Open Source toolkit to record a large number of words in any language

Kathabhidhana consists of Free/Libre and Open Source Software, documentation for learning and using it, and open datasets that you can customize to shape your metadata.

Almost half of the world's 6909 living languages are expected to die within a century. In India alone, over 220 of a total of 780 languages have died in just the last 50 years. With these languages, a wealth of knowledge dies forever.

Kathabhidhana is an open toolkit to record a large number of words. It consists of free/libre and open source software, open datasets, methodologies and documentation. It can be used to record pronunciations of words to make a talking dictionary, or to record phonemes to create text-to-speech software.

We truly believe in openness and the FLOSS philosophy, so every single component of this toolkit is open. It also relies on other FLOSS tools built by many kind people in the open source movement.

A tool with many faces

Wikipedia has a sister project called Wiktionary, a multilingual dictionary where you can find not just the meanings of words in your own language but also the equivalent meanings of foreign-language words. Unlike many available dictionaries that help you learn pronunciations, Wiktionary does not have pronunciations for all words in all languages. Kathabhidhana was originally started by Subhashish Panigrahi to add pronunciations to the Odia-language Wiktionary. It is adapted from free software created by Shrinivasan T. It works on both Linux and Mac. The iOS version of Kathabhidhana was created by Prateek Pattanaik. You can certainly create pronunciations and add them to Wiktionary, but you can also use Kathabhidhana beyond that by making a large library of pronunciations that can be used to build any machine learning or Natural Language Processing (NLP) tool.

Currently several Odia-language words are being recorded, uploaded to Wikimedia Commons, and used in Odia Wiktionary. Feel free to use this toolkit with attribution, and even fork it and build something on top of it.

An Odia version of the resources and tutorial is available here. We are currently working on building more tutorials so that you can learn more about bettering your home studio setup, tips and tricks about batch renaming files, cleaning up using open source tools like Audacity, setting up files for batch upload on Wikimedia Commons, etc. So stay tuned.

What does this toolkit contain?

A recording tool (download for Linux/Mac and iOS, watch a video introduction to Kathabhidhana, watch a video tutorial for the iOS version)
Instruction manual to set up the hardware and software
Dependency tools
Audacity for post-recording batch cleanup (download; you can also check this tutorial in English and Odia to clean up vocals for individual recordings)
Pattypan for batch uploading recorded and edited audio files on Wikimedia Commons
Open dataset: CSV, .ods for reference while creating metadata for your recordings
Odia→International Phonetic Alphabet (IPA)/Roman converter for adding phonetic signs to the metadata while uploading. This converter works only with the Odia alphabet, but you can fork it and create one for your writing system too.

Prerequisites

Using a computer?

Linux or macOS
Linux running in a virtual machine

Using an iOS device? (check more here)

iOS (iPad or iPhone)
An app called Workflow

How to use it?

Download and set up Kathabhidhana (see the next section)
Set up your recording hardware (see mine in the picture above), e.g. the microphone (if using an external one) and computer settings such as the input level
Record using Kathabhidhana
Batch process the recordings using Audacity (tutorial coming soon, download Audacity from here)
Manually clean up each file (tutorial coming soon)
Set up Pattypan and upload the files to Commons (download from here)

Setting up Kathabhidhana

(You need to run the commands on Linux or Mac, or on Linux in a virtual machine if you're on Windows. Read in Odia)

Fill the words you want to record in a text file named "file"
Run the commands below

First go into the folder; for me, for instance, it is the "Kathabhidhana" folder under "Documents":

Then run:

python voice-record.py

The next steps are quite self-explanatory. You need to choose "Y" for yes and "N" for no in the following options inside your terminal.
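The first setup step, filling in the word list, can also be scripted. The sketch below is a minimal illustration and not part of the toolkit: it assumes voice-record.py reads one word per line from "file" (the exact format is not stated here), and the helper name write_word_list and the sample words are hypothetical.

```python
from pathlib import Path


def write_word_list(words, path="file"):
    """Write unique, non-empty words to `path`, one per line.

    Assumption: the recorder expects one word per line in a file named "file".
    """
    seen = set()
    cleaned = []
    for word in words:
        word = word.strip()
        # Skip blanks and repeats so each word is recorded only once.
        if word and word not in seen:
            seen.add(word)
            cleaned.append(word)
    Path(path).write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return cleaned


if __name__ == "__main__":
    # Placeholder entries; replace them with the words you want to record.
    sample = ["ଓଡ଼ିଆ", "ଭାଷା", " ଭାଷା ", ""]
    written = write_word_list(sample)
    print(len(written), "words written")
```

Writing the file as UTF-8 keeps non-Latin scripts such as Odia intact; run the script from inside the Kathabhidhana folder so that "file" lands next to voice-record.py.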

This will record the sounds in .ogg and .wav formats. To upload them to Wikimedia Commons, you can then use a tool like Pattypan to batch-upload either the .wav or the .ogg files.
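Before a batch upload it can help to have a simple sheet listing every recording and its metadata. The following is a rough sketch under stated assumptions: it assumes each recording is named after the word it contains (for example "word.ogg"), and the column names, the language code "or" and both function names are hypothetical. It does not produce Pattypan's own spreadsheet; it only gives you a reference list to copy from.

```python
import csv
from pathlib import Path


def build_metadata_rows(folder, language="or", extension=".ogg"):
    """Return one metadata row per audio file in `folder`, sorted by name.

    Assumption: the file stem is the recorded word (e.g. "word.ogg").
    """
    rows = []
    for audio in sorted(Path(folder).glob(f"*{extension}")):
        rows.append(
            {"filename": audio.name, "word": audio.stem, "language": language}
        )
    return rows


def write_metadata_csv(rows, out_path):
    """Write the rows to a UTF-8 CSV file with a header line."""
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle, fieldnames=["filename", "word", "language"]
        )
        writer.writeheader()
        writer.writerows(rows)
```

For example, `write_metadata_csv(build_metadata_rows("recordings"), "metadata.csv")` would list every .ogg file in a hypothetical "recordings" folder.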

Findings so far

• It takes about 20-25 mins to record 100 words
• Batch processing to convert and do an overall auto-cleanup using Audacity takes about 5 mins for a 100-word batch
• It takes an average of 30 secs per word to manually clean up, check quality, trim extra portions and do other such editing work using Audacity (meaning it will take about 50 mins to clean up a batch of 100 words)
• It takes about 5-10 mins to set up Pattypan to upload the cleaned-up words to Wikimedia Commons
• On average, one would spend roughly 1.5 hrs from recording to cleaning up to uploading for a batch of 100 words
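The per-step figures can be added up to estimate the time for a batch. The sketch below is only a back-of-the-envelope calculation: the function name batch_minutes is hypothetical, and it assumes that recording and automatic cleanup scale linearly with the number of words while the Pattypan setup time stays fixed. At 30 seconds per word, manual cleanup of 100 words comes to 50 minutes.

```python
def batch_minutes(
    words=100,
    record_per_100=(20, 25),     # minutes to record 100 words (low, high)
    auto_cleanup_per_100=5,      # minutes of Audacity batch processing per 100 words
    manual_secs_per_word=30,     # seconds of manual cleanup per word
    upload_setup=(5, 10),        # minutes to set up Pattypan (low, high), assumed fixed
):
    """Return a (low, high) estimate in minutes for one batch of `words`."""
    scale = words / 100
    manual = words * manual_secs_per_word / 60
    low = record_per_100[0] * scale + auto_cleanup_per_100 * scale + manual + upload_setup[0]
    high = record_per_100[1] * scale + auto_cleanup_per_100 * scale + manual + upload_setup[1]
    return low, high


if __name__ == "__main__":
    low, high = batch_minutes()
    print(f"{low:.0f}-{high:.0f} minutes ({low / 60:.1f}-{high / 60:.1f} hours)")
```

With the default figures this gives a range of 80 to 90 minutes for 100 words, in line with the rough 1.5-hour estimate.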

Other useful resources

Pronuncify by Asaf Bartov, a Kathabhidhana-like command line tool for both Linux/Mac and Windows.
LinguaLibre, a web and GUI-based tool developed by Wikimédia France for a workflow similar to Kathabhidhana's, with more functions

Attribution

The project is led by Subhashish Panigrahi and the iOS tool is led by Prateek Pattanaik. All the media and text content is available under a CC-BY-SA 4.0 license
All the software components are licensed under the GNU General Public License (GPL) version 3 (read the License page for more details)
This project and part of the documentation are based on the Voice recorder for Tawiktionary project created by Shrinivasan T (please attribute Shrinivasan T if you're making a derivative of the software)

Blogs/media shoutouts

Panigrahi, Subhashish. "A simple command-line tool for recording audio". Opensource.com (May 12, 2017)
Ojha, Bikash; Mishra, Chinmayee; Pattanaik, Prateek; Panigrahi, Subhashish; Patnaik, Sailesh; Elsharbaty, Samir. "Community digest: As Odia Wikisource turns two, a project to digitize rare books kicks off; news in brief". Wikimedia Blog (March 30, 2017)
Rezwan. "A New Audio Uploading Tool for Crowdsourced Wiktionary Project in Odia Language". Global Voices (February 13, 2017)

Talks/workshops

Workshop "Kathabhidhana: Recording words for Wiktionary and preparing for an AI assistant" (selected for Wikimania 2017, Montreal, Canada. Please join if you are attending Wikimania this year. Check back in late August for more updates about the workshop.)
