Hecate2/sukasuka-vocal-dataset-builder: SukaSuka anime vocal dataset. 1st anime vocal dataset. Extract audio (vocal) files from video based on .ass subtitle files, then manually label the vocal files by character. Will be used for PITS/VITS/Diffusion text-to-speech/SVC.

My Python code in this repo is licensed under the MIT License. Be aware that the anime, the subtitles, and the packages used (e.g. ffmpeg) may have other licenses.

Salute to all the contributors!

Episodes 09 & 10 labeled by 亡絮开始·祖安钢琴师

Episodes 11 & 12 labeled by 喵る桑

Drama CD 01 subtitled & labeled by camimo

Experimental synthesis (see the .mp3 & .flac files in the release) and model training performed by Aya.

TTS model using ESPnet by mio.

Dataset of Chtholly checked by mio; Ithea checked by camimo.

If you are going to train your own model, note that mio has further cleaned the dataset with demucs to remove non-vocal sounds and released it at huggingface.co. My releases here STILL INCLUDE NON-VOCAL SOUNDS.
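As a rough illustration of that kind of cleaning, the sketch below builds a demucs command line that splits a clip into vocals and everything else. The `--two-stems=vocals` and `-o` options belong to the demucs command-line tool; the helper names, the clip name, and the output folder are hypothetical, and this is not necessarily how mio's release was produced.

```python
import subprocess


def build_demucs_command(audio_path, out_dir):
    """Return a demucs command that separates vocals from the rest of the mix."""
    return [
        "demucs",
        "--two-stems=vocals",  # write only "vocals" and "no_vocals" stems
        "-o", str(out_dir),    # folder that receives the separated stems
        str(audio_path),
    ]


def separate_vocals(audio_path, out_dir="separated"):
    """Run demucs on one file (requires demucs to be installed)."""
    subprocess.run(build_demucs_command(audio_path, out_dir), check=True)


if __name__ == "__main__":
    # Hypothetical clip name, printed only to show the resulting command.
    print(" ".join(build_demucs_command("clip_0001.wav", "separated")))
```

Separation quality varies from clip to clip, so it is worth listening to a sample of the results before training on them.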

(Image created by Carzit using AI)

Contribution guides for potential Chthollists: Following Tasks!

Contributions of all kinds from anyone are welcome. That said, an ideal contributor would:

[THIS IS THE MOST IMPORTANT!] be familiar with the SukaSuka characters, especially their voices and personalities! At the very least you need to know their names... (head to the releases to check the English names)
understand how AI models are trained, and why and how we are building datasets
know something about .csv, or other plain-text formats like .json that are designed for both humans and machines
know about GitHub, Hugging Face, Civitai, etc.
be able to read or write basic programs
be familiar with AI-ops

Please always open an issue describing what you are going to do before you start contributing. Otherwise others may repeat (or may already have done) the same work, which wastes everyone's effort.

Verify meta.csv. There are surely mistakes.
Filter out non-vocal sounds in the dataset.
Mark vocal sounds that are not suitable for training in meta.csv. This requires some training experience. For example, a short and meaningless ああああ~ that strays from the character's normal pitch may pollute the model.
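Part of the first task can be automated. The sketch below is a minimal checker, assuming meta.csv uses the filename,character,content header described in this README. The function name `check_meta`, the sample file names, and the choice to treat an empty character field as "not labeled yet" are assumptions for illustration only.

```python
import csv
import io

EXPECTED_HEADER = ["filename", "character", "content"]


def check_meta(csv_text, known_characters):
    """Return a list of (row_number, message) problems found in meta.csv text."""
    reader = csv.reader(io.StringIO(csv_text))
    problems = []
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        problems.append((1, f"unexpected header: {header}"))
    seen = set()
    for row_no, row in enumerate(reader, start=2):
        if not row:  # skip blank lines
            continue
        if len(row) != 3:
            problems.append((row_no, f"expected 3 columns, got {len(row)}"))
            continue
        filename, character, _content = row
        if filename in seen:
            problems.append((row_no, f"duplicate filename: {filename}"))
        seen.add(filename)
        # Assumption: an empty character field means "not labeled yet".
        if character and character not in known_characters:
            problems.append((row_no, f"unknown character: {character}"))
    return problems


if __name__ == "__main__":
    # Hypothetical rows; the real file names in meta.csv may look different.
    sample = (
        "filename,character,content\n"
        "01_0001.wav,Chtholly,hello\n"
        "01_0002.wav,Chtolly,misspelled name\n"
        "01_0001.wav,Ithea,same file listed twice\n"
    )
    for row_no, message in check_meta(sample, {"Chtholly", "Ithea"}):
        print(row_no, message)
```

To check the real file, read it with something like `Path("meta.csv").read_text(encoding="utf-8")` (the encoding is an assumption) and pass the full set of character names from the releases. A script like this only catches mechanical mistakes; wrong speaker labels still need a human listener.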

How to build your dataset

Place your files like this:

sukasuka-vocal-dataset-builder:
  get_voice_from_video_and_subtitles.py
  divide_by_character.py
  (Others...)
[MH&Airota&FZSD&VCB-Studio] Shuumatsu Nani Shitemasuka？ Isogashii Desuka？ Sukutte Moratte Ii Desuka？ [Ma10p_1080p]:
  [MH&Airota&FZSD&VCB-Studio] sukasuka [01][Ma10p_1080p][x265_flac_aac].mkv
  (Others...)
[XKsub] 終末なにしてますか [简日·繁日双语字幕]:
  [XKsub] 終末なにしてますか chs_jap:
    Shuumatsu Nani Shitemasuka 01.chs_jap.ass
    (Others...)

Run get_voice_from_video_and_subtitles.py, then MANUALLY label the character of every line in sukasuka-vocal-dataset-builder/meta.csv. The format is filename,character,content; check that the first line of your csv file is exactly filename,character,content. Finally, run divide_by_character.py.
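To give an idea of what subtitle-driven extraction involves, here is a minimal sketch that reads Dialogue events from .ass text and builds one ffmpeg command per line. It is not the code of get_voice_from_video_and_subtitles.py. It assumes the standard ASS field order (Layer, Start, End, Style, Name, MarginL, MarginR, MarginV, Effect, Text), and the function names and file names are hypothetical.

```python
import re


def ass_time_to_seconds(stamp):
    """Convert an ASS timestamp such as '0:01:02.50' to seconds."""
    hours, minutes, seconds = stamp.strip().split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_dialogues(ass_text):
    """Yield (start, end, text) for every Dialogue event in the subtitle text."""
    for line in ass_text.splitlines():
        if not line.startswith("Dialogue:"):
            continue
        # The text is the 10th field and may itself contain commas.
        fields = line[len("Dialogue:"):].split(",", 9)
        if len(fields) < 10:
            continue
        start = ass_time_to_seconds(fields[1])
        end = ass_time_to_seconds(fields[2])
        # Drop override tags like {\an8} and turn \N line breaks into spaces.
        text = re.sub(r"\{[^}]*\}", "", fields[9]).replace(r"\N", " ").strip()
        if end > start and text:
            yield start, end, text


def ffmpeg_cut_command(video, start, end, out_path):
    """Build an ffmpeg command that saves the audio between start and end."""
    return [
        "ffmpeg", "-y",
        "-ss", f"{start:.2f}",       # seek to the start of the line
        "-i", video,
        "-t", f"{end - start:.2f}",  # keep only the duration of the line
        "-vn",                       # drop the video stream
        out_path,
    ]


if __name__ == "__main__":
    sample = "Dialogue: 0,0:00:01.00,0:00:03.50,Default,,0,0,0,,Hello, world"
    for index, (start, end, text) in enumerate(parse_dialogues(sample)):
        # Hypothetical input and output names.
        print(ffmpeg_cut_command("ep01.mkv", start, end, f"01_{index:04d}.wav"), text)
```

Each command can be run with `subprocess.run(cmd, check=True)`. Subtitle timings are rarely tight around the voice, so clips cut this way usually contain some music or silence at the edges, which is one reason the manual labeling pass matters.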

Drama CD dataset...

WIP. If you are interested, run drama_cd_transcript.py, and manually edit drama-cd-transcript/drama-cd-transcript.csv.

Data sources

subtitles: https://bbs.acgrip.com/thread-6124-1-1.html (licensed under AGPLv3 & CC BY-NC-SA 4.0)

anime videos: magnet:?xt=urn:btih:a05ba5cf6182e0757288c377fe8c06606a0f6428&dn=%5bMH%26Airota%26FZSD%26VCB-Studio%5d%20Shuumatsu%20Nani%20Shitemasuka%ef%bc%9f%20Isogashii%20Desuka%ef%bc%9f%20Sukutte%20Moratte%20Ii%20Desuka%ef%bc%9f%20%5bMa10p_1080p%5d&tr=udp%3a%2f%2ftracker.publicbt.com%3a80%2fannounce&tr=http%3a%2f%2ftr.bangumi.moe%3a6969%2fannounce&tr=http%3a%2f%2ft.nyaatracker.com%2fannounce&tr=http%3a%2f%2fopen.acgtracker.com%3a1096%2fannounce&tr=http%3a%2f%2fopen.nyaatorrents.info%3a6544%2fannounce&tr=http%3a%2f%2ft2.popgo.org%3a7456%2fannonce&tr=http%3a%2f%2fshare.camoe.cn%3a8080%2fannounce&tr=http%3a%2f%2fopentracker.acgnx.se%2fannounce&tr=http%3a%2f%2ftracker.acgnx.se%2fannounce&tr=http%3a%2f%2fnyaa.tracker.wf%3a7777%2fannounce&tr=udp%3a%2f%2ftracker.openbittorrent.com%3a80%2fannounce&tr=http%3a%2f%2ft.acg.rip%3a6699%2fannounce&tr=udp%3a%2f%2ftracker.prq.to%3a80%2fannounce&tr=http%3a%2f%2fshare.dmhy.org%2fannonuce&tr=http%3a%2f%2ftracker.btcake.com%2fannounce&tr=http%3a%2f%2ftracker.ktxp.com%3a6868%2fannounce&tr=http%3a%2f%2ftracker.ktxp.com%3a7070%2fannounce&tr=udp%3a%2f%2fbt.sc-ol.com%3a2710%2fannounce&tr=http%3a%2f%2fbtfile.sdo.com%3a6961%2fannounce&tr=https%3a%2f%2ft-115.rhcloud.com%2fonly_for_ylbud&tr=http%3a%2f%2fexodus.desync.com%3a6969%2fannounce&tr=udp%3a%2f%2fcoppersurfer.tk%3a6969%2fannounce&tr=http%3a%2f%2ftracker3.torrentino.com%2fannounce&tr=http%3a%2f%2ftracker2.torrentino.com%2fannounce&tr=udp%3a%2f%2fopen.demonii.com%3a1337%2fannounce&tr=udp%3a%2f%2ftracker.ex.ua%3a80%2fannounce&tr=http%3a%2f%2fpubt.net%3a2710%2fannounce&tr=http%3a%2f%2ftracker.tfile.me%2fannounce&tr=http%3a%2f%2fbigfoot1942.sektori.org%3a6969%2fannounce&tr=http%3a%2f%2fbt.sc-ol.com%3a2710%2fannounce
