MIDI SVS mode may produce uneven rhythms #60

Open
yqzhishen opened this issue Jul 28, 2022 · 0 comments
Labels
enhancement New feature or request must-read

Comments

@yqzhishen

Hello, and thank you for your great work. I tried the MIDI SVS mode of DiffSinger and found what may be a conceptual mistake in the phoneme duration inference logic, which can lead to uneven rhythms in the output voice.
This possible mistake relates to the definition of "note duration". Here I would like to show several examples.

Explanations of the duration of notes

As shown in the picture below, the duration of a note (containing one single syllable) is normally defined as the duration between the beginning of its vowel part and the beginning of the vowel part of the next note.
[Figure: duration_of_notes]

That is to say, notes begin at the beginning of their VOWEL parts, not their CONSONANT parts (as notes in MIDI SVS of DiffSinger currently do). When we sing, the rhythm sounds correct because every vowel starts at the right place, not because the consonants do. The length of a consonant may affect how strongly we perceive the syllable, but in theory it does not affect the rhythm.
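
To make this definition concrete, here is a minimal Python sketch of where each consonant and vowel would fall if notes were anchored at vowel onsets. The function name `vowel_aligned_segments` and the consonant durations are hypothetical. In a real system the consonant lengths would come from a duration predictor, and this is not DiffSinger code.

```python
from typing import List, Tuple


def vowel_aligned_segments(
    note_durs: List[float], consonant_durs: List[float]
) -> List[Tuple[float, float, float]]:
    """Return (consonant_start, vowel_start, vowel_end) for each note.

    note_durs[i] is the score duration of note i, measured from its vowel
    onset to the next vowel onset. consonant_durs[i] is an assumed
    (hypothetical) consonant length for note i, 0.0 if there is none.
    """
    if len(note_durs) != len(consonant_durs):
        raise ValueError("one consonant duration is needed per note")
    segments = []
    onset = 0.0  # vowel onset of the current note on the score grid
    for i, (dur, cons) in enumerate(zip(note_durs, consonant_durs)):
        next_cons = consonant_durs[i + 1] if i + 1 < len(note_durs) else 0.0
        # The consonant borrows time from before the note starts, and the
        # vowel ends where the next syllable's consonant begins.
        segments.append((onset - cons, onset, onset + dur - next_cons))
        onset += dur
    return segments


# Three equal notes with made-up consonant lengths.
segments = vowel_aligned_segments([0.5, 0.5, 0.5], [0.1, 0.05, 0.0])
for cons_start, vowel_start, vowel_end in segments:
    print(f"{cons_start:.3f} {vowel_start:.3f} {vowel_end:.3f}")
```

In this sketch the vowel onsets stay on the score grid regardless of the consonant lengths. The first consonant starts before time 0, so a short leading silence would be needed in practice.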

Consequences of this kind of inconsistency

This kind of inconsistency can lead to chaotic rhythms. Take the demo lyric "小酒窝长睫毛是你最美的记号" as an example. Here is its music score:
[Figure: music_score]

Thus, we input:

input text
小 酒 窝 长 睫 毛 SP 是 你 最 美 的 记 号

input note
C#4 | F#4 | G#4 | A#4 F#4 | F#4 C#4 | C#4 | rest | C#4 | A#4 | G#4 | A#4 | G#4 | F#4 | C#4 

input duration
0.315789 | 0.315789 | 0.315789 | 0.315789 0.315789 | 0.315789 0.315789 | 0.315789 | 0.315789 | 0.315789 | 0.315789 | 0.315789 | 0.315789 | 0.315789 | 0.315789 | 0.315789 
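
The input above appears to pair each syllable with one `|`-separated group of notes and one group of durations, where a group with several entries is a slur. A minimal Python sketch for checking that the three lines agree could look like this. The function name `parse_midi_svs_input` is hypothetical and not part of DiffSinger.

```python
from typing import List, Tuple


def parse_midi_svs_input(
    text: str, notes: str, durs: str
) -> List[Tuple[str, List[str], List[float]]]:
    """Split the three input lines and check that they line up."""
    syllables = text.split()
    note_groups = [group.split() for group in notes.split("|")]
    dur_groups = [[float(d) for d in group.split()] for group in durs.split("|")]
    if not (len(syllables) == len(note_groups) == len(dur_groups)):
        raise ValueError("text, notes and durations have different lengths")
    for syl, ns, ds in zip(syllables, note_groups, dur_groups):
        if len(ns) != len(ds):
            raise ValueError(f"note/duration count mismatch for {syl!r}")
    return list(zip(syllables, note_groups, dur_groups))


text = "小 酒 窝 长 睫 毛 SP 是 你 最 美 的 记 号"
notes = ("C#4 | F#4 | G#4 | A#4 F#4 | F#4 C#4 | C#4 | rest | "
         "C#4 | A#4 | G#4 | A#4 | G#4 | F#4 | C#4")
durs = " | ".join(
    ["0.315789"] * 3 + ["0.315789 0.315789"] * 2 + ["0.315789"] * 9
)

parsed = parse_midi_svs_input(text, notes, durs)
total = sum(d for _, _, ds in parsed for d in ds)
print(len(parsed), "syllables,", round(total, 6), "seconds")
```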

The output audio sounds weird and is probably not sung in rhythm ("小酒窝_diffsinger_raw.wav" in the attachment).

I then used another algorithm to predict the duration of each phone, and adjusted the input durations to try to fix the incorrect rhythm:

input text
小 酒 窝 长 睫 毛 SP 是 你 最 美 的 记 号

input note
C#4 | F#4 | G#4 | A#4 F#4 | F#4 C#4 | C#4 | rest | C#4 | A#4 | G#4 | A#4 | G#4 | F#4 | C#4 

input duration
0.390789 | 0.375789 | 0.25579 | 0.420789 0.210789 | 0.420789 0.21079 | 0.420789 | 0.13579 | 0.405789 | 0.30079 | 0.330789 | 0.36079 | 0.25579 | 0.315789 | 0.42079 

The output audio sounds much better ("小酒窝_diffsinger_ fixed_phone_durations.wav" in the attachment).
However, since MIDI SVS mode of DiffSinger only lets us specify where the consonant parts begin, not where the vowel parts begin, we may never get exactly correct rhythms (in theory).
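
One way to approximate such a workaround is to shift each note boundary earlier by the length of the consonant that follows it. The Python sketch below shows the idea under stated assumptions: the function name `to_consonant_onset_durations` and the consonant lengths are hypothetical, and this is not necessarily the algorithm used to produce the adjusted durations above.

```python
from typing import List


def to_consonant_onset_durations(
    note_durs: List[float], consonant_durs: List[float]
) -> List[float]:
    """Convert vowel-to-vowel note durations into consonant-to-consonant ones.

    consonant_durs[i] is an assumed (hypothetical) consonant length for
    note i, 0.0 for slur continuations and syllables without a consonant.
    """
    if len(note_durs) != len(consonant_durs):
        raise ValueError("one consonant duration is needed per note")
    shifted = []
    for i, dur in enumerate(note_durs):
        next_cons = consonant_durs[i + 1] if i + 1 < len(note_durs) else 0.0
        # Gain this note's own consonant, give up time to the next one.
        new_dur = dur + consonant_durs[i] - next_cons
        if new_dur <= 0:
            raise ValueError(f"note {i} is too short for the next consonant")
        shifted.append(new_dur)
    return shifted


beat = 0.315789
# Made-up consonant lengths for four equal notes.
shifted = to_consonant_onset_durations([beat] * 4, [0.075, 0.06, 0.0, 0.105])
print([round(d, 6) for d in shifted])
```

The shifts cancel between neighbouring notes, so the total length grows only by the first consonant. The consonant lengths themselves still have to be guessed outside the model, which is why a native way to specify vowel onsets would be preferable.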

As a comparison, I produced a piece of audio with X Studio (Xiaoice Sing) that has the correct rhythm ("小酒窝_xiaoicesing_correct_rhythm.wav" in the attachment).

Here are the audios: audios.zip

My expectations

My teammates and I are trying to bring DiffSinger to more ordinary fans and users of SVS technology and products. These people (or, you could say, most people) are more familiar with an interaction mode that takes notes or music scores as input. Therefore, correct rhythms are important and can help a lot.
It would help a lot if you could fix this rhythm issue (i.e. let users specify the beginning of vowels, and predict the duration of the consonants).
I'm looking forward to your improvements.

MoonInTheRiver pinned this issue on Mar 26, 2023
MoonInTheRiver added the enhancement, documentation, and must-read labels and removed the documentation label on Apr 6, 2023