[Help]: 4.0 not working. Unwanted distortion introduced in converted audio; source pitch not properly converted. #89

MuruganR96 · 2023-03-26T09:57:16Z

Please check the checkboxes below.

I have read README.md and Quick solution in wiki carefully.
I have been troubleshooting issues through various search engines. The questions I want to ask are not common.
I am NOT using one click package / environment package.

OS version

Linux e2e-99-151 4.15.0-206-generic #217-Ubuntu SMP Fri Feb 3 19:10:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

GPU

+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 3 0 0 | 8324MiB / 19968MiB | 14 0 | 1 0 1 0 0 |
| | 2MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 4 0 1 | 6MiB / 19968MiB | 14 0 | 1 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 5 0 2 | 6MiB / 19968MiB | 14 0 | 1 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 6 0 3 | 2290MiB / 19968MiB | 14 0 | 1 0 1 0 0 |
| | 2MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 3 0 16814 C /root/anaconda3/bin/python 8312MiB |
| 0 6 0 5768 C python3 2276MiB |
+-----------------------------------------------------------------------------+

Python version

Python 3.8.16

PyTorch version

Name: torch
Version: 1.13.1
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /root/anaconda3/envs/SOVITS/lib/python3.8/site-packages
Requires: nvidia-cublas-cu11, nvidia-cuda-nvrtc-cu11, nvidia-cuda-runtime-cu11, nvidia-cudnn-cu11, typing-extensions
Required-by: fairseq, torchaudio, torchvision, triton

Branch of sovits

4.0(Default)

Dataset source (Used to judge the dataset quality)

Recorded in recording studio

Where the problem occurs or what command you executed

python inference_main.py -m "logs/44k/G_200000.pth" -c "configs/config.json" -n "source.wav" -t 0 -s "aki" -a -cr 0.5

Problem description

Inference introduces unwanted distortion in the converted audio, and the source pitch is not converted properly.

Version 4.0 is not working properly. Please help me.

Log

python inference_main.py -m "logs/44k/G_200000.pth" -c "configs/config.json" -n "source.wav" -t 0 -s "aki" -a -cr 0.5

load model(s) from hubert/checkpoint_best_legacy_500.pt                                                                 
INFO:fairseq.tasks.text_to_speech:Please install tensorboardX: pip install tensorboardX                                 
INFO:fairseq.tasks.hubert_pretraining:current directory is /root/Experiments/NewExperiments/so-vits-svc-4.0-mean-spk-emb
INFO:fairseq.tasks.hubert_pretraining:HubertPretrainingTask Config {'_name': 'hubert_pretraining', 'data': 'metadata', 'fine_tuning': False, 'labels': ['km'], 'label_dir': 'label', 'label_rate': 50.0, 'sample_rate': 16000, 'normalize': False, 'enable_padding': False, 'max_keep_size': None, 'max_sample_size': 250000, 'min_sample_size': 32000, 'single_target': False, 'random_crop': True, 'pad_audio': False}
INFO:fairseq.models.hubert.hubert:HubertModel Config: {'_name': 'hubert', 'label_rate': 50.0, 'extractor_mode': default, 'encoder_layers': 12, 'encoder_embed_dim': 768, 'encoder_ffn_embed_dim': 3072, 'encoder_attention_heads': 12, 'activation_fn': gelu, 'layer_type': transformer, 'dropout': 0.1, 'attention_dropout': 0.1, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.05, 'dropout_input': 0.1, 'dropout_features': 0.1, 'final_dim': 256, 'untie_final_proj': True, 'layer_norm_first': False, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_temp': 0.1, 'target_glu': False, 'feature_grad_mult': 0.1, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': static, 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': static, 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': False, 'checkpoint_activations': False, 'required_seq_len_multiple': 2, 'depthwise_conv_kernel_size': 31, 'attn_type': '', 'pos_enc_type': 'abs', 'fp16': False}
load                                                                                                                    
INFO:root:Loaded checkpoint 'logs/44k/G_200000.pth' (iteration 574)                                                     
spk_list ===>  ['sai_dharam_tej']                                                                                       
#=====segment start, 5.82s======                                                                                        
vits use time:1.1171600818634033                                                                                        
#=====segment start, 3.4s======                                                                                         
vits use time:0.19401907920837402                                                                                       
#=====segment start, 0.025s======                                                                                       
jump empty segment                                                                                                      
#=====segment start, 4.6s======                                                                                         
vits use time:0.269317626953125                                                                                         
#=====segment start, 5.0s======                                                                                         
vits use time:0.1455228328704834                                                                                        
#=====segment start, 4.28s======                                                                                        
vits use time:0.12343764305114746                                                                                       
#=====segment start, 0.012s======                                                                                       
jump empty segment                                                                                                      
#=====segment start, 8.92s======                                                                                        
vits use time:0.17324042320251465                                                                                       
#=====segment start, 5.66s======                                                                                        
vits use time:0.14951491355895996                                                                                       
#=====segment start, 1.66s======                                                                                        
vits use time:0.1753239631652832                                                                                        
#=====segment start, 0.013s======                                                                                       
jump empty segment                                                                                                      
#=====segment start, 7.78s======                                                                                        
vits use time:0.14623379707336426                                                                                       
#=====segment start, 6.74s======                                                                                        
vits use time:0.12516403198242188                                                                                       
#=====segment start, 0.009s======                                                                                       
jump empty segment                                                                                                      
#=====segment start, 5.1s======                                                                                         
vits use time:0.18128347396850586                                                                                       
#=====segment start, 0.002s======                                                                                       
jump empty segment
#=====segment start, 6.12s======
vits use time:0.1354503631591797
#=====segment start, 6.46s======
vits use time:0.1435403823852539
#=====segment start, 7.08s======
vits use time:0.12348651885986328
#=====segment start, 4.66s======
vits use time:0.1376965045928955
#=====segment start, 5.56s======
vits use time:0.1471116542816162
#=====segment start, 0.013s======
jump empty segment
#=====segment start, 7.1s======
vits use time:0.18711566925048828
#=====segment start, 0.01s======
jump empty segment
#=====segment start, 5.484s======
vits use time:0.13010954856872559
#=====segment start, 6.54s======
vits use time:0.11691427230834961
#=====segment start, 8.96s======
vits use time:0.20655536651611328
#=====segment start, 6.52s======
vits use time:0.17476463317871094
#=====segment start, 5.14s======
vits use time:0.18201518058776855
#=====segment start, 5.56s======
vits use time:0.10942959785461426
#=====segment start, 13.42s======
vits use time:0.26901769638061523
#=====segment start, 0.009s======
jump empty segment

Screenshot `so-vits-svc` and `logs/44k` folders and paste here

SOURCE

CONVERTED

Supplementary description

The pitch is not converted properly in almost all of the converted samples. This is a major issue. Please look into it as soon as possible, friends.


NaruseMioShirakana · 2023-03-28T14:49:41Z

Try setting noise_scale and seed.

MuruganR96 · 2023-03-28T15:01:58Z

@NaruseMioShirakana Thank you very much for your helpful reply.

The default is noice_scale=0.4. As for the seed, which argument do I need to set?

MuruganR96 · 2023-03-28T19:04:00Z

@NaruseMioShirakana I tried noise_scale values of 0.0, 0.1, 0.4, 0.7, and 1.0, but the issue is the same: the source pitch is not converted correctly and distortion occurs.

@NaruseMioShirakana @Erythrocyte3803 Please help me. What else could be causing the problem? I also feel the naturalness is poor; the converted samples sound somewhat robotic.
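
For anyone repeating this sweep, the following is a minimal sketch that only builds one command line per noise scale value, reusing the arguments from the command in this issue. The `--noice_scale` flag name is an assumption taken from the parameter name quoted in this thread; check `python inference_main.py --help` before relying on it. No seed flag is named in this thread, so seeding is left out. The helper name `build_commands` is hypothetical.

```python
import shlex

# Noise scale values tried in this thread.
NOISE_SCALES = [0.0, 0.1, 0.4, 0.7, 1.0]


def build_commands(noise_scales,
                   model="logs/44k/G_200000.pth",
                   config="configs/config.json",
                   source="source.wav",
                   speaker="aki"):
    """Return one shell-quoted inference command per noise scale value."""
    commands = []
    for noise_scale in noise_scales:
        args = [
            "python", "inference_main.py",
            "-m", model,
            "-c", config,
            "-n", source,
            "-t", "0",
            "-s", speaker,
            "-a",
            "-cr", "0.5",
            # ASSUMPTION: flag name inferred from "noice_scale" in this thread.
            "--noice_scale", str(noise_scale),
        ]
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    for command in build_commands(NOISE_SCALES):
        print(command)
```

The script prints the commands rather than running them, so each one can be inspected and the output of each run saved under a different name before the next run.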

NaruseMioShirakana · 2023-03-31T05:30:32Z

Could you post a sample audio clip?

xiancaoro · 2023-04-05T14:02:56Z

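I also ran into the "unwanted distortion" and "high notes not converted properly" problems at first. Here is how I resolved them; I am not sure whether it will help you:
First, I improved the quality of the dataset. I selected 250 clean, vocal-only clips of 10 s to 20 s each and removed background noise and unclear vocals as far as possible. The higher the quality of the dataset, the better the final model. (Initially I had used only 40 dry singing-vocal clips of 15 s each, and that dataset suffered from distortion, lack of clarity, incompletely removed accompaniment, and some harmony vocals; the improved dataset discards all such data.)
Second, the audio used for inference must itself be high quality. Audio with incompletely removed harmonies or unclear high notes severely degrades the inference result.
Finally, I ran all inference with the default parameters and, to keep the timbre faithful, did not use the cluster model.
Additional note: I trained for 12,000 iterations and did not compare against models trained for other iteration counts.
I hope my personal experience helps you!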
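
The clip-length part of the dataset cleanup above can be checked automatically. This is a minimal sketch, assuming the dataset is a flat folder of uncompressed PCM `.wav` files (the standard-library `wave` module does not read other formats); the function names and the folder path in the usage note are hypothetical. It only checks duration and cannot judge noise, clarity, or leftover accompaniment, which still need listening.

```python
import wave
from pathlib import Path


def clip_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def select_clips(dataset_dir, min_seconds=10.0, max_seconds=20.0):
    """Split the .wav files in dataset_dir into (kept, rejected) name lists.

    A clip is kept when its duration lies within [min_seconds, max_seconds].
    """
    kept, rejected = [], []
    for path in sorted(Path(dataset_dir).glob("*.wav")):
        duration = clip_duration_seconds(path)
        if min_seconds <= duration <= max_seconds:
            kept.append(path.name)
        else:
            rejected.append(path.name)
    return kept, rejected
```

For example, `kept, rejected = select_clips("dataset_raw/speaker")` lists which files fall outside the 10 s to 20 s range so they can be re-cut or dropped before preprocessing.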

xiancaoro · 2023-04-05T14:15:07Z


Note: the vocal isolation tool I used is https://github.com/Anjok07/ultimatevocalremovergui.git

MuruganR96 added the `help wanted` label (the issue author is asking for help) on Mar 26, 2023
