This repository has been archived by the owner on Nov 11, 2023. It is now read-only.

[Help]: 4.0 not working. Unwanted distortion in converted audio; source pitch not properly converted. #89

Open
3 tasks done
MuruganR96 opened this issue Mar 26, 2023 · 6 comments
Labels
help wanted The issue author is asking for help

Comments

@MuruganR96
MuruganR96 commented Mar 26, 2023

Please check the checkboxes below.

  • I have read README.md and Quick solution in wiki carefully.
  • I have been troubleshooting issues through various search engines. The questions I want to ask are not common.
  • I am NOT using one click package / environment package.

OS version

Linux e2e-99-151 4.15.0-206-generic #217-Ubuntu SMP Fri Feb 3 19:10:13 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

GPU

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13 Driver Version: 525.60.13 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100 80G... Off | 00000000:01:01.0 Off | On |
| N/A 36C P0 93W / 300W | 10627MiB / 81920MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 3 0 0 | 8324MiB / 19968MiB | 14 0 | 1 0 1 0 0 |
| | 2MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 4 0 1 | 6MiB / 19968MiB | 14 0 | 1 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 5 0 2 | 6MiB / 19968MiB | 14 0 | 1 0 1 0 0 |
| | 0MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 6 0 3 | 2290MiB / 19968MiB | 14 0 | 1 0 1 0 0 |
| | 2MiB / 32767MiB | | |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 3 0 16814 C /root/anaconda3/bin/python 8312MiB |
| 0 6 0 5768 C python3 2276MiB |
+-----------------------------------------------------------------------------+

Python version

Python 3.8.16

PyTorch version

Name: torch Version: 1.13.1 Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration Home-page: https://pytorch.org/ Author: PyTorch Team Author-email: packages@pytorch.org License: BSD-3 Location: /root/anaconda3/envs/SOVITS/lib/python3.8/site-packages Requires: nvidia-cublas-cu11, nvidia-cuda-nvrtc-cu11, nvidia-cuda-runtime-cu11, nvidia-cudnn-cu11, typing-extensions Required-by: fairseq, torchaudio, torchvision, triton

Branch of sovits

4.0(Default)

Dataset source (Used to judge the dataset quality)

Recorded in recording studio

Where the problem occurs or what command you executed

python inference_main.py -m "logs/44k/G_200000.pth" -c "configs/config.json" -n "source.wav" -t 0 -s "aki" -a -cr 0.5

Problem description

The converted audio contains unwanted distortion, and the source pitch is not converted properly.

Version 4.0 is not working properly for me. Please help.

Log

python inference_main.py -m "logs/44k/G_200000.pth" -c "configs/config.json" -n "source.wav" -t 0 -s "aki" -a -cr 0.5

load model(s) from hubert/checkpoint_best_legacy_500.pt                                                                 
INFO:fairseq.tasks.text_to_speech:Please install tensorboardX: pip install tensorboardX                                 
INFO:fairseq.tasks.hubert_pretraining:current directory is /root/Experiments/NewExperiments/so-vits-svc-4.0-mean-spk-emb
INFO:fairseq.tasks.hubert_pretraining:HubertPretrainingTask Config {'_name': 'hubert_pretraining', 'data': 'metadata', '
fine_tuning': False, 'labels': ['km'], 'label_dir': 'label', 'label_rate': 50.0, 'sample_rate': 16000, 'normalize': Fals
e, 'enable_padding': False, 'max_keep_size': None, 'max_sample_size': 250000, 'min_sample_size': 32000, 'single_target':
 False, 'random_crop': True, 'pad_audio': False}                                                                        
INFO:fairseq.models.hubert.hubert:HubertModel Config: {'_name': 'hubert', 'label_rate': 50.0, 'extractor_mode': default,
 'encoder_layers': 12, 'encoder_embed_dim': 768, 'encoder_ffn_embed_dim': 3072, 'encoder_attention_heads': 12, 'activati
on_fn': gelu, 'layer_type': transformer, 'dropout': 0.1, 'attention_dropout': 0.1, 'activation_dropout': 0.0, 'encoder_l
ayerdrop': 0.05, 'dropout_input': 0.1, 'dropout_features': 0.1, 'final_dim': 256, 'untie_final_proj': True, 'layer_norm_
first': False, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_tem
p': 0.1, 'target_glu': False, 'feature_grad_mult': 0.1, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': static, '
mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'm
ask_channel_selection': static, 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1
, 'conv_pos': 128, 'conv_pos_groups': 16, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': Fals
e, 'checkpoint_activations': False, 'required_seq_len_multiple': 2, 'depthwise_conv_kernel_size': 31, 'attn_type': '', '
pos_enc_type': 'abs', 'fp16': False}                                                                                    
load                                                                                                                    
INFO:root:Loaded checkpoint 'logs/44k/G_200000.pth' (iteration 574)                                                     
spk_list ===>  ['sai_dharam_tej']                                                                                       
#=====segment start, 5.82s======                                                                                        
vits use time:1.1171600818634033                                                                                        
#=====segment start, 3.4s======                                                                                         
vits use time:0.19401907920837402                                                                                       
#=====segment start, 0.025s======                                                                                       
jump empty segment                                                                                                      
#=====segment start, 4.6s======                                                                                         
vits use time:0.269317626953125                                                                                         
#=====segment start, 5.0s======                                                                                         
vits use time:0.1455228328704834                                                                                        
#=====segment start, 4.28s======                                                                                        
vits use time:0.12343764305114746                                                                                       
#=====segment start, 0.012s======                                                                                       
jump empty segment                                                                                                      
#=====segment start, 8.92s======                                                                                        
vits use time:0.17324042320251465                                                                                       
#=====segment start, 5.66s======                                                                                        
vits use time:0.14951491355895996                                                                                       
#=====segment start, 1.66s======                                                                                        
vits use time:0.1753239631652832                                                                                        
#=====segment start, 0.013s======                                                                                       
jump empty segment                                                                                                      
#=====segment start, 7.78s======                                                                                        
vits use time:0.14623379707336426                                                                                       
#=====segment start, 6.74s======                                                                                        
vits use time:0.12516403198242188                                                                                       
#=====segment start, 0.009s======                                                                                       
jump empty segment                                                                                                      
#=====segment start, 5.1s======                                                                                         
vits use time:0.18128347396850586                                                                                       
#=====segment start, 0.002s======                                                                                       
jump empty segment
#=====segment start, 6.12s======
vits use time:0.1354503631591797
#=====segment start, 6.46s======
vits use time:0.1435403823852539
#=====segment start, 7.08s======
vits use time:0.12348651885986328
#=====segment start, 4.66s======
vits use time:0.1376965045928955
#=====segment start, 5.56s======
vits use time:0.1471116542816162
#=====segment start, 0.013s======
jump empty segment
#=====segment start, 7.1s======
vits use time:0.18711566925048828
#=====segment start, 0.01s======
jump empty segment
#=====segment start, 5.484s======
vits use time:0.13010954856872559
#=====segment start, 6.54s======
vits use time:0.11691427230834961
#=====segment start, 8.96s======
vits use time:0.20655536651611328
#=====segment start, 6.52s======
vits use time:0.17476463317871094
#=====segment start, 5.14s======
vits use time:0.18201518058776855
#=====segment start, 5.56s======
vits use time:0.10942959785461426
#=====segment start, 13.42s======
vits use time:0.26901769638061523
#=====segment start, 0.009s======
jump empty segment
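One detail visible in the log above: the command passes `-s "aki"`, while the log prints `spk_list ===>  ['sai_dharam_tej']`. Whether this mismatch is related to the pitch problem is not established in this thread. A minimal sketch for checking it before inference, assuming the config file has a top-level `"spk"` object mapping speaker names to ids (an assumption about the config layout, and `missing_speakers` is a hypothetical helper, not part of the repository):

```python
import json


def missing_speakers(config_text, requested):
    """Return the requested speaker names that are absent from the config.

    Assumes (not confirmed in this thread) that the config JSON has a
    top-level "spk" object whose keys are the trained speaker names.
    """
    config = json.loads(config_text)
    known = config.get("spk", {})
    return [name for name in requested if name not in known]


# Inline stand-in for configs/config.json, using the speaker name from the log.
example_config = '{"spk": {"sai_dharam_tej": 0}}'

# "aki" is the speaker passed with -s in the command above.
print(missing_speakers(example_config, ["aki"]))
```

With a real config, read the file contents and pass them in place of `example_config`; an empty list means every requested speaker is known to the model.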

Screenshot so-vits-svc and logs/44k folders and paste here

SOURCE

source_sovits

CONVERTED

converted_highlighted

Supplementary description

In almost all of the converted samples the pitch is not converted properly. This is a major issue. Please look into it as soon as possible, friends.

@MuruganR96 MuruganR96 added the help wanted The issue author is asking for help label Mar 26, 2023
@NaruseMioShirakana
Contributor

Try setting noise_scale and seed.
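The thread does not say how the seed is exposed by the inference script. As a minimal sketch, assuming the seed is fixed in Python before inference runs, the usual approach is to seed every random number generator in use. `seed_everything` is a hypothetical helper name; NumPy and PyTorch are seeded only if they are installed:

```python
import random


def seed_everything(seed):
    """Seed Python's RNG, plus NumPy and PyTorch when they are available."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


seed_everything(1234)
```

Calling this once with the same value before each run makes the sampled noise repeatable, so that differences between outputs come from the changed parameter rather than from random variation.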

@MuruganR96
Author

@NaruseMioShirakana Thank you very much for your helpful reply.

By default it uses noice_scale=0.4. As for the seed, which argument do I need to set?

@MuruganR96
Author

MuruganR96 commented Mar 28, 2023

@NaruseMioShirakana I tried noise_scale values of 0.0, 0.1, 0.4, 0.7, and 1.0, but the issue is the same: the source pitch is not converted correctly and distortion still occurs.

@NaruseMioShirakana @Erythrocyte3803 Please help me. What else could be causing the problem? The naturalness is also poor; the converted samples sound somewhat robotic.
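For reference, a sweep like the one described here can be scripted so that each run differs only in the noise value. This is a sketch under assumptions: the base command is the one from the issue body, and the name of the flag that carries the noise value is passed in by the caller because the thread never states it. `--noice_scale` below follows the spelling in the author's earlier question and should be checked against the script's own argument list:

```python
import shlex

# Base command copied from the issue body.
BASE_COMMAND = [
    "python", "inference_main.py",
    "-m", "logs/44k/G_200000.pth",
    "-c", "configs/config.json",
    "-n", "source.wav",
    "-t", "0",
    "-s", "aki",
    "-a",
    "-cr", "0.5",
]


def sweep_commands(noise_flag, values):
    """Build one command per noise value; noise_flag is supplied by the caller."""
    return [BASE_COMMAND + [noise_flag, str(value)] for value in values]


# "--noice_scale" is an assumed flag name, not confirmed in this thread.
for command in sweep_commands("--noice_scale", [0.0, 0.1, 0.4, 0.7, 1.0]):
    print(shlex.join(command))
```

The script only prints the commands; running them (for example with `subprocess.run`) and renaming each output file is left to the reader.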

@NaruseMioShirakana
Contributor

Could you post a sample audio file?

@xiancaoro

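I also ran into the "unwanted distortion" and "high notes not converted properly" problems at first. Here is what solved them for me; I am not sure whether it will help you:
First, I improved the quality of the dataset. I selected 250 clips of pure vocals, each 10 s to 20 s long, and removed background noise and unclear vocals as thoroughly as I could. The higher the quality of the dataset, the better the final model. (Initially I had selected only 40 clips of 15 s dry singing vocals. That dataset had distortion, unclear audio, incompletely removed accompaniment, and some harmonies; the improved dataset discards all such data.)
Second, the audio used for inference must itself be high quality. Audio with incompletely removed harmonies or unclear high notes severely degrades the inference result.
Finally, all of my inference runs used the default parameters. To keep the timbre matched, I did not use an aggregated model.
Additional note: I trained for 12,000 iterations and did not compare against models trained for other iteration counts.
I hope my experience helps!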
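The clip-length part of this advice is easy to automate. A minimal sketch, assuming the dataset is a set of uncompressed PCM `.wav` files readable by Python's standard `wave` module; the 10 s and 20 s bounds come from the comment above, while the function names and the `dataset_raw` directory are hypothetical:

```python
import glob
import wave


def wav_duration_seconds(path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def select_clips(paths, min_seconds=10.0, max_seconds=20.0):
    """Keep only the clips whose duration lies within the given bounds."""
    return [
        path for path in paths
        if min_seconds <= wav_duration_seconds(path) <= max_seconds
    ]


if __name__ == "__main__":
    # "dataset_raw" is a placeholder directory name for this example.
    candidates = glob.glob("dataset_raw/**/*.wav", recursive=True)
    for path in select_clips(candidates):
        print(path)
```

This only filters by length. Removing noise, accompaniment, and harmonies still has to be done by ear or with a separation tool.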

@xiancaoro

Note: the vocal isolation project I used is https://github.com/Anjok07/ultimatevocalremovergui.git
