The replicated results don't match the demo. #21

Open
quizt35 opened this issue May 11, 2024 · 3 comments
quizt35 commented May 11, 2024

Hello! Thanks for sharing the pre-trained models and demos.
I would like to replicate the demo results using a pretrained model. For convenience, I used the data from the first row of the double-talk examples and converted the mp3 files to wav format (single channel, 16000 Hz, 16 bit). Based on the speech titles downloaded from the demo page, I selected the matching pkl file to process the original speech. However, there is a significant difference between the spectrograms on the demo page and those generated with the pre-trained model. I've checked every step and can't find the reason. Could you help me understand why?

(screenshots: spectrogram from the demo page vs. spectrogram of the pre-trained model output)

model tag: v1.0.1
The code I used is below:

import os

import librosa
import numpy as np
import soundfile as sf

from aec_eval import get_system_ckpt

# Location of the pre-trained AEC checkpoint (model tag v1.0.1)
ckpt_dir = "v1.0.1_models/aec/"
name = "meta_aec_16_combo_rl_4_1024_512_r2"
date = "2022_10_19_23_43_22"
epoch = 110

ckpt_loc = os.path.join(ckpt_dir, name, date)

# Restore the system and build the inference function
system, kwargs, outer_learnable = get_system_ckpt(
    ckpt_loc,
    epoch,
)
fit_infer = system.make_fit_infer(outer_learnable=outer_learnable)
fs = 16000

out_dir = "metaAF_output"
os.makedirs(out_dir, exist_ok=True)

# Far-end (u), mic (d), and near-end (s) signals; the echo is e = d - s
u, _ = librosa.load("u.wav", sr=fs)
d, _ = librosa.load("d.wav", sr=fs)
s, _ = librosa.load("s.wav", sr=fs)
e = d - s

# Add batch and channel dimensions: (batch, samples, channels)
d_input = {
    "u": u[None, :, None],
    "d": d[None, :, None],
    "s": s[None, :, None],
    "e": e[None, :, None],
}
pred = system.infer({"signals": d_input, "metadata": {}}, fit_infer=fit_infer)[0]
pred = np.array(pred[0, :, 0])

sf.write(os.path.join(out_dir, "_out.wav"), pred, fs)
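
For reference, this is a minimal sketch of how the spectrograms can be plotted for comparison with the demo page; the plot_spectrogram helper is only an illustration and assumes librosa and matplotlib are installed:

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

def plot_spectrogram(wav_path, fs=16000, n_fft=512, hop=256):
    # Log-magnitude STFT of a mono wav file, for visual comparison only
    y, _ = librosa.load(wav_path, sr=fs)
    S_db = librosa.amplitude_to_db(
        np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)), ref=np.max
    )
    librosa.display.specshow(S_db, sr=fs, hop_length=hop, x_axis="time", y_axis="hz")
    plt.colorbar(format="%+2.0f dB")
    plt.show()

plot_spectrogram("d.wav")                   # mic signal before cancellation
plot_spectrogram("metaAF_output/_out.wav")  # model output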

Looking forward to hearing from you, thanks!

@jmcasebeer
Collaborator

Hello and thanks for the question.

The demo files are all rescaled to [-1, 1] for playback (see the website footnote), which is not how the AEC data was set up for training. A previous GitHub issue noted the same problem and worked around it by rescaling with d = d / 10.
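
As a rough sketch of that workaround (the factor of 10 and applying it only to d follow the earlier issue; the exact per-file scales used by the demo normalization are not known):

import librosa

fs = 16000

# The demo wavs are peak-normalized to [-1, 1] for playback, so their
# absolute levels no longer match the training data.
d, _ = librosa.load("d.wav", sr=fs)

# Heuristic from the earlier issue: bring d back toward the training scale.
# The factor of 10 is approximate, not the original scale.
d = d / 10.0

If the files were normalized independently, the same consideration may apply to u, s, and e as well before feeding them to the model.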

If you want to replicate my results fully, I would recommend downloading the data from the AEC challenge and using that.


quizt35 commented May 14, 2024

Thanks for your reply. By applying a scale factor, I can get a more reasonable result, but there are still some minor issues. As shown in the figure below, there are similar impulses in the first few seconds of the speech. I'm wondering if this is due to the windowing or the format of the original speech. I will also follow your suggestion and test on the AEC Challenge datasets.
(screenshot: output spectrogram with impulses in the first few seconds)


quizt35 commented May 14, 2024

Additionally, should the URL for JAX in the ‘README - GPU Setup’ section be https://storage.googleapis.com/jax-releases/jax_cuda_releases.html?
