voice2voice

Parallel data voice conversion based on the pix2pix architecture.

License: MIT

Summary

A non-conditional GAN system (neither the generator nor the discriminator is conditioned) based on the pix2pix architecture. The aim is to reconstruct the speech of a source speaker in the voice of a target speaker. The models are not conditioned because the non-linear misalignments between the source and target audio (for example, the two speakers talking at different speeds) make it impossible to learn a meaningful conditional mapping.
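To make the "non-conditional" point concrete, here is a minimal PatchGAN-style discriminator sketch in PyTorch. The repo's actual code is not shown here, so the layer widths and depth are illustrative assumptions; the key detail is that the discriminator scores a single spectrogram alone (one input channel), whereas the original conditional pix2pix discriminator would see the source and target spectrograms concatenated on the channel axis.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """Unconditional PatchGAN-style discriminator (illustrative sketch).

    Takes only the real or generated spectrogram (in_ch=1); a conditional
    pix2pix discriminator would instead take the (source, target) pair
    stacked on the channel axis (in_ch=2).
    """

    def __init__(self, in_ch=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
            nn.InstanceNorm2d(128),
            nn.LeakyReLU(0.2),
            # Final 1-channel map: each output cell scores one receptive-field
            # patch of the input as real or fake.
            nn.Conv2d(128, 1, kernel_size=4, stride=1, padding=1),
        )

    def forward(self, x):
        return self.net(x)
```

On a 1x256x256 spectrogram this produces a 63x63 grid of patch scores rather than a single scalar, which is the usual pix2pix design.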

Data

We trained and tested the system with the Voice Conversion Challenge 2018 data. For a (source, target) pair of audio samples (two different people uttering the same speech) we compute their Mel spectrograms so that each is a single-channel 256x256 image. These are the inputs of both the generator and the discriminator.

[Images: source and target Mel spectrograms of the same utterance]

Note how the data is misaligned: the speakers have different cadences, and sometimes there is even a pause in one of the samples but not in the other. Click on the image to download the audio.

Details

The architecture and training hyperparameters are the same as in the original paper, but we replaced the batch normalization layers with instance normalization layers in both the generator and the discriminator, as suggested here. We also use the mean squared error as the adversarial loss, as suggested here.
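The two changes above can be sketched in a few lines of PyTorch. This is not the repo's code; it assumes the standard pix2pix encoder block layout and simply swaps `BatchNorm2d` for `InstanceNorm2d`, and uses `MSELoss` as the least-squares (LSGAN-style) adversarial criterion in place of the original binary cross-entropy.

```python
import torch
import torch.nn as nn

def down_block(in_ch, out_ch):
    """One pix2pix encoder downsampling block (illustrative), with
    InstanceNorm2d in place of the paper's BatchNorm2d."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1),
        nn.InstanceNorm2d(out_ch),
        nn.LeakyReLU(0.2),
    )

# Least-squares adversarial loss: the discriminator output is regressed
# toward 1 for real inputs and 0 for fakes, instead of using BCE.
adv_loss = nn.MSELoss()
```

Instance normalization normalizes each sample (and channel) independently, which tends to behave better than batch statistics for image-to-image translation, and the least-squares loss is commonly reported to stabilize GAN training.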

Dependencies

Examples

[Images: source, target, and generated (fake) Mel spectrograms for three example utterances]
