
Speech Reconstruction with Reminiscent Sound via Visual Voice Memory


Overview

This repository contains the video demo and audio samples for the paper "Speech Reconstruction with Reminiscent Sound via Visual Voice Memory," submitted to IEEE TASLP.

Demo video

Each demo video contains the original speech, the speech generated by the previous work [1], and the speech generated by the proposed method, for four different speakers. The demo videos are also available here.

Performance

The objective measurements on the test samples for each setting (speaker-dependent, multi-speaker-dependent, multi-speaker-independent) are listed below.

| Setting | STOI | ESTOI | PESQ |
|---------|------|-------|------|
| Speaker-dependent (w/ Griffin-Lim) | 0.738 | 0.579 | 1.984 |
| Speaker-dependent (w/ WaveNet vocoder) | 0.737 | 0.578 | 1.984 |
| Multi-speaker-dependent | 0.754 | 0.602 | 2.112 |
| Multi-speaker-independent | 0.600 | 0.315 | 1.332 |
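
For reference, all three metrics can be computed with the open-source pystoi and pesq packages. The snippet below is a minimal sketch, not the paper's evaluation code; the file names and the 16 kHz sampling rate are assumptions.

```python
# Minimal sketch: compute STOI / ESTOI / PESQ between a ground-truth and a
# generated waveform. Assumes 16 kHz mono audio; the paper's exact
# evaluation setup may differ.
# pip install pystoi pesq librosa
import librosa
from pystoi import stoi
from pesq import pesq

FS = 16000  # assumed sampling rate (PESQ accepts only 8 kHz or 16 kHz)

ref, _ = librosa.load("ground_truth.wav", sr=FS)  # hypothetical file names
gen, _ = librosa.load("generated.wav", sr=FS)

# Trim both signals to the same length before scoring.
n = min(len(ref), len(gen))
ref, gen = ref[:n], gen[:n]

print("STOI :", stoi(ref, gen, FS, extended=False))
print("ESTOI:", stoi(ref, gen, FS, extended=True))
print("PESQ :", pesq(FS, ref, gen, "wb"))  # wide-band mode at 16 kHz
```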

Audio samples

The generated audio samples are available here.

Directory structure

The repository is organized as follows (a small traversal sketch follows the list):

  • audio-samples/speaker-dependent

    The results for the single-speaker-dependent setting on GRID

    ├── s1_sgib9s (for example) : the results of Speaker 1 saying "sgib9s" ("Set Green In B 9 Soon")
    |	├── ground_truth : ground-truth audio
    |	├── lip2wav : results of the previous state-of-the-art method [1] using Griffin-Lim
    |	├── ours_griffin_lim : results of the proposed method using Griffin-Lim
    |	├── ours_wavenet : results of the proposed method using the WaveNet vocoder
    |	    (provided only for the speaker-dependent setting, to verify that the proposed method also works with another vocoder)

  • audio-samples/multi-speaker-dependent

    The results for the multi-speaker-dependent setting on GRID

    • Same sub-folder structure as speaker-dependent, except that there is no WaveNet-vocoder audio.

  • audio-samples/multi-speaker-independent

    The results for the multi-speaker-independent setting on GRID

    • Same sub-folder structure as multi-speaker-dependent.
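
For scripted evaluation, the layout above can be walked to pair each ground-truth clip with its generated counterpart. The sketch below is an illustration under assumptions: the folder names follow the structure above, files are WAVs, and matching files share the same name.

```python
# Minimal sketch: collect (ground_truth, generated) WAV pairs from the
# audio-samples layout described above. The .wav extension and matching
# file names are assumptions.
from pathlib import Path

def collect_pairs(setting_dir: str, system: str = "ours_griffin_lim"):
    """Yield (ground-truth path, generated path) for every utterance folder."""
    for utt in sorted(Path(setting_dir).iterdir()):
        if not utt.is_dir():
            continue
        gt_dir, gen_dir = utt / "ground_truth", utt / system
        for gt in sorted(gt_dir.glob("*.wav")):
            gen = gen_dir / gt.name  # assume matching file names
            if gen.exists():
                yield gt, gen

for gt, gen in collect_pairs("audio-samples/speaker-dependent"):
    print(gt, "->", gen)
```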

GRID dataset dictionary

| Command | Color | Preposition | Letter | Digit | Adverb |
|---------|-------|-------------|--------|-------|--------|
| bin     | blue  | at          | A-Z (minus W) | 0-9 | again |
| lay     | green | by          |        |       | now    |
| place   | red   | in          |        |       | please |
| set     | white | with        |        |       | soon   |
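
Each utterance code (e.g. "sgib9s" in the folder names above) concatenates one character per slot in the order Command-Color-Preposition-Letter-Digit-Adverb. The decoder below is a sketch; the single-letter abbreviations are inferred from the vocabulary table, not documented by this repository.

```python
# Decode a 6-character GRID utterance code such as "sgib9s" into words.
# The per-slot single-letter abbreviations are inferred from the vocabulary
# table above; they are an assumption, not an official GRID specification.
COMMAND = {"b": "bin", "l": "lay", "p": "place", "s": "set"}
COLOR = {"b": "blue", "g": "green", "r": "red", "w": "white"}
PREPOSITION = {"a": "at", "b": "by", "i": "in", "w": "with"}
ADVERB = {"a": "again", "n": "now", "p": "please", "s": "soon"}

def decode_grid_code(code: str) -> str:
    """Map e.g. 'sgib9s' -> 'set green in b 9 soon'."""
    cmd, col, prep, letter, digit, adv = code
    return " ".join([COMMAND[cmd], COLOR[col], PREPOSITION[prep],
                     letter, digit, ADVERB[adv]])

print(decode_grid_code("sgib9s"))  # set green in b 9 soon
```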

References

[1] K. Prajwal, R. Mukhopadhyay, V. P. Namboodiri, and C. V. Jawahar, "Learning individual speaking styles for accurate lip to speech synthesis," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 13796-13805.
