Speech Driven Tongue Animation


Advances in speech-driven animation techniques now make it possible to create convincing animations of virtual characters solely from audio data. While many approaches focus on facial and lip motion, they often do not provide realistic animation of the inner mouth. Performance or motion capture of the tongue and jaw from video alone is difficult because the inner mouth is only partially observable during speech. In this work, we collected a large-scale speech-to-tongue mocap dataset that captures tongue, jaw, and lip motion during speech. This dataset enables research on data-driven techniques for realistic inner-mouth animation. We present a method that leverages recent deep-learning-based audio feature representations to build a robust and generalizable speech-to-animation pipeline. We find that self-supervised, deep-learning-based audio feature encoders are robust and generalize well to unseen speakers and content.

Links: [Project] | [Paper] | [Video] | [Data]

Data

The data can be downloaded from this link. The dataset includes:

  • Mono audio in WAV format, sampled at 16 kHz
  • EMA 3D landmark sequences at 50 FPS
  • Audio transcripts
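
For reference, a minimal sketch of loading these three modalities is shown below. The file names and the .npy landmark layout are assumptions, so adapt them to the actual structure of the download.

import numpy as np
import soundfile as sf

# Hypothetical file names; adapt them to the layout of the downloaded dataset.
audio, sample_rate = sf.read("data/utterance_0001.wav")   # mono, 16 kHz
assert sample_rate == 16000

# EMA landmark sequence, assumed here to be a (num_frames, num_landmarks, 3) array at 50 FPS.
landmarks = np.load("data/utterance_0001_ema.npy")

with open("data/utterance_0001.txt") as f:                # transcript
    transcript = f.read().strip()

# Audio and landmarks cover the same utterance: 16000 samples/s vs. 50 landmark frames/s.
print(len(audio) / sample_rate, landmarks.shape[0] / 50.0)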

Code

👷👷👷 UNDER CONSTRUCTION 👷👷👷

Installation

Conda Environment

Create the conda environment from the YAML file envs/tongueanim.yaml:

conda env create -f envs/tongueanim.yaml

Wav2Vec

Our best model uses Wav2Vec audio features. To use it, download the pretrained Wav2Vec model from the Fairseq repository and place the checkpoint under the models/ folder.
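
As a quick sanity check that the checkpoint is in place, the pretrained model can be loaded through fairseq roughly as follows. The checkpoint file name is an assumption; use the name of the file you actually downloaded.

import torch
import fairseq

# Path to the downloaded wav2vec checkpoint (file name is an assumption).
cp_path = "models/wav2vec_large.pt"
models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([cp_path])
model = models[0]
model.eval()

# Run a dummy 16 kHz waveform through the encoder to verify the model loads.
wav_input_16khz = torch.randn(1, 16000)
with torch.no_grad():
    z = model.feature_extractor(wav_input_16khz)   # local features
    c = model.feature_aggregator(z)                # context features: (1, feat_dim, num_frames)
print(c.shape)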

Pipeline

Our pipeline consists of the following stages:

  1. Extract audio features with the wav2vec model
  2. Build the dataset to train the model
  3. Train the landmark prediction model
  4. Evaluate the model
  5. Visualize the results

1. Audio Feature Extraction
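
The script for this stage has not been released yet. In the meantime, below is a minimal sketch of extracting and caching per-utterance wav2vec features, assuming the model is loaded as shown above and using hypothetical data/wav and features/ directories.

from pathlib import Path
import numpy as np
import soundfile as sf
import torch

@torch.no_grad()
def extract_features(model, wav_dir="data/wav", out_dir="features"):
    # Directory names are assumptions; point them at the downloaded audio.
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    for wav_path in sorted(Path(wav_dir).glob("*.wav")):
        audio, sr = sf.read(wav_path, dtype="float32")
        assert sr == 16000, "the dataset audio is 16 kHz mono"
        x = torch.from_numpy(audio).unsqueeze(0)          # (1, num_samples)
        z = model.feature_extractor(x)
        c = model.feature_aggregator(z)                   # (1, feat_dim, num_frames)
        feats = c.squeeze(0).transpose(0, 1).numpy()      # (num_frames, feat_dim)
        np.save(Path(out_dir) / f"{wav_path.stem}.npy", feats)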

2. Building the dataset
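
As a sketch of the pairing step, assume features are cached as above and EMA landmarks are stored as per-utterance .npy arrays at 50 FPS. Wav2vec context features come out at roughly 100 frames per second (10 ms hop), i.e. about two feature frames per landmark frame, so the sketch below downsamples the features by a factor of two and truncates both streams to a common length. The exact alignment used in the paper may differ.

from pathlib import Path
import numpy as np
import torch
from torch.utils.data import Dataset

class TongueDataset(Dataset):
    """Pairs cached wav2vec features with EMA landmark sequences (illustrative sketch)."""

    def __init__(self, feat_dir="features", ema_dir="data/ema", feat_per_landmark=2):
        # Directory names and the 2:1 frame-rate ratio are assumptions.
        self.items = []
        for feat_path in sorted(Path(feat_dir).glob("*.npy")):
            ema_path = Path(ema_dir) / feat_path.name      # assumes matching basenames
            feats = np.load(feat_path)                     # (T_feat, feat_dim)
            ema = np.load(ema_path)                        # (T_ema, num_landmarks, 3)
            feats = feats[::feat_per_landmark]             # ~100 Hz -> ~50 Hz
            T = min(len(feats), len(ema))                  # truncate to common length
            self.items.append((feats[:T].astype(np.float32),
                               ema[:T].reshape(T, -1).astype(np.float32)))

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        feats, ema = self.items[idx]
        return torch.from_numpy(feats), torch.from_numpy(ema)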

3. Training the model
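
Below is a generic training-loop sketch with a small GRU regressor as a placeholder. It is not the architecture from the paper, only an illustration of mapping per-frame audio features to flattened landmark coordinates with an MSE loss.

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

class LandmarkRegressor(nn.Module):
    # Placeholder model: a single GRU followed by a linear projection.
    def __init__(self, feat_dim=512, num_landmarks=10, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_landmarks * 3)

    def forward(self, feats):                       # feats: (B, T, feat_dim)
        h, _ = self.rnn(feats)
        return self.head(h)                         # (B, T, num_landmarks * 3)

def train(dataset, feat_dim, num_landmarks, epochs=10, lr=1e-4):
    loader = DataLoader(dataset, batch_size=1, shuffle=True)   # batch_size=1 avoids padding
    model = LandmarkRegressor(feat_dim, num_landmarks)
    optim = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for epoch in range(epochs):
        total = 0.0
        for feats, ema in loader:
            pred = model(feats)
            loss = loss_fn(pred, ema)
            optim.zero_grad()
            loss.backward()
            optim.step()
            total += loss.item()
        print(f"epoch {epoch}: loss {total / len(loader):.4f}")
    return model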

4. Testing the model
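
A common metric for this kind of regression is the mean Euclidean distance between predicted and ground-truth landmark positions. The sketch below computes it over a held-out dataset; it is a generic evaluation sketch, not necessarily the protocol used in the paper.

import torch

@torch.no_grad()
def mean_landmark_error(model, dataset, num_landmarks=10):
    # Average Euclidean distance (in the landmark units, e.g. millimetres) over all frames.
    model.eval()
    errors = []
    for feats, ema in dataset:
        pred = model(feats.unsqueeze(0)).squeeze(0)            # (T, num_landmarks * 3)
        pred = pred.reshape(-1, num_landmarks, 3)
        target = ema.reshape(-1, num_landmarks, 3)
        dist = torch.linalg.norm(pred - target, dim=-1)        # (T, num_landmarks)
        errors.append(dist.mean().item())
    return sum(errors) / len(errors)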

5. Visualizing the results
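
As a simple placeholder visualization, predicted and ground-truth trajectories of a single landmark coordinate can be plotted over time with matplotlib. Rendering the animated inner mouth itself requires the character rig used in the paper.

import matplotlib.pyplot as plt
import numpy as np

def plot_trajectory(pred, target, landmark_idx=0, axis=0, fps=50):
    # pred, target: (T, num_landmarks, 3) arrays; plots one coordinate of one landmark.
    t = np.arange(pred.shape[0]) / fps
    plt.figure(figsize=(10, 3))
    plt.plot(t, target[:, landmark_idx, axis], label="ground truth")
    plt.plot(t, pred[:, landmark_idx, axis], label="predicted")
    plt.xlabel("time (s)")
    plt.ylabel("position")
    plt.legend()
    plt.tight_layout()
    plt.savefig("trajectory.png")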

Citation

If you find this work useful in your research, please cite our paper:

@inproceedings{medina2022speechtongue,
  title={Speech Driven Tongue Animation},
  author={Medina, Salvador and Tomé, Denis and Stoll, Carsten and Tiede, Mark and Munhall, Kevin and Hauptmann, Alex and Matthews, Iain},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2022},
  organization={IEEE/CVF}
}

License

Our code is released under the MIT License.

The data license agreement requires citation of the paper. Please note that citing the dataset URL instead of the publication does not comply with this license agreement.
