Unsupervised Generative Video Dubbing

Authors: Jimin Tan, Chenqin Yang, Yakun Wang, Yash Deshpande

Project Website: https://tanjimin.github.io/unsupervised-video-dubbing/

Training code for the dubbing model is under the root directory. We used a pre-processed LRW dataset for training; see data.py for details.

We created a simple deployment pipeline, which can be found under the post_processing subdirectory. The pipeline takes the model weights we pre-trained on LRW, along with a video and an audio segment of equal duration, and outputs a dubbed video driven by the audio. See the instructions below for more details.

Requirements

  • LibROSA 0.7.2

  • dlib 19.19

  • OpenCV 4.2.0

  • Pillow 6.2.2

  • PyTorch 1.2.0

  • TorchVision 0.4.0

Post-Processing Folder

.
├── source
│   ├── audio_driver_mp4    # audio drivers (saved in mp4 format)
│   ├── audio_driver_wav    # audio drivers (saved in wav format)
│   ├── base_video          # base videos (the videos you'd like to modify)
│   ├── dlib                # trained dlib models
│   └── model               # trained landmark generation models
├── main.py                 # main function for post-processing
├── main_support.py         # support functions used in main.py
├── models.py               # defines the landmark generation model
├── step_3_vid2vid.sh       # Bash script for running vid2vid
├── step_4_denoise.sh       # Bash script for denoising vid2vid results
├── compare_openness.ipynb  # mouth openness comparison across generated videos
└── README.md
  • shape_predictor_68_face_landmarks.dat

This model is trained on the iBUG 300-W dataset (https://ibug.doc.ic.ac.uk/resources/facial-point-annotations/).

The license for this dataset excludes commercial use, and Stefanos Zafeiriou, one of the creators of the dataset, asked me to include a note here saying that the trained model therefore can't be used in a commercial product. So you should contact a lawyer or talk to Imperial College London to find out if it's OK for you to use this model in a commercial product.

C. Sagonas, E. Antonakos, G. Tzimiropoulos, S. Zafeiriou, M. Pantic. 300 Faces In-the-Wild Challenge: Database and results. Image and Vision Computing (IMAVIS), Special Issue on Facial Landmark Localisation "In-The-Wild", 2016.

Detailed steps for model deployment

  • Go to the post_processing directory
  • Run python3 main.py -r <step>, where <step> is the number of the corresponding step below
    • e.g., python3 main.py -r 1 runs the first step, and so on

Step 1 — Generate landmarks

  • Input
    • Base video file path (./source/base_video/base_video.mp4)
    • Audio driver file path (./source/audio_driver_wav/audio_driver.wav)
    • Epoch (int)
  • Output (./result)
    • keypoints.npy (generated landmarks in npy format)
    • source.txt (contains information about base video, audio driver, model epoch)
  • Process
    • Extract facial landmarks from base video
    • Extract MFCC features from driver audio
    • Pass MFCC features and facial landmarks into the model to retrieve mouth landmarks
    • Combine the facial and mouth landmarks and save them in npy format (see the sketch below)
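
A minimal sketch of the Step 1 feature extraction, assuming the bundled shape_predictor_68_face_landmarks.dat and standard dlib/librosa calls; the MFCC parameters here are assumptions, and the actual values are set by the extract_mfcc function mentioned in the Important Notice below:

# Sketch only: MFCC settings (n_mfcc, sample rate) are assumptions, not the
# exact values used by main.py.
import cv2
import dlib
import librosa
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("./source/dlib/shape_predictor_68_face_landmarks.dat")

def frame_landmarks(frame):
    """Return the 68 (x, y) facial landmarks of the first detected face, or None."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = detector(gray, 1)
    if not faces:
        return None
    shape = predictor(gray, faces[0])
    return np.array([[p.x, p.y] for p in shape.parts()])  # shape (68, 2)

def audio_mfcc(wav_path, n_mfcc=13):
    """MFCC features of the driver audio; n_mfcc=13 is an assumed value."""
    y, sr = librosa.load(wav_path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, frames)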

Step 2 — Test generated frames

  • Input
    • None
  • Output (./result)
    • Folder — save_keypoints: visualized generated frames
    • Folder — save_keypoints_csv: landmark coordinates for each frame, saved in txt format
    • openness.png: mouth openness measured and plotted across all frames
  • Process
    • Generate images from npy file
    • Generate the openness plot (see the sketch below)
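
Openness can be measured as the vertical gap between the inner-lip landmarks; the sketch below assumes the standard 68-point iBUG indexing (inner-lip midpoints 62 and 66) and is not necessarily the exact metric used in compare_openness.ipynb:

# Sketch of a mouth-openness metric; the landmark indices and the plotting
# details are assumptions.
import numpy as np
import matplotlib.pyplot as plt

def mouth_openness(landmarks):
    """Vertical gap between the inner upper lip (62) and inner lower lip (66)."""
    return float(np.linalg.norm(landmarks[66] - landmarks[62]))

def plot_openness(all_landmarks, out_path="./result/openness.png"):
    """Plot openness across frames, one value per frame."""
    values = [mouth_openness(lm) for lm in all_landmarks]
    plt.figure()
    plt.plot(values)
    plt.xlabel("frame")
    plt.ylabel("mouth openness (px)")
    plt.savefig(out_path)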

Step 3 — Execute vid2vid

  • Input
    • None
  • Output
    • The path to the fake images generated by vid2vid is shown at the end; copy them back to ./result/vid2vid_frames/
      • Folder: vid2vid generated images
  • Process
    • Run vid2vid
    • Copy the vid2vid results back to the main folder (see the sketch below)
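
The copy-back is a plain directory copy; in the sketch below the source path is a placeholder for whatever path vid2vid prints at the end of step_3_vid2vid.sh:

# Sketch of the copy-back step; replace the placeholder with the path printed
# by vid2vid.
import shutil

vid2vid_output = "/path/printed/by/vid2vid"                  # placeholder
shutil.copytree(vid2vid_output, "./result/vid2vid_frames")   # destination must not already exist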

Step 4 — Denoise and smooth vid2vid results

  • Input
    • vid2vid generated images folder path
    • Original base images folder path
  • Output
    • Folder: Modified images (base image + vid2vid mouth regions)
    • Folder: Denoised and smoothed frames
  • Process
    • Crop the mouth areas from the vid2vid generated images and paste them back onto the base images to produce the modified images
    • Generate circularly smoothed images using gradient masking
    • Take (modified image, circularly smoothed image) pairs and denoise them (see the sketch below)
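
One way to implement the paste-back with a circular gradient mask is sketched below; the crop box and the feathering profile are assumptions, and the real values come from the frame_crop function mentioned in the Important Notice:

# Sketch of blending a vid2vid mouth crop back into the base frame with a
# radial alpha mask; crop box and feathering are assumptions.
import numpy as np

def blend_mouth(base_frame, vid2vid_frame, box):
    """box = (x, y, w, h): mouth region, assumed aligned in both frames."""
    x, y, w, h = box
    patch = vid2vid_frame[y:y+h, x:x+w].astype(np.float32)
    target = base_frame[y:y+h, x:x+w].astype(np.float32)

    # Radial mask: 1.0 at the patch centre, fading to 0.0 at the edges.
    yy, xx = np.mgrid[0:h, 0:w]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    dist = np.sqrt(((yy - cy) / (h / 2.0)) ** 2 + ((xx - cx) / (w / 2.0)) ** 2)
    mask = np.clip(1.0 - dist, 0.0, 1.0)[..., None]  # (h, w, 1)

    blended = mask * patch + (1.0 - mask) * target
    out = base_frame.copy()
    out[y:y+h, x:x+w] = blended.astype(np.uint8)
    return out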

Step 5 — Generate modified videos with sound

  • Input
    • Saved frames folder path
      • By default, frames are saved in ./result/save_keypoints; enter d to use the default path
      • Otherwise, input the frames folder path
    • Audio driver file path (./source/audio_driver_wav/audio_driver.wav)
  • Output (./result/save_keypoints/result/)
    • video_without_sound.mp4: modified videos without sound
    • audio_only.mp4: audio driver
    • final_output.mp4: modified videos with sound
  • Process
    • Generate the modified video without sound at the defined fps
    • Extract the wav track from the audio driver
    • Combine the audio and video to generate the final output (see the sketch below)
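
A sketch of the final assembly, assuming ffmpeg is available on the PATH; the frame filename pattern and fps value below are assumptions, and the commands issued by main.py may differ:

# Sketch of Step 5: frames -> silent video, then mux with the driver audio.
# Frame pattern (%05d.png) and fps are assumptions.
import subprocess

frames_dir = "./result/save_keypoints"
audio = "./source/audio_driver_wav/audio_driver.wav"
silent = "./result/save_keypoints/result/video_without_sound.mp4"
final = "./result/save_keypoints/result/final_output.mp4"
fps = 25  # must match your base video (see the notice below)

subprocess.run(["ffmpeg", "-y", "-framerate", str(fps),
                "-i", frames_dir + "/%05d.png",
                "-pix_fmt", "yuv420p", silent], check=True)
subprocess.run(["ffmpeg", "-y", "-i", silent, "-i", audio,
                "-c:v", "copy", "-c:a", "aac", "-shortest", final], check=True)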

Important Notice

  • You may need to modify how MFCC features are extracted in extract_mfcc function
  • You may need to modify the region of interest (mouth area) in frame_crop function
  • You may need to modify the frame rate defined in step_3 of main.py; it should match your base video's fps
# How to check your base video fps
# source: https://www.learnopencv.com/how-to-find-frame-rate-or-frames-per-second-fps-in-opencv-python-cpp/

import cv2

video = cv2.VideoCapture("video.mp4")

# Find OpenCV version
(major_ver, minor_ver, subminor_ver) = cv2.__version__.split('.')
if int(major_ver) < 3:
    fps = video.get(cv2.cv.CV_CAP_PROP_FPS)
    print("Frames per second using video.get(cv2.cv.CV_CAP_PROP_FPS): {0}".format(fps))
else:
    fps = video.get(cv2.CAP_PROP_FPS)
    print("Frames per second using video.get(cv2.CAP_PROP_FPS): {0}".format(fps))
video.release()
  • You may need to modify the shell path used by the Bash scripts (check yours with the command below)
echo $SHELL

Update History

  • March 22, 2020: Drafted documentation
