Video Generation from Single Semantic Label Map

	Paper accepted at 2019 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)


Junting Pan	Chengyu Wang	Xu Jia	Jing Shao	Lu Sheng	Junjie Yan	Xiaogang Wang

Abstract

This paper proposes the novel task of video generation conditioned on a single semantic label map, which provides a good balance between flexibility and quality in the generation process. Unlike typical end-to-end approaches, which model both scene content and dynamics in a single step, we decompose this difficult task into two sub-problems. Because current image generation methods produce finer detail than video generation methods, we synthesize high-quality content by generating only the first frame. We then animate the scene based on its semantic meaning to obtain a temporally coherent video. We employ a cVAE to predict optical flow as an intermediate step for generating a video sequence conditioned on the initial single frame. A semantic label map is integrated into the flow prediction module to improve the image-to-video generation process. Extensive experiments on the Cityscapes dataset show that our method outperforms all competing methods.
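
The abstract describes animating a generated first frame through predicted optical flow. As a rough illustration of how a flow field can move the pixels of a single frame, the sketch below backward-warps a small grayscale image with bilinear sampling. It is written in plain Python for readability and is not taken from this repository: the function names `warp_frame` and `animate`, the nested-list image format, the flow convention (each output pixel samples the input at `(x + dx, y + dy)`), and the choice to warp every future frame directly from the first frame are all assumptions made for the example.

```python
import math


def warp_frame(frame, flow):
    """Backward-warp a grayscale frame with a dense flow field.

    frame: H x W nested list of floats.
    flow:  H x W nested list of (dx, dy) pairs. Output pixel (x, y) is
           sampled from the input at (x + dx, y + dy). This convention is
           an assumption of the sketch.
    """
    height, width = len(frame), len(frame[0])
    warped = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            dx, dy = flow[y][x]
            # Clamp the sampling position to the image border.
            sx = min(max(x + dx, 0.0), width - 1.0)
            sy = min(max(y + dy, 0.0), height - 1.0)
            # Integer corners surrounding the sampling position.
            x0, y0 = int(math.floor(sx)), int(math.floor(sy))
            x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
            # Bilinear interpolation weights.
            wx, wy = sx - x0, sy - y0
            top = (1 - wx) * frame[y0][x0] + wx * frame[y0][x1]
            bottom = (1 - wx) * frame[y1][x0] + wx * frame[y1][x1]
            warped[y][x] = (1 - wy) * top + wy * bottom
    return warped


def animate(first_frame, flows):
    """Return one warped frame per flow field, each warped from first_frame."""
    return [warp_frame(first_frame, flow) for flow in flows]
```

For example, calling `animate(first, [flow])` with a flow of `(1.0, 0.0)` at every pixel samples each output pixel from its right-hand neighbour, so the content appears shifted one pixel to the left and the last column is repeated at the border.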

Publication

Find our work on arXiv.

Please cite our work with the following BibTeX entry:

@article{pan2019video,
  title={Video Generation from Single Semantic Label Map},
  author={Pan, Junting and Wang, Chengyu and Jia, Xu and Shao, Jing and Sheng, Lu and Yan, Junjie and Wang, Xiaogang},
  journal={arXiv preprint arXiv:1903.04480},
  year={2019}
}

You may also want to refer to our publication with the more human-friendly Chicago style:

Junting Pan, Chengyu Wang, Xu Jia, Jing Shao, Lu Sheng, Junjie Yan and Xiaogang Wang. "Video Generation from Single Semantic Label Map." CVPR 2019.

Models

The Seg2Vid model presented in our work can be downloaded from the links provided below the figure:

Seg2Vid Architecture

Img2Vid Architecture

Visual Results

Cityscapes (Generation)

Cityscapes (Prediction given the 1st frame and its segmentation mask)

Cityscapes 24 frames (Prediction given the 1st frame and its segmentation mask)

UCF-101 (Prediction given the 1st frame)

KTH (Prediction given the 1st frame)

Getting Started

Dataset

Cityscapes

The Cityscapes dataset can be downloaded from the official website (registration required).
We use the Deeplab-V3 GitHub repository to obtain the corresponding semantic maps.
We organize the dataset as follows:

seg2vid
├── authors 
├── figs
├── gifs
├── logos
├── pretrained_models
├── src
├── data
│   ├── cityscapes
│   │   ├── leftImg8bit_sequence
│   │   │   ├── train_256x128
│   │   │   │   ├── aachen
│   │   │   │   │   ├── aachen_000003_000019_leftImg8bit.png
│   │   │   ├── val_256x128
│   │   │   ├── val_pix2pixHD
│   │   │   │   ├── frankfurt
│   │   │   │   │   ├── frankfurt_000000_000294_pix2pixHD.png
│   │   │   ├── train_semantic_segmask
│   │   │   ├── val_semantic_segmask
│   │   │   │   ├── frankfurt
│   │   │   │   │   ├── frankfurt_000000_000275_ssmask.png
│   │   ├── gtFine
│   │   │   ├── train
│   │   │   ├── val
│   │   │   │   ├── frankfurt
│   │   │   │   │   ├── frankfurt_000000_000294_gtFine_instanceIds.png

KTH
- We use the KTH human action dataset and follow the data processing in svg.

UCF-101
- The UCF-101 dataset can be downloaded from the official website.

Testing

  python -u test_refine_w_mask_two_path.py --suffix refine_w_mask_two_path --dataset cityscapes_two_path

Training

  python -u train_refine_multigpu_w_mask_two_path.py --batch_size 8 --dataset cityscapes_two_path

Seg2Vid on PyTorch

Seg2Vid is implemented in PyTorch.

Contact

If you have any general questions about our work or code that may be of interest to other researchers, please use the public issues section of this GitHub repository. Alternatively, send us an e-mail at junting.pa@gmail.com.
