Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline

[Project page] [ArXiv] [Dataset(Google drive)] [Dataset(Baidu drive)] [Benchmark]

This repository contains the code for the CVPR 2023 paper "Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline". The paper introduces UnAV-100, the first dataset of untrimmed audio-visual videos, and proposes to solve the audio-visual event localization problem in more realistic and challenging scenarios.

Requirements

The implementation is based on PyTorch. Follow INSTALL.md to install the required dependencies.

Data preparation

The proposed UnAV-100 dataset can be downloaded from the [Project Page], including YouTube links to the raw videos, the annotations, and the extracted features.

If you want to use your own choice of video features, you can download the raw videos from this link (Baidu Drive, pwd: qslx). A download script for the raw videos is also provided at scripts/video_download.py.

Note: after downloading the data, unpack the files under data/unav100. The folder structure should look like:

```
This folder
│   README.md
│   ...
└───data/
│   └───unav100/
│       └───annotations/
│       │   └───unav100_annotations.json
│       └───av_features/
│           └───__2MwJ2uHu0_flow.npy    # all features are mixed together in this folder
│           └───__2MwJ2uHu0_rgb.npy
│           └───__2MwJ2uHu0_vggish.npy
│               ...
└───libs
│   ...
```
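As a quick sanity check of the layout above, the per-video feature files can be loaded with NumPy and the two visual streams fused along the channel axis. This is only a sketch under assumptions: the feature dimensions (1024-d per I3D stream, 128-d VGGish) and the `load_av_features` helper are our own illustrations, not part of the repository.

```python
import os
import tempfile
import numpy as np

# Assumed feature dimensions -- the actual extracted features may differ.
T, D_RGB, D_FLOW, D_AUD = 50, 1024, 1024, 128

# Build a dummy av_features folder that mimics the layout shown above.
root = os.path.join(tempfile.mkdtemp(), "data", "unav100", "av_features")
os.makedirs(root)
vid = "__2MwJ2uHu0"
np.save(os.path.join(root, f"{vid}_rgb.npy"), np.zeros((T, D_RGB), np.float32))
np.save(os.path.join(root, f"{vid}_flow.npy"), np.zeros((T, D_FLOW), np.float32))
np.save(os.path.join(root, f"{vid}_vggish.npy"), np.zeros((T, D_AUD), np.float32))

def load_av_features(feat_dir, video_id):
    """Load the three per-video feature files and fuse the visual streams."""
    rgb = np.load(os.path.join(feat_dir, f"{video_id}_rgb.npy"))
    flow = np.load(os.path.join(feat_dir, f"{video_id}_flow.npy"))
    audio = np.load(os.path.join(feat_dir, f"{video_id}_vggish.npy"))
    visual = np.concatenate([rgb, flow], axis=-1)  # (T, D_rgb + D_flow)
    return visual, audio

visual, audio = load_av_features(root, vid)
print(visual.shape, audio.shape)  # (50, 2048) (50, 128)
```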

Training

Run train.py to train the model on the UnAV-100 dataset. This creates an experiment folder under ./ckpt that stores the training config, logs, and checkpoints.

```shell
python ./train.py ./configs/avel_unav100.yaml --output reproduce
```

Evaluation

Run eval.py to evaluate the trained model.

```shell
python ./eval.py ./configs/avel_unav100.yaml ./ckpt/avel_unav100_reproduce
```

[Optional] We also provide a pretrained model for UnAV-100, which can be downloaded from this link.

Citation

If you find our dataset and code useful for your research, please cite our paper:

```
@inproceedings{geng2023dense,
  title={Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline},
  author={Geng, Tiantian and Wang, Teng and Duan, Jinming and Cong, Runmin and Zheng, Feng},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={22942--22951},
  year={2023}
}
```

Acknowledgement

The I3D RGB & flow and VGGish audio features were extracted using video_features. Our baseline model was built on ActionFormer. We thank the authors for sharing their code. If you use our code, please also consider citing their works.
