Self-Supervised Learning by Cross-Modal Audio-Video Clustering

This repository holds the pretrained models for the Cross-Modal Deep Clustering (XDC) method, presented as a spotlight at NeurIPS 2020.

Self-Supervised Learning by Cross-Modal Audio-Video Clustering. Humam Alwassel, Dhruv Mahajan, Bruno Korbar, Lorenzo Torresani, Bernard Ghanem, Du Tran. In NeurIPS, 2020.

Load Pretrained Models

We provide the following pretrained R(2+1)D-18 video models. The reported numbers are top-1 video-level accuracies on UCF101 and HMDB51 after full finetuning, averaged over all splits of each dataset.

Pretraining Name	Description	UCF101	HMDB51	Weights
`r2plus1d_18_xdc_ig65m_kinetics`	XDC pretrained on IG-Kinetics	95.5	68.9	[PyTorch] [Caffe2]
`r2plus1d_18_xdc_ig65m_random`	XDC pretrained on IG-Random	94.6	66.5	[PyTorch] [Caffe2]
`r2plus1d_18_xdc_audioset`	XDC pretrained on AudioSet	93.0	63.7	[PyTorch] [Caffe2]
`r2plus1d_18_fs_kinetics`	fully-supervised pretraining on Kinetics	94.2	65.1	[PyTorch] [Caffe2]
`r2plus1d_18_fs_imagenet`	fully-supervised pretraining on ImageNet	84.0	48.1	[PyTorch] [Caffe2]
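To make the metric in the table concrete, here is a minimal sketch of how a video-level top-1 accuracy averaged over splits could be computed. It assumes that a video's prediction is obtained by averaging the class scores of its clips, which this README does not specify. The function names and toy inputs are hypothetical and are not part of this repository.

```python
from collections import defaultdict
from statistics import mean


def video_level_top1(clip_scores, clip_video_ids, video_labels):
    """Top-1 accuracy (in percent) after averaging clip scores per video.

    Assumption: a video's prediction is the argmax of the mean of its
    clip-level class scores. The README does not state the aggregation rule.
    """
    totals = {}
    counts = defaultdict(int)
    # Accumulate the class scores of all clips that belong to the same video.
    for scores, vid in zip(clip_scores, clip_video_ids):
        if vid not in totals:
            totals[vid] = [0.0] * len(scores)
        totals[vid] = [t + s for t, s in zip(totals[vid], scores)]
        counts[vid] += 1

    correct = 0
    for vid, summed in totals.items():
        averaged = [t / counts[vid] for t in summed]
        predicted = max(range(len(averaged)), key=averaged.__getitem__)
        correct += int(predicted == video_labels[vid])
    return 100.0 * correct / len(totals)


def average_over_splits(per_split_accuracies):
    """Mean of the per-split accuracies, as reported in the table."""
    return mean(per_split_accuracies)
```

In this sketch, `video_level_top1` would be called once per train/test split, and the per-split results passed to `average_over_splits`.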

There are two ways to load the XDC pretrained models in PyTorch: (1) via PyTorch Hub or (2) via source code.

Via PyTorch Hub (Recommended)

⚠️ [Known Issue] Loading XDC models this way breaks with torchvision v0.13 or higher because of backward-incompatible changes introduced in torchvision. Make sure to use torchvision v0.12 or earlier when loading XDC models via the torch.hub.load() API. Loading models via source code still works as expected.
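One way to catch this early is to check the installed torchvision version before calling torch.hub.load(). The helper below is a hypothetical sketch, not part of this repository. It assumes the version string starts with "major.minor" and takes the 0.13 threshold from the known issue above.

```python
import re


def hub_loading_supported(torchvision_version):
    """Return True if the version is older than 0.13.

    The 0.13 threshold comes from the known issue in this README; the
    helper itself is an illustrative assumption.
    """
    match = re.match(r"(\d+)\.(\d+)", torchvision_version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {torchvision_version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) < (0, 13)


# Possible usage with the installed package (requires torchvision):
#   from importlib.metadata import version
#   if not hub_loading_supported(version("torchvision")):
#       raise RuntimeError("Load XDC via source code or downgrade torchvision.")
```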

You can load all our pretrained models using the torch.hub.load() API.

```python
import torch

model = torch.hub.load('HumamAlwassel/XDC', 'xdc_video_encoder',
                        pretraining='r2plus1d_18_xdc_ig65m_kinetics',
                        num_classes=42)
```

Use the pretraining parameter to select a pretrained model from the table above (the default is r2plus1d_18_xdc_ig65m_kinetics). Pretrained weights are loaded for all layers except the FC classifier layer, which has size 512 x num_classes and is randomly initialized. Set the num_classes keyword argument to match your application (the default is 400). Run print(torch.hub.help('HumamAlwassel/XDC', 'xdc_video_encoder')) for the model documentation. See the PyTorch Hub documentation to learn more about PyTorch Hub.
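Because only the FC layer starts from random weights, a common first experiment is to train that layer alone on top of the frozen pretrained backbone. The sketch below shows one way to do this. It assumes the classifier parameters are registered under the name prefix "fc.", as in torchvision's video ResNet models; this is an assumption about the loaded model, not something the README states. The function name is hypothetical, and this is a different protocol from the full finetuning used for the table above.

```python
def freeze_all_but_classifier(model, classifier_prefix="fc."):
    """Leave only the classifier parameters trainable; return their names.

    Works with any object exposing named_parameters() the way
    torch.nn.Module does. The "fc." prefix is an assumption.
    """
    trainable = []
    for name, param in model.named_parameters():
        # Only parameters of the classifier layer keep receiving gradients.
        param.requires_grad = name.startswith(classifier_prefix)
        if param.requires_grad:
            trainable.append(name)
    return trainable
```

After calling it, only the parameters that still require gradients need to be passed to the optimizer. If the returned list is empty, the prefix assumption does not hold for the loaded model.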

Via Source Code

Clone this repo and create the conda environment.

```shell
git clone https://github.com/HumamAlwassel/XDC.git
cd XDC
conda env create -f environment.yml
conda activate xdc
```

Load the pretrained models from the file xdc.py.

```python
from xdc import xdc_video_encoder

model = xdc_video_encoder(pretraining='r2plus1d_18_xdc_ig65m_kinetics',
                          num_classes=42)
```
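Both loading paths take the same pretraining and num_classes arguments, so a small guard can catch a mistyped pretraining name before any weights are downloaded. The helper below is a hypothetical sketch. The names and defaults are copied from this README, but the validation itself is an assumption and is not part of xdc.py.

```python
# Pretraining names as listed in the table of this README.
XDC_PRETRAININGS = (
    "r2plus1d_18_xdc_ig65m_kinetics",
    "r2plus1d_18_xdc_ig65m_random",
    "r2plus1d_18_xdc_audioset",
    "r2plus1d_18_fs_kinetics",
    "r2plus1d_18_fs_imagenet",
)


def xdc_kwargs(pretraining="r2plus1d_18_xdc_ig65m_kinetics", num_classes=400):
    """Validate the arguments and return them as keyword arguments."""
    if pretraining not in XDC_PRETRAININGS:
        raise ValueError(
            f"Unknown pretraining {pretraining!r}; expected one of {XDC_PRETRAININGS}"
        )
    if not isinstance(num_classes, int) or num_classes < 1:
        raise ValueError(f"num_classes must be a positive integer, got {num_classes!r}")
    return {"pretraining": pretraining, "num_classes": num_classes}


# Possible usage (requires this repository on the Python path):
#   from xdc import xdc_video_encoder
#   model = xdc_video_encoder(**xdc_kwargs(num_classes=101))  # e.g. UCF101
```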

Feature Extraction and Model Finetuning

Please refer to the Facebook Video Model Zoo (VMZ) repo for PyTorch/Caffe2 scripts for feature extraction and model finetuning on datasets such as UCF101 and HMDB51.

Please cite this work if you find XDC useful for your research.

```bibtex
@inproceedings{alwassel_2020_xdc,
  title={Self-Supervised Learning by Cross-Modal Audio-Video Clustering},
  author={Alwassel, Humam and Mahajan, Dhruv and Korbar, Bruno and
          Torresani, Lorenzo and Ghanem, Bernard and Tran, Du},
  booktitle={Advances in Neural Information Processing Systems (NeurIPS)},
  year={2020}
}
```
