BEAR: a new BEnchmark on video Action Recognition

This repo contains the data and pre-trained models from "A Large-scale Study of Spatiotemporal Representation Learning with a New Benchmark on Action Recognition".

Andong Deng*, Taojiannan Yang*, Chen Chen
Center for Research in Computer Vision, University of Central Florida

[CVF]

If you find our work useful in your research, please cite:

@article{deng2023BEAR,
  title={A Large-scale Study of Spatiotemporal Representation Learning with a New Benchmark on Action Recognition},
  author={Deng, Andong and Yang, Taojiannan and Chen, Chen},
  journal={arXiv preprint arXiv:2303.13505},
  year={2023}
}

Updates

04/21/2024 Update HuggingFace link for pre-trained models.

08/08/2023 Update Dropbox link for pre-trained models.

07/17/2023 BEAR is accepted by ICCV 2023!

03/24/2023 Update Dropbox link for Mini-Sports1M.

03/23/2023 Initial commit.

Introduction

The goal of building a benchmark (a suite of datasets) is to provide a unified protocol for fair evaluation and thus facilitate progress in a given area. Nonetheless, we point out that existing action recognition protocols can yield only partial evaluations due to several limitations.

To comprehensively probe the effectiveness of spatiotemporal representation learning, we introduce BEAR, a new BEnchmark on video Action Recognition. BEAR is a collection of 18 video datasets grouped into 5 categories (anomaly, gesture, daily, sports, and instructional), covering a diverse set of real-world applications. With BEAR, we thoroughly evaluate 6 common spatiotemporal models pre-trained with both supervised and self-supervised learning. We also report transfer performance via standard finetuning, few-shot finetuning, and unsupervised domain adaptation. Our observations suggest that current state-of-the-art models cannot reliably guarantee high performance on datasets close to real-world applications. We hope BEAR can serve as a fair and challenging evaluation benchmark and offer insights for building next-generation spatiotemporal learners.

Evaluation is straightforward: all scripts are provided in this codebase, so users only need to download the datasets and run the provided scripts.

Datasets

The following table summarizes the statistics of the 18 datasets collected in BEAR (viewpoints: 1st = first-person, 3rd = third-person, sur. = surveillance, dro. = drone):

| Dataset | Domain | # Classes | # Clips | Avg. Length (sec.) | Training data per class (min, max) | Split ratio | Video source | Video viewpoint |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| XD-Violence | Anomaly | 5 | 4135 | 14.94 | (36, 2046) | 3.64:1 | Movies, sports, CCTV, etc. | 3rd, sur. |
| UCF Crime | Anomaly | 12 | 600 | 132.51 | 38 | 3.17:1 | CCTV camera | 3rd, sur. |
| MUVIM | Anomaly | 2 | 1127 | 68.1 | (296, 604) | 3.96:1 | Self-collected | 3rd, sur. |
| WLASL100 | Gesture | 100 | 1375 | 1.23 | (7, 20) | 5.37:1 | Sign language website | 3rd |
| Jester | Gesture | 27 | 133349 | 3 | (3216, 9592) | 8.02:1 | Self-collected | 3rd |
| UAV Human | Gesture | 155 | 22476 | 5 | (20, 114) | 2:1 | Self-collected | 3rd, dro. |
| CharadesEgo | Daily | 157 | 42107 | 10.93 | (26, 1120) | 3.61:1 | YouTube | 1st |
| Toyota Smarthome | Daily | 31 | 14262 | 1.78 | (23, 2312) | 1.63:1 | Self-collected | 3rd, sur. |
| Mini-HACS | Daily | 200 | 10000 | 2 | 50 | 4:1 | YouTube | 1st, 3rd |
| MPII Cooking | Daily | 67 | 3748 | 153.04 | (5, 217) | 4.69:1 | Self-collected | 3rd |
| Mini-Sports1M | Sports | 487 | 24350 | 10 | 50 | 4:1 | YouTube | 3rd |
| FineGym99 | Sports | 99 | 20389 | 1.65 | (33, 951) | 2.24:1 | Competition videos | 3rd |
| MOD20 | Sports | 20 | 2324 | 7.4 | (73, 107) | 2.29:1 | YouTube and self-collected | 3rd, dro. |
| COIN | Instructional | 180 | 10426 | 37.01 | (10, 63) | 3.22:1 | YouTube | 1st, 3rd |
| MECCANO | Instructional | 61 | 7880 | 2.82 | (2, 1157) | 1.79:1 | Self-collected | 1st |
| INHARD | Instructional | 14 | 5303 | 1.36 | (27, 955) | 2.16:1 | Self-collected | 3rd |
| PETRAW | Instructional | 7 | 9727 | 2.16 | (122, 1262) | 1.5:1 | Self-collected | 1st |
| MISAW | Instructional | 20 | 1551 | 3.8 | (1, 316) | 2.38:1 | Self-collected | 1st |

Datasets Download and Pre-processing

We provide a downloading and pre-processing pipeline for each dataset here.

HuggingFace links for a subset of the BEAR datasets are here:

Mini-Sports1M · Jester · FineGym · MOD20 · MPII-Cooking2
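
For orientation, most of the per-dataset pipelines boil down to downloading the source videos and sampling frames (or re-encoding clips) for training. Below is a minimal frame-extraction sketch using OpenCV; the paths and sampling rate are illustrative only, so follow the per-dataset instructions linked above for the actual steps:

```python
import os
import cv2  # pip install opencv-python

def extract_frames(video_path, out_dir, target_fps=2):
    """Save frames from a video as JPEGs at roughly `target_fps` frames/sec."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps  # fall back if unknown
    step = max(int(round(native_fps / target_fps)), 1)    # keep every step-th frame
    saved = idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            cv2.imwrite(os.path.join(out_dir, f'img_{saved:05d}.jpg'), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# e.g. extract_frames('videos/clip0001.mp4', 'rawframes/clip0001')
```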

Pre-trained Models

We prepare Kinetics-400 pre-trained models with both supervised and self-supervised pre-training:

The updated HuggingFace links for both self-supervised and supervised pre-training are here:

SSL · SUP

If needed, the pre-trained models can also be downloaded from the links below:

| Model | Supervised (Top-1 Accuracy) | Self-supervised (KNN evaluation) |
| --- | --- | --- |
| TSN | 77.6 (Dropbox) | 43.1 (Dropbox) |
| TSM | 76.4 (Dropbox) | 43.2 (Dropbox) |
| I3D | 74.2 (Dropbox) | 51.3 (Dropbox) |
| NL | 73.9 (Dropbox) | 50.7 (Dropbox) |
| TimeSformer | 75.8 (Dropbox) | 50.3 (Dropbox) |
| VideoSwin | 77.6 (Dropbox) | 51.1 (Dropbox) |
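
For intuition, the self-supervised column uses a KNN classifier over frozen features: each test clip is labeled by a majority vote of its nearest training clips in feature space. A minimal sketch, assuming features have already been extracted into tensors (the function and variable names are illustrative, and the exact protocol in the paper may differ):

```python
import torch
import torch.nn.functional as F

def knn_accuracy(train_feats, train_labels, test_feats, test_labels, k=20):
    """Classify each test clip by majority vote over its k nearest
    training clips under cosine similarity."""
    train_feats = F.normalize(train_feats, dim=1)  # L2-normalize so that
    test_feats = F.normalize(test_feats, dim=1)    # dot product = cosine sim
    sims = test_feats @ train_feats.T              # (n_test, n_train)
    nn_idx = sims.topk(k, dim=1).indices           # k nearest training clips
    nn_labels = train_labels[nn_idx]               # (n_test, k) label votes
    preds = nn_labels.mode(dim=1).values           # majority vote per test clip
    return (preds == test_labels).float().mean().item()
```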

Benchmark

Based on models pre-trained on Kinetics-400, we provide 4 evaluation paradigms in BEAR:

BEAR-Standard
BEAR-Fewshot
BEAR-Zeroshot
BEAR-UDA

Standard Finetuning

We build our standard finetuning on MMAction2, a popular video understanding toolbox.

We provide specific training steps here.
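
As a rough illustration, finetuning in MMAction2 is driven by Python config files. The sketch below adapts a Kinetics-400 TSN checkpoint to a 20-class target dataset; the base config, checkpoint path, and schedule are placeholders for illustration, not the exact configs shipped in this repo:

```python
# MMAction2-style finetuning config (sketch; paths below are placeholders).
_base_ = ['../tsn/tsn_r50_1x1x3_100e_kinetics400_rgb.py']  # hypothetical base config

# Replace the 400-way Kinetics head with one sized for the target dataset,
# e.g. 20 classes for MOD20.
model = dict(cls_head=dict(num_classes=20))

# Initialize from a Kinetics-400 pre-trained checkpoint (hypothetical path).
load_from = 'checkpoints/tsn_k400_supervised.pth'

# A shorter schedule with a reduced learning rate is typical for finetuning.
optimizer = dict(type='SGD', lr=0.001, momentum=0.9, weight_decay=1e-4)
total_epochs = 50
```

Training would then be launched with MMAction2's standard `tools/train.py` entry point, pointing it at a config like this one.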

The finetuning results of supervised pre-training are shown below:

| Dataset | TSN | TSM | I3D | NL | TimeSformer | VideoSwin |
| --- | --- | --- | --- | --- | --- | --- |
| XD-Violence | 85.54 | 82.96 | 79.93 | 79.91 | 82.51 | 82.40 |
| UCF-Crime | 35.42 | 42.36 | 31.94 | 34.03 | 36.11 | 34.72 |
| MUVIM | 79.30 | 100 | 97.80 | 98.68 | 94.71 | 100 |
| WLASL | 29.63 | 43.98 | 49.07 | 52.31 | 37.96 | 45.37 |
| Jester | 86.31 | 95.21 | 92.99 | 93.49 | 93.42 | 94.27 |
| UAV-Human | 27.89 | 38.84 | 33.49 | 33.03 | 28.93 | 38.66 |
| CharadesEGO | 8.26 | 8.11 | 6.13 | 6.42 | 8.58 | 8.55 |
| Toyota Smarthome | 74.73 | 82.22 | 79.51 | 76.86 | 69.21 | 79.88 |
| Mini-HACS | 84.69 | 80.87 | 77.74 | 79.51 | 79.81 | 84.94 |
| MPII Cooking | 38.39 | 46.74 | 48.71 | 42.19 | 40.97 | 46.59 |
| Mini-Sports1M | 54.11 | 50.06 | 46.90 | 46.16 | 51.79 | 55.34 |
| FineGym | 63.73 | 80.95 | 72.00 | 71.21 | 63.92 | 65.02 |
| MOD20 | 98.30 | 96.75 | 96.61 | 96.18 | 94.06 | 92.64 |
| COIN | 81.15 | 78.49 | 73.79 | 74.30 | 82.99 | 76.27 |
| MECCANO | 41.06 | 39.28 | 36.88 | 36.13 | 40.95 | 38.89 |
| InHARD | 84.39 | 88.08 | 82.06 | 86.31 | 85.16 | 87.60 |
| PETRAW | 94.30 | 95.72 | 94.84 | 94.54 | 94.30 | 96.43 |
| MISAW | 61.44 | 75.16 | 68.19 | 64.27 | 71.46 | 69.06 |

The finetuning results of self-supervised pre-training are shown below:

| Dataset | TSN | TSM | I3D | NL | TimeSformer | VideoSwin |
| --- | --- | --- | --- | --- | --- | --- |
| XD-Violence | 80.49 | 81.73 | 80.38 | 80.94 | 77.47 | 77.91 |
| UCF-Crime | 37.50 | 35.42 | 34.03 | 34.72 | 36.11 | 34.03 |
| MUVIM | 99.12 | 100 | 66.96 | 66.96 | 99.12 | 100 |
| WLASL | 27.01 | 27.78 | 29.17 | 30.56 | 25.56 | 28.24 |
| Jester | 83.22 | 95.32 | 87.23 | 93.89 | 90.33 | 90.18 |
| UAV-Human | 15.70 | 30.75 | 31.95 | 26.28 | 21.02 | 35.12 |
| CharadesEGO | 6.29 | 6.59 | 6.24 | 6.31 | 7.59 | 7.65 |
| Toyota Smarthome | 68.71 | 81.34 | 77.82 | 76.16 | 61.64 | 80.18 |
| Mini-HACS | 64.60 | 63.24 | 70.24 | 60.57 | 73.92 | 75.58 |
| MPII Cooking | 34.45 | 50.08 | 42.79 | 40.36 | 35.81 | 47.19 |
| Mini-Sports1M | 43.02 | 43.59 | 46.28 | 45.56 | 44.60 | 47.60 |
| FineGym | 54.62 | 75.87 | 69.62 | 68.79 | 47.60 | 58.94 |
| MOD20 | 91.23 | 92.08 | 91.94 | 92.08 | 90.81 | 92.36 |
| COIN | 61.48 | 64.53 | 71.57 | 72.78 | 67.64 | 68.78 |
| MECCANO | 32.34 | 35.10 | 34.86 | 33.62 | 33.30 | 37.80 |
| InHARD | 75.63 | 87.66 | 82.54 | 80.81 | 71.28 | 80.10 |
| PETRAW | 93.18 | 95.51 | 95.02 | 94.38 | 85.56 | 91.46 |
| MISAW | 59.04 | 73.64 | 70.37 | 64.27 | 60.78 | 68.85 |

Few-shot Finetuning

Please follow the instructions here to perform few-shot evaluation on BEAR.
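
For intuition, few-shot finetuning trains on an N-way K-shot support set sampled from the target dataset before evaluating on its test split. A minimal sampling sketch (the annotation format and function name are assumptions for illustration, not the repo's actual code):

```python
import random
from collections import defaultdict

def sample_support_set(annotations, n_way=5, k_shot=1, seed=0):
    """Sample an N-way K-shot support set from (clip_path, label) pairs."""
    rng = random.Random(seed)                      # fixed seed => reproducible episode
    by_class = defaultdict(list)
    for path, label in annotations:
        by_class[label].append(path)
    classes = rng.sample(sorted(by_class), n_way)  # choose N classes
    return [(path, c) for c in classes
            for path in rng.sample(by_class[c], k_shot)]  # K clips per class
```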

The few-shot results are shown below:

Zero-shot Evaluation

We build our zero-shot evaluation on the popular CLIP and ActionCLIP. Follow the instructions here to evaluate zero-shot performance on BEAR.
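
For intuition, zero-shot classification with CLIP scores each clip against a text prompt per class name, with frame features averaged over the clip. A minimal sketch using the openai CLIP package (the prompt template and frame averaging are simplifying assumptions, not necessarily what ActionCLIP does):

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model, preprocess = clip.load('ViT-B/32', device=device)

def zero_shot_predict(frames, class_names):
    """Return the index of the class whose text prompt best matches the clip.
    `frames` is a list of PIL images sampled from the clip."""
    images = torch.stack([preprocess(f) for f in frames]).to(device)
    prompts = clip.tokenize(
        [f'a video of a person {c}' for c in class_names]).to(device)
    with torch.no_grad():
        img = model.encode_image(images).mean(dim=0, keepdim=True)  # average frames
        txt = model.encode_text(prompts)
    img = img / img.norm(dim=-1, keepdim=True)   # cosine similarity via
    txt = txt / txt.norm(dim=-1, keepdim=True)   # normalized dot product
    return (img @ txt.T).argmax(dim=-1).item()
```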

The zero-shot results are shown below:

Domain Adaptation

Please follow the instructions here to perform UDA evaluation on BEAR.

The UDA baseline results are shown below. "Source only" trains on the labeled source domain and evaluates directly on the target domain; "supervised target" finetunes on labeled target data and serves as an upper bound.

| Method | T>M | M>T | MS>MOD | MOD>MS | U>X | X>U | P>MS | Jester | IT>IL | IT>IR | IL>IR | IL>IT | IR>IT | IR>IL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Source only | 5.32 | 7.36 | 18.25 | 12.76 | 54.20 | 33.33 | 61.45 | 68.73 | 4.18 | 30.39 | 19.01 | 22.65 | 24.14 | 12.42 |
| Supervised target | 70.21 | 65.13 | 34.08 | 35.52 | 75.06 | 63.89 | 94.40 | 97.61 | 26.00 | 83.55 | 83.55 | 85.52 | 85.52 | 26.00 |
