SSL4VideoSurvey

A collection of works on self-supervised deep learning for video. The papers listed here refer to our survey:

Self-Supervised Learning for Videos: A Survey

Madeline Chantry Schiappa, Yogesh Singh Rawat, Mubarak Shah

Summary

In this survey, we provide a review of existing approaches to self-supervised learning, focusing on the video domain. We summarize these methods into four categories based on their learning objectives: 1) pretext tasks, 2) generative learning, 3) contrastive learning, and 4) cross-modal agreement. We further introduce the commonly used datasets, downstream evaluation tasks, insights into the limitations of existing works, and potential future directions in this area.

Overview of publications. Statistics of self-supervised (SSL) video representation learning research in recent years. From left to right: (a) the total number of SSL-related papers published in top conference venues, (b) a categorical breakdown of the main research topics studied in SSL, and (c) a breakdown of the main modalities used in SSL. The year 2022 remains incomplete because a majority of the conferences occur later in the year.

Overview of publications related to action recognition. Action recognition performance of models over time for different self-supervised strategies, including different modalities: video-only (V), video-text (V+T), video-audio (V+A), and video-text-audio (V+T+A). More recently, contrastive learning has become the most popular strategy.

Training Tasks

Pre-Text Learning

Action Recognition

Downstream evaluation of action recognition for pretext-task self-supervised learning, measured by prediction accuracy. Top scores are in bold. Playback-speed tasks typically perform best.

| Model | Subcategory | Visual Backbone | Pre-Train | UCF101 | HMDB51 |
|---|---|---|---|---|---|
| Geometry | Appearance | AlexNet | UCF101/HMDB51 | 54.10 | 22.60 |
| Wang et al. | Appearance | C3D | UCF101 | 61.20 | 33.40 |
| 3D RotNet | Appearance | 3D R-18 | MT | 62.90 | 33.70 |
| VideoJigsaw | Jigsaw | CaffeNet | Kinetics | 54.70 | 27.00 |
| 3D ST-puzzle | Jigsaw | C3D | Kinetics | 65.80 | 33.70 |
| CSJ | Jigsaw | R(2+3)D | Kinetics+UCF101+HMDB51 | 79.50 | 52.60 |
| PRP | Speed | R3D | Kinetics | 72.10 | 35.00 |
| SpeedNet | Speed | S3D-G | Kinetics | 81.10 | 48.80 |
| Jenni et al. | Speed | R(2+1)D | UCF101 | 87.10 | 49.80 |
| PacePred | Speed | S3D-G | UCF101 | 87.10 | 52.60 |
| ShuffleLearn | Temporal Order | AlexNet | UCF101 | 50.90 | 19.80 |
| OPN | Temporal Order | VGG-M | UCF101 | 59.80 | 23.80 |
| O3N | Temporal Order | AlexNet | UCF101 | 60.30 | 32.50 |
| ClipOrder | Temporal Order | R3D | UCF101 | 72.40 | 30.90 |
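
The playback-speed entries above (PRP, SpeedNet, PacePred, Jenni et al.) train a network to recognize how fast a clip is being played, so the label comes for free from the sampling rate. Below is a minimal, illustrative sketch of such a pretext objective in PyTorch; the `backbone`, candidate speeds, and clip length are assumptions for this example, not the exact setup of any of the listed papers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

SPEEDS = [1, 2, 4, 8]  # candidate playback rates (illustrative choice)

def subsample(clip, rate, num_frames=16):
    """Simulate playback speed by keeping every `rate`-th frame.

    clip: (C, T, H, W) tensor with T >= rate * num_frames.
    """
    idx = torch.arange(num_frames) * rate
    return clip[:, idx]  # (C, num_frames, H, W)

class SpeedClassifier(nn.Module):
    """Pretext head: predict which playback rate produced the clip."""

    def __init__(self, backbone, feat_dim, num_speeds=len(SPEEDS)):
        super().__init__()
        self.backbone = backbone  # any 3D encoder, e.g. an R3D or S3D-G variant
        self.head = nn.Linear(feat_dim, num_speeds)

    def forward(self, clips):  # clips: (B, C, num_frames, H, W)
        return self.head(self.backbone(clips))

def pretext_step(model, raw_clips):
    """One self-supervised step: the labels are the sampled speeds themselves."""
    labels = torch.randint(len(SPEEDS), (raw_clips.size(0),))
    clips = torch.stack([subsample(c, SPEEDS[int(l)]) for c, l in zip(raw_clips, labels)])
    return F.cross_entropy(model(clips), labels)
```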

Video Retrieval

Performance for the downstream video retrieval task with top scores for each category in bold. K/U/H indicates using all three datasets for pre-training, i.e. Kinetics, UCF101, and HMDB51.

| Model | Category | Subcategory | Visual Backbone | Pre-train | UCF101 R@5 | HMDB51 R@5 |
|---|---|---|---|---|---|---|
| SpeedNet | Pretext | Speed | S3D-G | Kinetics | 28.10 | -- |
| ClipOrder | Pretext | Temporal Order | R3D | UCF101 | 30.30 | 22.90 |
| OPN | Pretext | Temporal Order | CaffeNet | UCF101 | 28.70 | -- |
| CSJ | Pretext | Jigsaw | R(2+3)D | K/U/H | 40.50 | -- |
| PRP | Pretext | Speed | R3D | Kinetics | 38.50 | 27.20 |
| Jenni et al. | Pretext | Speed | 3D R-18 | Kinetics | 48.50 | -- |
| PacePred | Pretext | Speed | R(2+1)D | UCF101 | 49.70 | 32.20 |

Generative Learning

Action Recognition

Downstream action recognition evaluation for models that use a generative self-supervised pre-training approach. Top scores are in bold.

| Model | Subcategory | Visual Backbone | Pre-train | UCF101 | HMDB51 |
|---|---|---|---|---|---|
| Mathieu et al. | Frame Prediction | C3D | Sports1M | 52.10 | -- |
| VideoGan | Reconstruction | VAE | Flickr | 52.90 | -- |
| Liang et al. | Frame Prediction | LSTM | UCF101 | 55.10 | -- |
| VideoMoCo | Frame Prediction | R(2+1)D | Kinetics | 78.70 | 49.20 |
| MemDPC-Dual | Frame Prediction | R(2+3)D | Kinetics | 86.10 | 54.50 |
| Tian et al. | Reconstruction | 3D R-101 | Kinetics | 88.10 | 59.00 |
| VideoMAE | MAE | ViT-L | ImageNet | 91.3 | 62.6 |
| MotionMAE | MAE | ViT-B | Kinetics | 96.3 | -- |

Downstream evaluation of self-supervised methods on action recognition, measured by prediction accuracy on Something-Something (SS) and Kinetics400 (Kinetics). SS depends more heavily on temporal reasoning and is therefore more challenging. Top scores for each category are in bold and second-best scores are underlined.

| Model | Category | Subcategory | Visual Backbone | Pre-Train | SS | Kinetics |
|---|---|---|---|---|---|---|
| BEVT | Generative | MAE | SWIN-B | Kinetics+ImageNet | 71.4 | 81.1 |
| MAE | Generative | MAE | ViT-H | Kinetics | 74.1 | 81.1 |
| MaskFeat | Generative | MAE | MViT | Kinetics | 74.4 | 86.7 |
| VideoMAE | Generative | MAE | ViT-L | ImageNet | 75.3 | 85.1 |
| MotionMAE | Generative | MAE | ViT-B | Kinetics | 75.5 | 81.7 |
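
The MAE-style models above (BEVT, MAE, MaskFeat, VideoMAE, MotionMAE) share one recipe: mask a large fraction of space-time patches and train the network to reconstruct what was hidden. A minimal sketch of that objective on flattened patch tokens is shown below; the `encoder`/`decoder` interfaces, the 90% masking ratio, and plain pixel targets are assumptions for illustration (the papers differ, e.g. MaskFeat regresses HOG features and VideoMAE uses tube masking).

```python
import torch
import torch.nn.functional as F

def masked_reconstruction_loss(encoder, decoder, patches, mask_ratio=0.9):
    """Masked-autoencoder objective on video patch tokens.

    patches: (B, N, D) flattened space-time patches for a batch of clips.
    Only the visible tokens are encoded; the decoder predicts the masked ones.
    """
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))

    # Random masking: an independent permutation of patch indices per sample.
    ids_shuffle = torch.rand(B, N).argsort(dim=1)
    ids_keep, ids_mask = ids_shuffle[:, :num_keep], ids_shuffle[:, num_keep:]

    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    latent = encoder(visible)                    # (B, num_keep, D'); interface is illustrative
    pred = decoder(latent, ids_keep, ids_mask)   # (B, N - num_keep, D); interface is illustrative

    target = torch.gather(patches, 1, ids_mask.unsqueeze(-1).expand(-1, -1, D))
    return F.mse_loss(pred, target)              # reconstruction error on masked patches only
```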

Video Retrieval

Performance for the downstream video retrieval task with top scores for each category in bold. K/U/H indicates using all three datasets for pre-training, i.e. Kinetics, UCF101, and HMDB51.

| Model | Category | Subcategory | Visual Backbone | Pre-train | UCF101 R@5 | HMDB51 R@5 |
|---|---|---|---|---|---|---|
| MemDPC-RGB | Generative | Frame Prediction | R(2+3)D | Kinetics | 40.40 | 25.70 |
| MemDPC-Flow | Generative | Frame Prediction | R(2+3)D | Kinetics | 63.20 | 37.60 |

Text-to-Video Retrieval

Performance for the downstream text-to-video retrieval task. Top scores for each category are in bold. Masked modeling (MM) is a generative approach that uses video together with text. Cross-modal agreement covers a variety of contrastive approaches that can use video with audio and/or text; these pre-training approaches typically perform best. Models marked with an asterisk (*) report results after fine-tuning on the target dataset (YouCook2 or MSRVTT). The pre-training dataset labeled COMBO combines CC3M, WV-2M, and COCO.

| Model | Visual | Text | Pre-Train | YouCook2 R@5 | MSRVTT R@5 |
|---|---|---|---|---|---|
| ActBERT | 3D R-32 | BERT | Kinetics+How2 | 26.70 | 23.40 |
| HERO | SlowFast | WordPieces | How2+TV | -- | 43.40 |
| ClipBERT | R-50 | WordPieces | VisualGenome | -- | 46.80 |
| VLM | S3D-g | BERT | How2 | 56.88 | 55.50 |
| UniVL | S3D-g | BERT | How2 | 57.60 | 49.60 |
| Amrani et al. | R-152 | Word2Vec | How2 | -- | 21.30 |

Video Captioning

Downstream evaluation for video captioning on the YouCook2 dataset for video-language models. Top scores are in bold. MM: Masked modeling with video and text, and K/H: Kinetics+HowTo100M.

| Model | Category | Subcategory | Visual | Text | Pre-train | BLEU4 | METEOR | ROUGE | CIDEr |
|---|---|---|---|---|---|---|---|---|---|
| VideoBert | Generative | MM | S3D-g | BERT | Kinetics | 4.33 | 11.94 | 28.80 | 0.55 |
| ActBERT | Generative | MM | 3D R-32 | BERT | K/H | 5.41 | 13.30 | 30.56 | 0.65 |
| VLM | Generative | MM | S3D-g | BERT | How2 | 12.27 | 18.22 | 41.51 | 1.39 |
| UniVL | Generative | MM | S3D-g | BERT | How2 | 17.35 | 22.35 | 46.52 | 1.81 |

Contrastive Learning

Action Recognition

Downstream evaluation of self-supervised methods on action recognition, measured by prediction accuracy on Something-Something (SS) and Kinetics400 (Kinetics). SS depends more heavily on temporal reasoning and is therefore more challenging. Top scores for each category are in bold and second-best scores are underlined.

| Model | Category | Subcategory | Visual Backbone | Pre-Train | SS | Kinetics |
|---|---|---|---|---|---|---|
| pSwaV | Contrastive | View Aug. | R-50 | Kinetics | 51.7 | 62.7 |
| pSimCLR | Contrastive | View Aug. | R-50 | Kinetics | 52.0 | 62.0 |
| pMoCo | Contrastive | View Aug. | R-50 | Kinetics | 54.4 | 69.0 |
| pBYOL | Contrastive | View Aug. | R-50 | Kinetics | 55.8 | 71.5 |

Contrastive

Action Recognition

Downstream action recognition on UCF101 and HMDB51 for models that use contrastive learning and/or cross-modal agreement. Top scores for each category are in bold. Modalities include video (V), optical flow (F), human keypoints (K), text (T), and audio (A). Spatio-temporal augmentations with contrastive learning are typically the highest-performing approaches.

| Model | Subcategory | Visual Backbone | Modalities | Pre-Train | UCF101 | HMDB51 |
|---|---|---|---|---|---|---|
| VIE | Clustering | Slowfast | V | Kinetics | 78.90 | 50.1 |
| VIE-2pathway | Clustering | R-18 | V | Kinetics | 80.40 | 52.5 |
| Tokmakov et al. | Clustering | 3D R-18 | V | Kinetics | 83.00 | 50.4 |
| TCE | Temporal Aug. | R-50 | V | UCF101 | 71.20 | 36.6 |
| Lorre et al. | Temporal Aug. | R-18 | V+F | UCF101 | 87.90 | 55.4 |
| CMC-Dual | Spatial Aug. | CaffeNet | V+F | UCF101 | 59.10 | 26.7 |
| SwAV | Spatial Aug. | R-50 | V | Kinetics | 74.70 | -- |
| VDIM | Spatial Aug. | R(2+1)D | V | Kinetics | 79.70 | 49.2 |
| CoCon | Spatial Aug. | R-34 | V+F+K | UCF101 | 82.40 | 53.1 |
| SimCLR | Spatial Aug. | R-50 | V | Kinetics | 84.20 | -- |
| CoCLR | Spatial Aug. | S3D-G | V+F | UCF101 | 90.60 | 62.9 |
| MoCo | Spatial Aug. | R-50 | V | Kinetics | 90.80 | -- |
| BYOL | Spatial Aug. | R-50 | V | Kinetics | 91.20 | -- |
| DVIM | Spatio-Temporal Aug. | R-18 | V+F | UCF101 | 64.00 | 29.7 |
| IIC | Spatio-Temporal Aug. | R3D | V+F | Kinetics | 74.40 | 38.3 |
| DSM | Spatio-Temporal Aug. | I3D | V | Kinetics | 78.20 | 52.8 |
| pSimCLR | Spatio-Temporal Aug. | R-50 | V | Kinetics | 87.90 | -- |
| TCLR | Spatio-Temporal Aug. | R(2+1)D | V | UCF101 | 88.20 | 60.0 |
| SeCo | Spatio-Temporal Aug. | R-50 | V | ImageNet | 88.30 | 55.6 |
| pSwaV | Spatio-Temporal Aug. | R-50 | V | Kinetics | 89.40 | -- |
| pBYOL | Spatio-Temporal Aug. | R-50 | V | Kinetics | 93.80 | -- |
| CVRL | Spatio-Temporal Aug. | 3D R-50 | V | Kinetics | 93 | -- |
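
Most spatio-temporal augmentation methods in this table (e.g. CVRL, TCLR, and the p-prefixed MoCo/SimCLR/BYOL/SwAV adaptations) sample two differently augmented clips from the same video and pull their embeddings together against other videos in the batch. The generic InfoNCE form of that loss is sketched below; individual papers add memory banks, momentum encoders, or temporal-distance weighting on top of it, so treat this as the common core rather than any single paper's recipe.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Contrastive loss between two augmented views of the same videos.

    z1, z2: (B, D) embeddings of view 1 and view 2. Row i of z1 and z2 come
    from the same source video (positive pair); all other rows are negatives.
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                    # (B, B) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)   # positives lie on the diagonal
    # Symmetric cross-entropy: view1 -> view2 and view2 -> view1.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```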

Cross-Modal Learning

Text-to-Video Retrieval

Performance for the downstream text-to-video retrieval task. Top scores for each category are in bold. Masked modeling (MM) is a generative approach that uses video together with text. Cross-modal agreement covers a variety of contrastive approaches that can use video with audio and/or text; these pre-training approaches typically perform best. Models marked with an asterisk (*) report results after fine-tuning on the target dataset (YouCook2 or MSRVTT). The pre-training dataset labeled COMBO combines CC3M, WV-2M, and COCO.

| Model | Visual | Text | Pre-Train | YouCook2 R@5 | MSRVTT R@5 |
|---|---|---|---|---|---|
| ActBERT | 3D R-32 | BERT | Kinetics+How2 | 26.70 | 23.40 |
| HERO | SlowFast | WordPieces | How2+TV | -- | 43.40 |
| ClipBERT | R-50 | WordPieces | VisualGenome | -- | 46.80 |
| VLM | S3D-g | BERT | How2 | 56.88 | 55.50 |
| UniVL | S3D-g | BERT | How2 | 57.60 | 49.60 |
| Amrani et al. | R-152 | Word2Vec | How2 | -- | 21.30 |
| MIL-NCE | S3D | Word2Vec | How2 | 38.00 | 24.00 |
| COOT | S3D-g | BERT | How2+YouCook2 | 40.20 | -- |
| CE* | Experts | NetVLAD | MSRVTT | -- | 29.00 |
| VideoClip | S3D-g | BERT | How2 | 50.40 | 22.20 |
| VATT | Linear Proj. | Linear Proj. | AS+How2 | -- | -- |
| MEE | Experts | NetVLAD | COCO | -- | 39.20 |
| JPoSE | TSN | Word2Vec | Kinetics | -- | 38.10 |
| Amrani et al.* | R-152 | Word2Vec | How2 | -- | 41.60 |
| AVLnet* | 3D R-101 | Word2Vec | How2 | 55.50 | 50.50 |
| MMT | Experts | BERT | How2 | -- | 14.40 |
| MMT* | Experts | BERT | How2 | -- | 55.70 |
| Patrick et al.* | Experts | T-5 | How2 | 58.50 | -- |
| VideoClip* | S3D-g | BERT | How2 | 62.60 | 55.40 |
| FIT | ViT | BERT | COMBO | -- | 61.50 |
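
The R@5 columns above measure, for each text query, whether its ground-truth video appears in the top 5 results ranked by similarity in the learned joint embedding space. A minimal sketch of that evaluation is given below, assuming precomputed caption and video embeddings where row i of each matrix forms a ground-truth pair; it mirrors the standard protocol rather than any single paper's released code.

```python
import torch
import torch.nn.functional as F

def text_to_video_recall_at_k(text_emb, video_emb, k=5):
    """Text-to-video Recall@K with one ground-truth video per query.

    text_emb, video_emb: (N, D) tensors; row i is an aligned caption/video pair.
    Returns the fraction of queries whose true video is ranked in the top k.
    """
    t = F.normalize(text_emb, dim=1)
    v = F.normalize(video_emb, dim=1)
    sim = t @ v.t()                                # (N, N) query-by-gallery similarities
    topk = sim.topk(k, dim=1).indices              # indices of the k most similar videos
    targets = torch.arange(t.size(0)).unsqueeze(1)
    return (topk == targets).any(dim=1).float().mean().item()
```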

Video Captioning

Downstream evaluation for video captioning on the YouCook2 dataset for video-language models. Top scores are in bold. MM: Masked modeling with video and text, and K/H: Kinetics+HowTo100M.

| Model | Category | Subcategory | Visual | Text | Pre-train | BLEU4 | METEOR | ROUGE | CIDEr |
|---|---|---|---|---|---|---|---|---|---|
| CBT | Cross-Modal | Video+Text | S3D-G | BERT | Kinetics | 5.12 | 12.97 | 30.44 | 0.64 |
| COOT | Cross-Modal | Video+Text | S3D-g | BERT | YouCook2 | 11.30 | 19.85 | 37.94 | -- |
| VideoBert | Generative | MM | S3D-g | BERT | Kinetics | 4.33 | 11.94 | 28.80 | 0.55 |
| ActBERT | Generative | MM | 3D R-32 | BERT | K/H | 5.41 | 13.30 | 30.56 | 0.65 |
| VLM | Generative | MM | S3D-g | BERT | How2 | 12.27 | 18.22 | 41.51 | 1.39 |
| UniVL | Generative | MM | S3D-g | BERT | How2 | 17.35 | 22.35 | 46.52 | 1.81 |

Action Segmentation

Downstream action segmentation evaluation on COIN for models that use a cross-modal agreement self-supervised pre-training approach. The top score is in bold.

| Model | Visual | Text | Pre-train | Frame-Acc |
|---|---|---|---|---|
| CBT | S3D-G | BERT | Kinetics+How2 | 53.90 |
| ActBERT | 3D R-32 | BERT | Kinetics+How2 | 56.95 |
| VideoClip (zs) | S3D-g | BERT | How2 | 58.90 |
| MIL-NCE | S3D | Word2Vec | How2 | 61.00 |
| VLM | S3D-g | BERT | How2 | 68.39 |
| VideoClip (ft) | S3D-g | BERT | How2 | 68.70 |
| UniVL | S3D-g | BERT | How2 | 70.20 |
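
Frame-Acc on COIN measures how many frames (or short segments) are assigned the correct step label. For zero-shot rows such as VideoClip (zs), this roughly amounts to labelling each segment with the nearest step description in the shared video-text embedding space. A hedged sketch of that procedure, assuming precomputed per-segment visual features and step-text embeddings that already live in a joint space:

```python
import torch
import torch.nn.functional as F

def assign_steps(segment_emb, step_emb):
    """Zero-shot segment-to-step assignment via the nearest text embedding.

    segment_emb: (T, D) per-segment visual features.
    step_emb:    (S, D) text embeddings of the candidate step descriptions.
    Returns a length-T tensor of predicted step indices.
    """
    sim = F.normalize(segment_emb, dim=1) @ F.normalize(step_emb, dim=1).t()  # (T, S)
    return sim.argmax(dim=1)

def frame_accuracy(pred, gt):
    """Frame-Acc: fraction of segments whose predicted step matches the label."""
    return (pred == gt).float().mean().item()
```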

Temporal Action Step Localization

Downstream temporal action step localization evaluation on CrossTask for models that use a contrastive multimodal self-supervised pre-training approach. Top scores are in bold.

| Model | Visual | Text | Pre-train | Recall |
|---|---|---|---|---|
| VideoClip (zs) | S3D-g | BERT | How2 | 33.90 |
| MIL-NCE | S3D | Word2Vec | How2 | 40.50 |
| ActBERT | 3D R-32 | BERT | Kinetics+How2 | 41.40 |
| UniVL | S3D-g | BERT | How2 | 42.00 |
| VLM | S3D-g | BERT | How2 | 46.50 |
| VideoClip (ft) | S3D-g | BERT | How2 | 47.30 |

Evaluation Tasks

Action Recognition

Downstream action recognition on UCF101 and HMDB51 across all self-supervised pre-training categories, including pretext tasks, contrastive learning, and cross-modal agreement. Top scores for each category are in bold. Modalities include video (V), optical flow (F), human keypoints (K), text (T), and audio (A). Spatio-temporal augmentations with contrastive learning are typically the highest-performing approaches.

| Model | Subcategory | Visual Backbone | Modalities | Pre-Train | UCF101 | HMDB51 |
|---|---|---|---|---|---|---|
| Geometry | Appearance | AlexNet | V | UCF101/HMDB51 | 54.10 | 22.60 |
| Wang et al. | Appearance | C3D | V | UCF101 | 61.20 | 33.40 |
| 3D RotNet | Appearance | 3D R-18 | V | MT | 62.90 | 33.70 |
| VideoJigsaw | Jigsaw | CaffeNet | V | Kinetics | 54.70 | 27.00 |
| 3D ST-puzzle | Jigsaw | C3D | V | Kinetics | 65.80 | 33.70 |
| CSJ | Jigsaw | R(2+3)D | V | Kinetics+UCF101+HMDB51 | 79.50 | 52.60 |
| PRP | Speed | R3D | V | Kinetics | 72.10 | 35.00 |
| SpeedNet | Speed | S3D-G | V | Kinetics | 81.10 | 48.80 |
| Jenni et al. | Speed | R(2+1)D | V | UCF101 | 87.10 | 49.80 |
| PacePred | Speed | S3D-G | V | UCF101 | 87.10 | 52.60 |
| ShuffleLearn | Temporal Order | AlexNet | V | UCF101 | 50.90 | 19.80 |
| OPN | Temporal Order | VGG-M | V | UCF101 | 59.80 | 23.80 |
| O3N | Temporal Order | AlexNet | V | UCF101 | 60.30 | 32.50 |
| ClipOrder | Temporal Order | R3D | V | UCF101 | 72.40 | 30.90 |
| VIE | Clustering | Slowfast | V | Kinetics | 78.90 | 50.1 |
| VIE-2pathway | Clustering | R-18 | V | Kinetics | 80.40 | 52.5 |
| Tokmakov et al. | Clustering | 3D R-18 | V | Kinetics | 83.00 | 50.4 |
| TCE | Temporal Aug. | R-50 | V | UCF101 | 71.20 | 36.6 |
| Lorre et al. | Temporal Aug. | R-18 | V+F | UCF101 | 87.90 | 55.4 |
| CMC-Dual | Spatial Aug. | CaffeNet | V+F | UCF101 | 59.10 | 26.7 |
| SwAV | Spatial Aug. | R-50 | V | Kinetics | 74.70 | -- |
| VDIM | Spatial Aug. | R(2+1)D | V | Kinetics | 79.70 | 49.2 |
| CoCon | Spatial Aug. | R-34 | V+F+K | UCF101 | 82.40 | 53.1 |
| SimCLR | Spatial Aug. | R-50 | V | Kinetics | 84.20 | -- |
| CoCLR | Spatial Aug. | S3D-G | V+F | UCF101 | 90.60 | 62.9 |
| MoCo | Spatial Aug. | R-50 | V | Kinetics | 90.80 | -- |
| BYOL | Spatial Aug. | R-50 | V | Kinetics | 91.20 | -- |
| MIL-NCE | Cross-Modal | S3D-G | V+T | How2 | 91.30 | 61.0 |
| GDT | Cross-Modal | R(2+1)D | V+A | Kinetics | 95.50 | 72.8 |
| CBT | Cross-Modal | S3D-G | V+T | Kinetics | 79.50 | 44.6 |
| VATT | Cross-Modal | Transformer | V+T | AS+How2 | 85.50 | 64.8 |
| AVTS | Cross-Modal | MC3 | V+A | Kinetics | 85.80 | 56.9 |
| AVID+Cross | Cross-Modal | R(2+1)D | V+A | Kinetics | 91.00 | 64.1 |
| AVID+CMA | Cross-Modal | R(2+1)D | V+A | Kinetics | 91.50 | 64.7 |
| MMV-FAC | Cross-Modal | TSM | V+T+A | AS+How2 | 91.80 | 67.1 |
| XDC | Cross-Modal | R(2+1)D | V+A | Kinetics | 95.50 | 68.9 |
| DVIM | Spatio-Temporal Aug. | R-18 | V+F | UCF101 | 64.00 | 29.7 |
| IIC | Spatio-Temporal Aug. | R3D | V+F | Kinetics | 74.40 | 38.3 |
| DSM | Spatio-Temporal Aug. | I3D | V | Kinetics | 78.20 | 52.8 |
| pSimCLR | Spatio-Temporal Aug. | R-50 | V | Kinetics | 87.90 | -- |
| TCLR | Spatio-Temporal Aug. | R(2+1)D | V | UCF101 | 88.20 | 60.0 |
| SeCo | Spatio-Temporal Aug. | R-50 | V | ImageNet | 88.30 | 55.6 |
| pSwaV | Spatio-Temporal Aug. | R-50 | V | Kinetics | 89.40 | -- |
| pBYOL | Spatio-Temporal Aug. | R-50 | V | Kinetics | 93.80 | -- |
| CVRL | Spatio-Temporal Aug. | 3D R-50 | V | Kinetics | 93 | -- |
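
The downstream numbers in these tables come from taking the self-supervised backbone and either fully fine-tuning it or training only a linear classifier on the labelled target set (UCF101, HMDB51, etc.). A minimal sketch of the linear-probe variant is shown below, assuming a frozen, pre-trained `backbone` that maps a clip to a feature vector; data loading, augmentation, and learning-rate scheduling are omitted and the hyper-parameters are illustrative.

```python
import torch
import torch.nn as nn

def linear_probe(backbone, feat_dim, num_classes, loader, epochs=10, lr=1e-3):
    """Train only a linear head on top of frozen self-supervised features."""
    backbone.eval()                              # freeze the pre-trained encoder
    for p in backbone.parameters():
        p.requires_grad_(False)

    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(head.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for clips, labels in loader:             # clips: (B, C, T, H, W)
            with torch.no_grad():
                feats = backbone(clips)          # (B, feat_dim)
            loss = loss_fn(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```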

Downstream evaluation of self-supervised methods on action recognition, measured by prediction accuracy on Something-Something (SS) and Kinetics400 (Kinetics). SS depends more heavily on temporal reasoning and is therefore more challenging. Top scores for each category are in bold and second-best scores are underlined.

| Model | Category | Subcategory | Visual Backbone | Pre-Train | SS | Kinetics |
|---|---|---|---|---|---|---|
| pSwaV | Contrastive | View Aug. | R-50 | Kinetics | 51.7 | 62.7 |
| pSimCLR | Contrastive | View Aug. | R-50 | Kinetics | 52.0 | 62.0 |
| pMoCo | Contrastive | View Aug. | R-50 | Kinetics | 54.4 | 69.0 |
| pBYOL | Contrastive | View Aug. | R-50 | Kinetics | 55.8 | 71.5 |
| BEVT | Generative | MAE | SWIN-B | Kinetics+ImageNet | 71.4 | 81.1 |
| MAE | Generative | MAE | ViT-H | Kinetics | 74.1 | 81.1 |
| MaskFeat | Generative | MAE | MViT | Kinetics | 74.4 | 86.7 |
| VideoMAE | Generative | MAE | ViT-L | ImageNet | 75.3 | 85.1 |
| MotionMAE | Generative | MAE | ViT-B | Kinetics | 75.5 | 81.7 |

Video Retrieval

Performance for the downstream video retrieval task with top scores for each category in bold. K/U/H indicates using all three datasets for pre-training, i.e. Kinetics, UCF101, and HMDB51.

| Model | Category | Subcategory | Visual Backbone | Pre-train | UCF101 R@5 | HMDB51 R@5 |
|---|---|---|---|---|---|---|
| SpeedNet | Pretext | Speed | S3D-G | Kinetics | 28.10 | -- |
| ClipOrder | Pretext | Temporal Order | R3D | UCF101 | 30.30 | 22.90 |
| OPN | Pretext | Temporal Order | CaffeNet | UCF101 | 28.70 | -- |
| CSJ | Pretext | Jigsaw | R(2+3)D | K/U/H | 40.50 | -- |
| PRP | Pretext | Speed | R3D | Kinetics | 38.50 | 27.20 |
| Jenni et al. | Pretext | Speed | 3D R-18 | Kinetics | 48.50 | -- |
| PacePred | Pretext | Speed | R(2+1)D | UCF101 | 49.70 | 32.20 |
| MemDPC-RGB | Generative | Frame Prediction | R(2+3)D | Kinetics | 40.40 | 25.70 |
| MemDPC-Flow | Generative | Frame Prediction | R(2+3)D | Kinetics | 63.20 | 37.60 |
| DSM | Contrastive | Spatio-Temporal | I3D | Kinetics | 35.20 | 25.90 |
| IIC | Contrastive | Spatio-Temporal | R-18 | UCF101 | 60.90 | 42.90 |
| SeLaVi | Cross-Modal | Video+Audio | R(2+1)D | Kinetics | 68.60 | 47.60 |
| CoCLR | Contrastive | View Augmentation | S3D-G | UCF101 | 70.80 | 45.80 |
| GDT | Cross-Modal | Video+Audio | R(2+1)D | Kinetics | 79.00 | 51.70 |

Video Captioning

Downstream evaluation for video captioning on the YouCook2 dataset for video-language models. Top scores are in bold. MM: Masked modeling with video and text, and K/H: Kinetics+HowTo100M.

| Model | Category | Subcategory | Visual | Text | Pre-train | BLEU4 | METEOR | ROUGE | CIDEr |
|---|---|---|---|---|---|---|---|---|---|
| CBT | Cross-Modal | Video+Text | S3D-G | BERT | Kinetics | 5.12 | 12.97 | 30.44 | 0.64 |
| COOT | Cross-Modal | Video+Text | S3D-g | BERT | YouCook2 | 11.30 | 19.85 | 37.94 | -- |
| VideoBert | Generative | MM | S3D-g | BERT | Kinetics | 4.33 | 11.94 | 28.80 | 0.55 |
| ActBERT | Generative | MM | 3D R-32 | BERT | K/H | 5.41 | 13.30 | 30.56 | 0.65 |
| VLM | Generative | MM | S3D-g | BERT | How2 | 12.27 | 18.22 | 41.51 | 1.39 |
| UniVL | Generative | MM | S3D-g | BERT | How2 | 17.35 | 22.35 | 46.52 | 1.81 |

Text-to-Video Retrieval

Performance for the downstream text-to-video retrieval task. Top scores for each category are in bold. Masked modeling (MM) is a generative approach that uses video together with text. Cross-modal agreement covers a variety of contrastive approaches that can use video with audio and/or text; these pre-training approaches typically perform best. Models marked with an asterisk (*) report results after fine-tuning on the target dataset (YouCook2 or MSRVTT). The pre-training dataset labeled COMBO combines CC3M, WV-2M, and COCO.

| Model | Visual | Text | Pre-Train | YouCook2 R@5 | MSRVTT R@5 |
|---|---|---|---|---|---|
| ActBERT | 3D R-32 | BERT | Kinetics+How2 | 26.70 | 23.40 |
| HERO | SlowFast | WordPieces | How2+TV | -- | 43.40 |
| ClipBERT | R-50 | WordPieces | VisualGenome | -- | 46.80 |
| VLM | S3D-g | BERT | How2 | 56.88 | 55.50 |
| UniVL | S3D-g | BERT | How2 | 57.60 | 49.60 |
| Amrani et al. | R-152 | Word2Vec | How2 | -- | 21.30 |
| MIL-NCE | S3D | Word2Vec | How2 | 38.00 | 24.00 |
| COOT | S3D-g | BERT | How2+YouCook2 | 40.20 | -- |
| CE* | Experts | NetVLAD | MSRVTT | -- | 29.00 |
| VideoClip | S3D-g | BERT | How2 | 50.40 | 22.20 |
| VATT | Linear Proj. | Linear Proj. | AS+How2 | -- | -- |
| MEE | Experts | NetVLAD | COCO | -- | 39.20 |
| JPoSE | TSN | Word2Vec | Kinetics | -- | 38.10 |
| Amrani et al.* | R-152 | Word2Vec | How2 | -- | 41.60 |
| AVLnet* | 3D R-101 | Word2Vec | How2 | 55.50 | 50.50 |
| MMT | Experts | BERT | How2 | -- | 14.40 |
| MMT* | Experts | BERT | How2 | -- | 55.70 |
| Patrick et al.* | Experts | T-5 | How2 | 58.50 | -- |
| VideoClip* | S3D-g | BERT | How2 | 62.60 | 55.40 |
| FIT | ViT | BERT | COMBO | -- | 61.50 |

Datasets

| Dataset | Labels | Modalities | Classes | Videos | Tasks |
|---|---|---|---|---|---|
| ActivityNet (ActN) | Activity, Captions, Bounding Box | Video, Video+Text | 200 | 19,995 | Action Recognition, Video Captioning, Video Grounding |
| AVA | Activity, Face Tracks | Video, Video+Audio | 80 | 430 | Action Recognition, Audio-Visual Grounding |
| Breakfast | Activity | Video | 10 | 1,989 | Action Recognition, Action Segmentation |
| Charades | Activity, Objects, Indoor Scenes, Verbs | Video | 157 | 9,848 | Action Recognition, Object Recognition, Scene Recognition, Temporal Action Step Localization |
| COIN | Activity, Temporal Actions, ASR | Video, Video+Text | 180 | 11,827 | Action Recognition, Action Segmentation, Video Retrieval |
| CrossTask | Temporal Steps, Activity | Video | 83 | 4,700 | Temporal Action Step Localization, Action Recognition |
| HMDB51 | Activity | Video | 51 | 6,849 | Action Recognition, Video Retrieval |
| HowTo100M (How2) | ASR | Video+Text | - | 136M | Text-to-Video Retrieval, VideoQA |
| Kinetics | Activity | Video | 400/600/700 | ~0.5M | Action Recognition |
| MSRVTT | Activity, Captions | Video+Text | 20 | 10,000 | Action Recognition, Video Captioning, Video Retrieval, Visual Question Answering |
| MultiThumos | Activity, Temporal Steps | Video | 65 | 400 | Action Recognition, Temporal Action Step Localization |
| UCF101 | Activity | Video | 101 | 13,320 | Action Recognition, Video Retrieval |
| YouCook2 | Captions | Video+Text | 89 | 2,000 | Video Captioning, Video Retrieval |
| YouTube-8M | Activity | Video | 4,716 | 8M | Action Recognition |

Citation

@article{schiappa_survey_ssl_video,
author = {Schiappa, Madeline C. and Rawat, Yogesh S. and Shah, Mubarak},
title = {Self-Supervised Learning for Videos: A Survey},
year = {2023},
issue_date = {December 2023},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {55},
number = {13s},
issn = {0360-0300},
url = {https://doi.org/10.1145/3577925},
doi = {10.1145/3577925},
journal = {ACM Comput. Surv.},
month = {jul},
articleno = {288},
numpages = {37},
}