SSL4VideoSurvey

A collection of works on self-supervised deep learning for video. The papers listed here refer to our survey:

Self-Supervised Learning for Videos: A Survey

Madeline Chantry Schiappa, Yogesh Singh Rawat, Mubarak Shah

Summary

In this survey, we provide a review of existing approaches to self-supervised learning, focusing on the video domain. We summarize these methods into four categories based on their learning objectives: 1) pretext tasks, 2) generative learning, 3) contrastive learning, and 4) cross-modal agreement. We further introduce the commonly used datasets, downstream evaluation tasks, insights into the limitations of existing works, and potential future directions in this area.

Overview of publications. Statistics of self-supervised (SSL) video representation learning research in recent years. From left to right: (a) the total number of SSL-related papers published in top conference venues, (b) a categorical breakdown of the main research topics studied in SSL, and (c) a breakdown of the main modalities used in SSL. The year 2022 remains incomplete because a majority of the conferences occur later in the year.

Overview of publications related to action recognition. Action recognition performance of models over time for different self-supervised strategies, including different modalities: video-only (V), video-text (V+T), video-audio (V+A), and video-text-audio (V+T+A). More recently, contrastive learning has become the most popular strategy.

Training Tasks

Pre-Text Learning

Action Recognition

Downstream evaluation of action recognition for pretext-task self-supervised learning, measured by prediction accuracy. Top scores are in bold. Playback-speed tasks typically perform best.

| Model | Subcategory | Visual Backbone | Pre-Train | UCF101 | HMDB51 |
|---|---|---|---|---|---|
| Geometry | Appearance | AlexNet | UCF101/HMDB51 | 54.10 | 22.60 |
| Wang et al. | Appearance | C3D | UCF101 | 61.20 | 33.40 |
| 3D RotNet | Appearance | 3D R-18 | MT | 62.90 | 33.70 |
| VideoJigsaw | Jigsaw | CaffeNet | Kinetics | 54.70 | 27.00 |
| 3D ST-puzzle | Jigsaw | C3D | Kinetics | 65.80 | 33.70 |
| CSJ | Jigsaw | R(2+3)D | Kinetics+UCF101+HMDB51 | 79.50 | 52.60 |
| PRP | Speed | R3D | Kinetics | 72.10 | 35.00 |
| SpeedNet | Speed | S3D-G | Kinetics | 81.10 | 48.80 |
| Jenni et al. | Speed | R(2+1)D | UCF101 | 87.10 | 49.80 |
| PacePred | Speed | S3D-G | UCF101 | 87.10 | 52.60 |
| ShuffleLearn | Temporal Order | AlexNet | UCF101 | 50.90 | 19.80 |
| OPN | Temporal Order | VGG-M | UCF101 | 59.80 | 23.80 |
| O3N | Temporal Order | AlexNet | UCF101 | 60.30 | 32.50 |
| ClipOrder | Temporal Order | R3D | UCF101 | 72.40 | 30.90 |
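
The playback-speed entries above (PRP, SpeedNet, PacePred, Jenni et al.) train a network to recognize how fast a clip is being played, so the label comes for free from the sampling rate. Below is a minimal, illustrative sketch of such a pretext objective in PyTorch; the `backbone`, candidate speeds, and clip length are assumptions for this example, not the exact setup of any of the listed papers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

SPEEDS = [1, 2, 4, 8]  # candidate playback rates (illustrative choice)

def subsample(clip, rate, num_frames=16):
    """Simulate playback speed by keeping every `rate`-th frame.

    clip: (C, T, H, W) tensor with T >= rate * num_frames.
    """
    idx = torch.arange(num_frames) * rate
    return clip[:, idx]  # (C, num_frames, H, W)

class SpeedClassifier(nn.Module):
    """Pretext head: predict which playback rate produced the clip."""

    def __init__(self, backbone, feat_dim, num_speeds=len(SPEEDS)):
        super().__init__()
        self.backbone = backbone  # any 3D encoder, e.g. an R3D or S3D-G variant
        self.head = nn.Linear(feat_dim, num_speeds)

    def forward(self, clips):  # clips: (B, C, num_frames, H, W)
        return self.head(self.backbone(clips))

def pretext_step(model, raw_clips):
    """One self-supervised step: the labels are the sampled speeds themselves."""
    labels = torch.randint(len(SPEEDS), (raw_clips.size(0),))
    clips = torch.stack([subsample(c, SPEEDS[int(l)]) for c, l in zip(raw_clips, labels)])
    return F.cross_entropy(model(clips), labels)
```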

Video Retrieval

Performance for the downstream video retrieval task with top scores for each category in bold. K/U/H indicates using all three datasets for pre-training, i.e. Kinetics, UCF101, and HMDB51.

| Model | Category | Subcategory | Visual Backbone | Pre-train | UCF101 R@5 | HMDB51 R@5 |
|---|---|---|---|---|---|---|
| SpeedNet | Pretext | Speed | S3D-G | Kinetics | 28.10 | -- |
| ClipOrder | Pretext | Temporal Order | R3D | UCF101 | 30.30 | 22.90 |
| OPN | Pretext | Temporal Order | CaffeNet | UCF101 | 28.70 | -- |
| CSJ | Pretext | Jigsaw | R(2+3)D | K/U/H | 40.50 | -- |
| PRP | Pretext | Speed | R3D | Kinetics | 38.50 | 27.20 |
| Jenni et al. | Pretext | Speed | 3D R-18 | Kinetics | 48.50 | -- |
| PacePred | Pretext | Speed | R(2+1)D | UCF101 | 49.70 | 32.20 |

Generative Learning

Action Recognition

Downstream action recognition evaluation for models that use a generative self-supervised pre-training approach. Top scores are in bold.

| Model | Subcategory | Visual Backbone | Pre-train | UCF101 | HMDB51 |
|---|---|---|---|---|---|
| Mathieu et al. | Frame Prediction | C3D | Sports1M | 52.10 | -- |
| VideoGan | Reconstruction | VAE | Flickr | 52.90 | -- |
| Liang et al. | Frame Prediction | LSTM | UCF101 | 55.10 | -- |
| VideoMoCo | Frame Prediction | R(2+1)D | Kinetics | 78.70 | 49.20 |
| MemDPC-Dual | Frame Prediction | R(2+3)D | Kinetics | 86.10 | 54.50 |
| Tian et al. | Reconstruction | 3D R-101 | Kinetics | 88.10 | 59.00 |
| VideoMAE | MAE | ViT-L | ImageNet | 91.3 | 62.6 |
| MotionMAE | MAE | ViT-B | Kinetics | 96.3 | -- |

Downstream evaluation of self-supervised methods on action recognition, measured by prediction accuracy on Something-Something (SS) and Kinetics400 (Kinetics). SS depends more heavily on temporal reasoning and is therefore more challenging. Top scores for each category are in bold and second-best scores are underlined.

| Model | Category | Subcategory | Visual Backbone | Pre-Train | SS | Kinetics |
|---|---|---|---|---|---|---|
| BEVT | Generative | MAE | SWIN-B | Kinetics+ImageNet | 71.4 | 81.1 |
| MAE | Generative | MAE | ViT-H | Kinetics | 74.1 | 81.1 |
| MaskFeat | Generative | MAE | MViT | Kinetics | 74.4 | 86.7 |
| VideoMAE | Generative | MAE | ViT-L | ImageNet | 75.3 | 85.1 |
| MotionMAE | Generative | MAE | ViT-B | Kinetics | 75.5 | 81.7 |
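
The MAE-style models above (BEVT, MAE, MaskFeat, VideoMAE, MotionMAE) share one recipe: mask a large fraction of space-time patches and train the network to reconstruct what was hidden. A minimal sketch of that objective on flattened patch tokens is shown below; the `encoder`/`decoder` interfaces, the 90% masking ratio, and plain pixel targets are assumptions for illustration (the papers differ, e.g. MaskFeat regresses HOG features and VideoMAE uses tube masking).

```python
import torch
import torch.nn.functional as F

def masked_reconstruction_loss(encoder, decoder, patches, mask_ratio=0.9):
    """Masked-autoencoder objective on video patch tokens.

    patches: (B, N, D) flattened space-time patches for a batch of clips.
    Only the visible tokens are encoded; the decoder predicts the masked ones.
    """
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))

    # Random masking: an independent permutation of patch indices per sample.
    ids_shuffle = torch.rand(B, N).argsort(dim=1)
    ids_keep, ids_mask = ids_shuffle[:, :num_keep], ids_shuffle[:, num_keep:]

    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    latent = encoder(visible)                    # (B, num_keep, D'); interface is illustrative
    pred = decoder(latent, ids_keep, ids_mask)   # (B, N - num_keep, D); interface is illustrative

    target = torch.gather(patches, 1, ids_mask.unsqueeze(-1).expand(-1, -1, D))
    return F.mse_loss(pred, target)              # reconstruction error on masked patches only
```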

Video Retrieval

Performance for the downstream video retrieval task with top scores for each category in bold. K/U/H indicates using all three datasets for pre-training, i.e. Kinetics, UCF101, and HMDB51.

| Model | Category | Subcategory | Visual Backbone | Pre-train | UCF101 R@5 | HMDB51 R@5 |
|---|---|---|---|---|---|---|
| MemDPC-RGB | Generative | Frame Prediction | R(2+3)D | Kinetics | 40.40 | 25.70 |
| MemDPC-Flow | Generative | Frame Prediction | R(2+3)D | Kinetics | 63.20 | 37.60 |

Text-to-Video Retrieval

Performance for the downstream text-to-video retrieval task. Top scores for each category are in bold. Masked modeling (MM) is a generative approach that uses video together with text. Cross-modal agreement covers a variety of contrastive approaches that can use video with audio and/or text; these pre-training approaches typically perform best. Models marked with an asterisk (*) report results after fine-tuning on the target dataset (YouCook2 or MSRVTT). The pre-training dataset labeled COMBO combines CC3M, WV-2M, and COCO.

| Model | Visual | Text | Pre-Train | YouCook2 R@5 | MSRVTT R@5 |
|---|---|---|---|---|---|
| ActBERT | 3D R-32 | BERT | Kinetics+How2 | 26.70 | 23.40 |
| HERO | SlowFast | WordPieces | How2+TV | -- | 43.40 |
| ClipBERT | R-50 | WordPieces | VisualGenome | -- | 46.80 |
| VLM | S3D-g | BERT | How2 | 56.88 | 55.50 |
| UniVL | S3D-g | BERT | How2 | 57.60 | 49.60 |
| Amrani et al. | R-152 | Word2Vec | How2 | -- | 21.30 |

Video Captioning

Downstream evaluation for video captioning on the YouCook2 dataset for video-language models. Top scores are in bold. MM: Masked modeling with video and text, and K/H: Kinetics+HowTo100M.

| Model | Category | Subcategory | Visual | Text | Pre-train | BLEU4 | METEOR | ROUGE | CIDEr |
|---|---|---|---|---|---|---|---|---|---|
| VideoBert | Generative | MM | S3D-g | BERT | Kinetics | 4.33 | 11.94 | 28.80 | 0.55 |
| ActBERT | Generative | MM | 3D R-32 | BERT | K/H | 5.41 | 13.30 | 30.56 | 0.65 |
| VLM | Generative | MM | S3D-g | BERT | How2 | 12.27 | 18.22 | 41.51 | 1.39 |
| UniVL | Generative | MM | S3D-g | BERT | How2 | 17.35 | 22.35 | 46.52 | 1.81 |

Contrastive Learning

Action Recognition

Downstream evaluation of self-supervised methods on action recognition, measured by prediction accuracy on Something-Something (SS) and Kinetics400 (Kinetics). SS depends more heavily on temporal reasoning and is therefore more challenging. Top scores for each category are in bold and second-best scores are underlined.

| Model | Category | Subcategory | Visual Backbone | Pre-Train | SS | Kinetics |
|---|---|---|---|---|---|---|
| pSwaV | Contrastive | View Aug. | R-50 | Kinetics | 51.7 | 62.7 |
| pSimCLR | Contrastive | View Aug. | R-50 | Kinetics | 52.0 | 62.0 |
| pMoCo | Contrastive | View Aug. | R-50 | Kinetics | 54.4 | 69.0 |
| pBYOL | Contrastive | View Aug. | R-50 | Kinetics | 55.8 | 71.5 |

Contrastive

Action Recognition

Downstream action recognition on UCF101 and HMDB51 for models that use contrastive learning and/or cross-modal agreement. Top scores for each category are in bold. Modalities include video (V), optical flow (F), human keypoints (K), text (T), and audio (A). Spatio-temporal augmentations with contrastive learning are typically the highest-performing approaches.

| Model | Subcategory | Visual Backbone | Modalities | Pre-Train | UCF101 | HMDB51 |
|---|---|---|---|---|---|---|
| VIE | Clustering | Slowfast | V | Kinetics | 78.90 | 50.1 |
| VIE-2pathway | Clustering | R-18 | V | Kinetics | 80.40 | 52.5 |
| Tokmakov et al. | Clustering | 3D R-18 | V | Kinetics | 83.00 | 50.4 |
| TCE | Temporal Aug. | R-50 | V | UCF101 | 71.20 | 36.6 |
| Lorre et al. | Temporal Aug. | R-18 | V+F | UCF101 | 87.90 | 55.4 |
| CMC-Dual | Spatial Aug. | CaffeNet | V+F | UCF101 | 59.10 | 26.7 |
| SwAV | Spatial Aug. | R-50 | V | Kinetics | 74.70 | -- |
| VDIM | Spatial Aug. | R(2+1)D | V | Kinetics | 79.70 | 49.2 |
| CoCon | Spatial Aug. | R-34 | V+F+K | UCF101 | 82.40 | 53.1 |
| SimCLR | Spatial Aug. | R-50 | V | Kinetics | 84.20 | -- |
| CoCLR | Spatial Aug. | S3D-G | V+F | UCF101 | 90.60 | 62.9 |
| MoCo | Spatial Aug. | R-50 | V | Kinetics | 90.80 | -- |
| BYOL | Spatial Aug. | R-50 | V | Kinetics | 91.20 | -- |
| DVIM | Spatio-Temporal Aug. | R-18 | V+F | UCF101 | 64.00 | 29.7 |
| IIC | Spatio-Temporal Aug. | R3D | V+F | Kinetics | 74.40 | 38.3 |
| DSM | Spatio-Temporal Aug. | I3D | V | Kinetics | 78.20 | 52.8 |
| pSimCLR | Spatio-Temporal Aug. | R-50 | V | Kinetics | 87.90 | -- |
| TCLR | Spatio-Temporal Aug. | R(2+1)D | V | UCF101 | 88.20 | 60.0 |
| SeCo | Spatio-Temporal Aug. | R-50 | V | ImageNet | 88.30 | 55.6 |
| pSwaV | Spatio-Temporal Aug. | R-50 | V | Kinetics | 89.40 | -- |
| pBYOL | Spatio-Temporal Aug. | R-50 | V | Kinetics | 93.80 | -- |
| CVRL | Spatio-Temporal Aug. | 3D R-50 | V | Kinetics | 93 | -- |
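
Most spatio-temporal augmentation methods in this table (e.g. CVRL, TCLR, and the p-prefixed MoCo/SimCLR/BYOL/SwAV adaptations) sample two differently augmented clips from the same video and pull their embeddings together against other videos in the batch. The generic InfoNCE form of that loss is sketched below; individual papers add memory banks, momentum encoders, or temporal-distance weighting on top of it, so treat this as the common core rather than any single paper's recipe.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Contrastive loss between two augmented views of the same videos.

    z1, z2: (B, D) embeddings of view 1 and view 2. Row i of z1 and z2 come
    from the same source video (positive pair); all other rows are negatives.
    """
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                    # (B, B) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)   # positives lie on the diagonal
    # Symmetric cross-entropy: view1 -> view2 and view2 -> view1.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```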

Cross-Modal Learning

Text-to-Video Retrieval

Performance for the downstream text-to-video retrieval task. Top scores for each category are in bold. Masked modeling (MM) is a generative approach that uses video together with text. Cross-modal agreement covers a variety of contrastive approaches that can use video with audio and/or text; these pre-training approaches typically perform best. Models marked with an asterisk (*) report results after fine-tuning on the target dataset (YouCook2 or MSRVTT). The pre-training dataset labeled COMBO combines CC3M, WV-2M, and COCO.

| Model | Visual | Text | Pre-Train | YouCook2 R@5 | MSRVTT R@5 |
|---|---|---|---|---|---|
| ActBERT | 3D R-32 | BERT | Kinetics+How2 | 26.70 | 23.40 |
| HERO | SlowFast | WordPieces | How2+TV | -- | 43.40 |
| ClipBERT | R-50 | WordPieces | VisualGenome | -- | 46.80 |
| VLM | S3D-g | BERT | How2 | 56.88 | 55.50 |
| UniVL | S3D-g | BERT | How2 | 57.60 | 49.60 |
| Amrani et al. | R-152 | Word2Vec | How2 | -- | 21.30 |
| MIL-NCE | S3D | Word2Vec | How2 | 38.00 | 24.00 |
| COOT | S3D-g | BERT | How2+YouCook2 | 40.20 | -- |
| CE* | Experts | NetVLAD | MSRVTT | -- | 29.00 |
| VideoClip | S3D-g | BERT | How2 | 50.40 | 22.20 |
| VATT | Linear Proj. | Linear Proj. | AS+How2 | -- | -- |
| MEE | Experts | NetVLAD | COCO | -- | 39.20 |
| JPoSE | TSN | Word2Vec | Kinetics | -- | 38.10 |
| Amrani et al.* | R-152 | Word2Vec | How2 | -- | 41.60 |
| AVLnet* | 3D R-101 | Word2Vec | How2 | 55.50 | 50.50 |
| MMT | Experts | BERT | How2 | -- | 14.40 |
| MMT* | Experts | BERT | How2 | -- | 55.70 |
| Patrick et al.* | Experts | T-5 | How2 | 58.50 | -- |
| VideoClip* | S3D-g | BERT | How2 | 62.60 | 55.40 |
| FIT | ViT | BERT | COMBO | -- | 61.50 |
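
The R@5 columns above measure, for each text query, whether its ground-truth video appears in the top 5 results ranked by similarity in the learned joint embedding space. A minimal sketch of that evaluation is given below, assuming precomputed caption and video embeddings where row i of each matrix forms a ground-truth pair; it mirrors the standard protocol rather than any single paper's released code.

```python
import torch
import torch.nn.functional as F

def text_to_video_recall_at_k(text_emb, video_emb, k=5):
    """Text-to-video Recall@K with one ground-truth video per query.

    text_emb, video_emb: (N, D) tensors; row i is an aligned caption/video pair.
    Returns the fraction of queries whose true video is ranked in the top k.
    """
    t = F.normalize(text_emb, dim=1)
    v = F.normalize(video_emb, dim=1)
    sim = t @ v.t()                                # (N, N) query-by-gallery similarities
    topk = sim.topk(k, dim=1).indices              # indices of the k most similar videos
    targets = torch.arange(t.size(0)).unsqueeze(1)
    return (topk == targets).any(dim=1).float().mean().item()
```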

Video Captioning

Downstream evaluation for video captioning on the YouCook2 dataset for video-language models. Top scores are in bold. MM: Masked modeling with video and text, and K/H: Kinetics+HowTo100M.

| Model | Category | Subcategory | Visual | Text | Pre-train | BLEU4 | METEOR | ROUGE | CIDEr |
|---|---|---|---|---|---|---|---|---|---|
| CBT | Cross-Modal | Video+Text | S3D-G | BERT | Kinetics | 5.12 | 12.97 | 30.44 | 0.64 |
| COOT | Cross-Modal | Video+Text | S3D-g | BERT | YouCook2 | 11.30 | 19.85 | 37.94 | -- |
| VideoBert | Generative | MM | S3D-g | BERT | Kinetics | 4.33 | 11.94 | 28.80 | 0.55 |
| ActBERT | Generative | MM | 3D R-32 | BERT | K/H | 5.41 | 13.30 | 30.56 | 0.65 |
| VLM | Generative | MM | S3D-g | BERT | How2 | 12.27 | 18.22 | 41.51 | 1.39 |
| UniVL | Generative | MM | S3D-g | BERT | How2 | 17.35 | 22.35 | 46.52 | 1.81 |

Action Segmentation

Downstream action segmentation evaluation on COIN for models that use a cross-modal agreement self-supervised pre-training approach. The top score is in bold.

| Model | Visual | Text | Pre-train | Frame-Acc |
|---|---|---|---|---|
| CBT | S3D-G | BERT | Kinetics+How2 | 53.90 |
| ActBERT | 3D R-32 | BERT | Kinetics+How2 | 56.95 |
| VideoClip (zs) | S3D-g | BERT | How2 | 58.90 |
| MIL-NCE | S3D | Word2Vec | How2 | 61.00 |
| VLM | S3D-g | BERT | How2 | 68.39 |
| VideoClip (ft) | S3D-g | BERT | How2 | 68.70 |
| UniVL | S3D-g | BERT | How2 | 70.20 |
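
Frame-Acc on COIN measures how many frames (or short segments) are assigned the correct step label. For zero-shot rows such as VideoClip (zs), this roughly amounts to labelling each segment with the nearest step description in the shared video-text embedding space. A hedged sketch of that procedure, assuming precomputed per-segment visual features and step-text embeddings that already live in a joint space:

```python
import torch
import torch.nn.functional as F

def assign_steps(segment_emb, step_emb):
    """Zero-shot segment-to-step assignment via the nearest text embedding.

    segment_emb: (T, D) per-segment visual features.
    step_emb:    (S, D) text embeddings of the candidate step descriptions.
    Returns a length-T tensor of predicted step indices.
    """
    sim = F.normalize(segment_emb, dim=1) @ F.normalize(step_emb, dim=1).t()  # (T, S)
    return sim.argmax(dim=1)

def frame_accuracy(pred, gt):
    """Frame-Acc: fraction of segments whose predicted step matches the label."""
    return (pred == gt).float().mean().item()
```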

Temporal Action Step Localization

Downstream temporal action step localization evaluation on CrossTask for models that use a contrastive multimodal self-supervised pre-training approach. Top scores are in bold.

| Model | Visual | Text | Pre-train | Recall |
|---|---|---|---|---|
| VideoClip (zs) | S3D-g | BERT | How2 | 33.90 |
| MIL-NCE | S3D | Word2Vec | How2 | 40.50 |
| ActBERT | 3D R-32 | BERT | Kinetics+How2 | 41.40 |
| UniVL | S3D-g | BERT | How2 | 42.00 |
| VLM | S3D-g | BERT | How2 | 46.50 |
| VideoClip (ft) | S3D-g | BERT | How2 | 47.30 |

Evaluation Tasks

Action Recognition

Downstream action recognition on UCF101 and HMDB51 across all self-supervised pre-training categories, including pretext tasks, contrastive learning, and cross-modal agreement. Top scores for each category are in bold. Modalities include video (V), optical flow (F), human keypoints (K), text (T), and audio (A). Spatio-temporal augmentations with contrastive learning are typically the highest-performing approaches.

| Model | Subcategory | Visual Backbone | Modalities | Pre-Train | UCF101 | HMDB51 |
|---|---|---|---|---|---|---|
| Geometry | Appearance | AlexNet | V | UCF101/HMDB51 | 54.10 | 22.60 |
| Wang et al. | Appearance | C3D | V | UCF101 | 61.20 | 33.40 |
| 3D RotNet | Appearance | 3D R-18 | V | MT | 62.90 | 33.70 |
| VideoJigsaw | Jigsaw | CaffeNet | V | Kinetics | 54.70 | 27.00 |
| 3D ST-puzzle | Jigsaw | C3D | V | Kinetics | 65.80 | 33.70 |
| CSJ | Jigsaw | R(2+3)D | V | Kinetics+UCF101+HMDB51 | 79.50 | 52.60 |
| PRP | Speed | R3D | V | Kinetics | 72.10 | 35.00 |
| SpeedNet | Speed | S3D-G | V | Kinetics | 81.10 | 48.80 |
| Jenni et al. | Speed | R(2+1)D | V | UCF101 | 87.10 | 49.80 |
| PacePred | Speed | S3D-G | V | UCF101 | 87.10 | 52.60 |
| ShuffleLearn | Temporal Order | AlexNet | V | UCF101 | 50.90 | 19.80 |
| OPN | Temporal Order | VGG-M | V | UCF101 | 59.80 | 23.80 |
| O3N | Temporal Order | AlexNet | V | UCF101 | 60.30 | 32.50 |
| ClipOrder | Temporal Order | R3D | V | UCF101 | 72.40 | 30.90 |
| VIE | Clustering | Slowfast | V | Kinetics | 78.90 | 50.1 |
| VIE-2pathway | Clustering | R-18 | V | Kinetics | 80.40 | 52.5 |
| Tokmakov et al. | Clustering | 3D R-18 | V | Kinetics | 83.00 | 50.4 |
| TCE | Temporal Aug. | R-50 | V | UCF101 | 71.20 | 36.6 |
| Lorre et al. | Temporal Aug. | R-18 | V+F | UCF101 | 87.90 | 55.4 |
| CMC-Dual | Spatial Aug. | CaffeNet | V+F | UCF101 | 59.10 | 26.7 |
| SwAV | Spatial Aug. | R-50 | V | Kinetics | 74.70 | -- |
| VDIM | Spatial Aug. | R(2+1)D | V | Kinetics | 79.70 | 49.2 |
| CoCon | Spatial Aug. | R-34 | V+F+K | UCF101 | 82.40 | 53.1 |
| SimCLR | Spatial Aug. | R-50 | V | Kinetics | 84.20 | -- |
| CoCLR | Spatial Aug. | S3D-G | V+F | UCF101 | 90.60 | 62.9 |
| MoCo | Spatial Aug. | R-50 | V | Kinetics | 90.80 | -- |
| BYOL | Spatial Aug. | R-50 | V | Kinetics | 91.20 | -- |
| MIL-NCE | Cross-Modal | S3D-G | V+T | How2 | 91.30 | 61.0 |
| GDT | Cross-Modal | R(2+1)D | V+A | Kinetics | 95.50 | 72.8 |
| CBT | Cross-Modal | S3D-G | V+T | Kinetics | 79.50 | 44.6 |
| VATT | Cross-Modal | Transformer | V+T | AS+How2 | 85.50 | 64.8 |
| AVTS | Cross-Modal | MC3 | V+A | Kinetics | 85.80 | 56.9 |
| AVID+Cross | Cross-Modal | R(2+1)D | V+A | Kinetics | 91.00 | 64.1 |
| AVID+CMA | Cross-Modal | R(2+1)D | V+A | Kinetics | 91.50 | 64.7 |
| MMV-FAC | Cross-Modal | TSM | V+T+A | AS+How2 | 91.80 | 67.1 |
| XDC | Cross-Modal | R(2+1)D | V+A | Kinetics | 95.50 | 68.9 |
| DVIM | Spatio-Temporal Aug. | R-18 | V+F | UCF101 | 64.00 | 29.7 |
| IIC | Spatio-Temporal Aug. | R3D | V+F | Kinetics | 74.40 | 38.3 |
| DSM | Spatio-Temporal Aug. | I3D | V | Kinetics | 78.20 | 52.8 |
| pSimCLR | Spatio-Temporal Aug. | R-50 | V | Kinetics | 87.90 | -- |
| TCLR | Spatio-Temporal Aug. | R(2+1)D | V | UCF101 | 88.20 | 60.0 |
| SeCo | Spatio-Temporal Aug. | R-50 | V | ImageNet | 88.30 | 55.6 |
| pSwaV | Spatio-Temporal Aug. | R-50 | V | Kinetics | 89.40 | -- |
| pBYOL | Spatio-Temporal Aug. | R-50 | V | Kinetics | 93.80 | -- |
| CVRL | Spatio-Temporal Aug. | 3D R-50 | V | Kinetics | 93 | -- |
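
The downstream numbers in these tables come from taking the self-supervised backbone and either fully fine-tuning it or training only a linear classifier on the labelled target set (UCF101, HMDB51, etc.). A minimal sketch of the linear-probe variant is shown below, assuming a frozen, pre-trained `backbone` that maps a clip to a feature vector; data loading, augmentation, and learning-rate scheduling are omitted and the hyper-parameters are illustrative.

```python
import torch
import torch.nn as nn

def linear_probe(backbone, feat_dim, num_classes, loader, epochs=10, lr=1e-3):
    """Train only a linear head on top of frozen self-supervised features."""
    backbone.eval()                              # freeze the pre-trained encoder
    for p in backbone.parameters():
        p.requires_grad_(False)

    head = nn.Linear(feat_dim, num_classes)
    opt = torch.optim.SGD(head.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for clips, labels in loader:             # clips: (B, C, T, H, W)
            with torch.no_grad():
                feats = backbone(clips)          # (B, feat_dim)
            loss = loss_fn(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```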

Downstream evaluation of self-supervised methods on action recognition, measured by prediction accuracy on Something-Something (SS) and Kinetics400 (Kinetics). SS depends more heavily on temporal reasoning and is therefore more challenging. Top scores for each category are in bold and second-best scores are underlined.

| Model | Category | Subcategory | Visual Backbone | Pre-Train | SS | Kinetics |
|---|---|---|---|---|---|---|
| pSwaV | Contrastive | View Aug. | R-50 | Kinetics | 51.7 | 62.7 |
| pSimCLR | Contrastive | View Aug. | R-50 | Kinetics | 52.0 | 62.0 |
| pMoCo | Contrastive | View Aug. | R-50 | Kinetics | 54.4 | 69.0 |
| pBYOL | Contrastive | View Aug. | R-50 | Kinetics | 55.8 | 71.5 |
| BEVT | Generative | MAE | SWIN-B | Kinetics+ImageNet | 71.4 | 81.1 |
| MAE | Generative | MAE | ViT-H | Kinetics | 74.1 | 81.1 |
| MaskFeat | Generative | MAE | MViT | Kinetics | 74.4 | 86.7 |
| VideoMAE | Generative | MAE | ViT-L | ImageNet | 75.3 | 85.1 |
| MotionMAE | Generative | MAE | ViT-B | Kinetics | 75.5 | 81.7 |

Video Retrieval

Performance for the downstream video retrieval task with top scores for each category in bold. K/U/H indicates using all three datasets for pre-training, i.e. Kinetics, UCF101, and HMDB51.

| Model | Category | Subcategory | Visual Backbone | Pre-train | UCF101 R@5 | HMDB51 R@5 |
|---|---|---|---|---|---|---|
| SpeedNet | Pretext | Speed | S3D-G | Kinetics | 28.10 | -- |
| ClipOrder | Pretext | Temporal Order | R3D | UCF101 | 30.30 | 22.90 |
| OPN | Pretext | Temporal Order | CaffeNet | UCF101 | 28.70 | -- |
| CSJ | Pretext | Jigsaw | R(2+3)D | K/U/H | 40.50 | -- |
| PRP | Pretext | Speed | R3D | Kinetics | 38.50 | 27.20 |
| Jenni et al. | Pretext | Speed | 3D R-18 | Kinetics | 48.50 | -- |
| PacePred | Pretext | Speed | R(2+1)D | UCF101 | 49.70 | 32.20 |
| MemDPC-RGB | Generative | Frame Prediction | R(2+3)D | Kinetics | 40.40 | 25.70 |
| MemDPC-Flow | Generative | Frame Prediction | R(2+3)D | Kinetics | 63.20 | 37.60 |
| DSM | Contrastive | Spatio-Temporal | I3D | Kinetics | 35.20 | 25.90 |
| IIC | Contrastive | Spatio-Temporal | R-18 | UCF101 | 60.90 | 42.90 |
| SeLaVi | Cross-Modal | Video+Audio | R(2+1)D | Kinetics | 68.60 | 47.60 |
| CoCLR | Contrastive | View Augmentation | S3D-G | UCF101 | 70.80 | 45.80 |
| GDT | Cross-Modal | Video+Audio | R(2+1)D | Kinetics | 79.00 | 51.70 |

Video Captioning

Downstream evaluation for video captioning on the YouCook2 dataset for video-language models. Top scores are in bold. MM: Masked modeling with video and text, and K/H: Kinetics+HowTo100M.

| Model | Category | Subcategory | Visual | Text | Pre-train | BLEU4 | METEOR | ROUGE | CIDEr |
|---|---|---|---|---|---|---|---|---|---|
| CBT | Cross-Modal | Video+Text | S3D-G | BERT | Kinetics | 5.12 | 12.97 | 30.44 | 0.64 |
| COOT | Cross-Modal | Video+Text | S3D-g | BERT | YouCook2 | 11.30 | 19.85 | 37.94 | -- |
| VideoBert | Generative | MM | S3D-g | BERT | Kinetics | 4.33 | 11.94 | 28.80 | 0.55 |
| ActBERT | Generative | MM | 3D R-32 | BERT | K/H | 5.41 | 13.30 | 30.56 | 0.65 |
| VLM | Generative | MM | S3D-g | BERT | How2 | 12.27 | 18.22 | 41.51 | 1.39 |
| UniVL | Generative | MM | S3D-g | BERT | How2 | 17.35 | 22.35 | 46.52 | 1.81 |

Text-to-Video Retrieval

Performance for the downstream text-to-video retrieval task. Top scores for each category are in bold. Masked modeling (MM) is a generative approach that uses video together with text. Cross-modal agreement covers a variety of contrastive approaches that can use video with audio and/or text; these pre-training approaches typically perform best. Models marked with an asterisk (*) report results after fine-tuning on the target dataset (YouCook2 or MSRVTT). The pre-training dataset labeled COMBO combines CC3M, WV-2M, and COCO.

| Model | Visual | Text | Pre-Train | YouCook2 R@5 | MSRVTT R@5 |
|---|---|---|---|---|---|
| ActBERT | 3D R-32 | BERT | Kinetics+How2 | 26.70 | 23.40 |
| HERO | SlowFast | WordPieces | How2+TV | -- | 43.40 |
| ClipBERT | R-50 | WordPieces | VisualGenome | -- | 46.80 |
| VLM | S3D-g | BERT | How2 | 56.88 | 55.50 |
| UniVL | S3D-g | BERT | How2 | 57.60 | 49.60 |
| Amrani et al. | R-152 | Word2Vec | How2 | -- | 21.30 |
| MIL-NCE | S3D | Word2Vec | How2 | 38.00 | 24.00 |
| COOT | S3D-g | BERT | How2+YouCook2 | 40.20 | -- |
| CE* | Experts | NetVLAD | MSRVTT | -- | 29.00 |
| VideoClip | S3D-g | BERT | How2 | 50.40 | 22.20 |
| VATT | Linear Proj. | Linear Proj. | AS+How2 | -- | -- |
| MEE | Experts | NetVLAD | COCO | -- | 39.20 |
| JPoSE | TSN | Word2Vec | Kinetics | -- | 38.10 |
| Amrani et al.* | R-152 | Word2Vec | How2 | -- | 41.60 |
| AVLnet* | 3D R-101 | Word2Vec | How2 | 55.50 | 50.50 |
| MMT | Experts | BERT | How2 | -- | 14.40 |
| MMT* | Experts | BERT | How2 | -- | 55.70 |
| Patrick et al.* | Experts | T-5 | How2 | 58.50 | -- |
| VideoClip* | S3D-g | BERT | How2 | 62.60 | 55.40 |
| FIT | ViT | BERT | COMBO | -- | 61.50 |

Datasets

| Dataset | Labels | Modalities | Classes | Videos | Tasks |
|---|---|---|---|---|---|
| ActivityNet (ActN) | Activity, Captions, Bounding Box | Video, Video+Text | 200 | 19,995 | Action Recognition, Video Captioning, Video Grounding |
| AVA | Activity, Face Tracks | Video, Video+Audio | 80 | 430 | Action Recognition, Audio-Visual Grounding |
| Breakfast | Activity | Video | 10 | 1,989 | Action Recognition, Action Segmentation |
| Charades | Activity, Objects, Indoor Scenes, Verbs | Video | 157 | 9,848 | Action Recognition, Object Recognition, Scene Recognition, Temporal Action Step Localization |
| COIN | Activity, Temporal Actions, ASR | Video, Video+Text | 180 | 11,827 | Action Recognition, Action Segmentation, Video Retrieval |
| CrossTask | Temporal Steps, Activity | Video | 83 | 4,700 | Temporal Action Step Localization, Action Recognition |
| HMDB51 | Activity | Video | 51 | 6,849 | Action Recognition, Video Retrieval |
| HowTo100M (How2) | ASR | Video+Text | - | 136M | Text-to-Video Retrieval, VideoQA |
| Kinetics | Activity | Video | 400/600/700 | ~0.5M | Action Recognition |
| MSRVTT | Activity, Captions | Video+Text | 20 | 10,000 | Action Recognition, Video Captioning, Video Retrieval, Visual Question Answering |
| MultiThumos | Activity, Temporal Steps | Video | 65 | 400 | Action Recognition, Temporal Action Step Localization |
| UCF101 | Activity | Video | 101 | 13,320 | Action Recognition, Video Retrieval |
| YouCook2 | Captions | Video+Text | 89 | 2,000 | Video Captioning, Video Retrieval |
| YouTube-8M | Activity | Video | 4,716 | 8M | Action Recognition |

Citation

@article{schiappa_survey_ssl_video,
author = {Schiappa, Madeline C. and Rawat, Yogesh S. and Shah, Mubarak},
title = {Self-Supervised Learning for Videos: A Survey},
year = {2023},
issue_date = {December 2023},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {55},
number = {13s},
issn = {0360-0300},
url = {https://doi.org/10.1145/3577925},
doi = {10.1145/3577925},
journal = {ACM Comput. Surv.},
month = {jul},
articleno = {288},
numpages = {37},
}