
Temporal Span Proposal Network for Video Relation Detection

⚠️ WARNING
This repo is not done yet...

Implementation of "What and When to Look?: Temporal Span Proposal Network for Video Visual Relation Detection"

Identifying relations between objects is central to understanding a scene. While several works have addressed relation modeling in the image domain, the video domain poses additional challenges due to the dynamics of spatio-temporal interactions (e.g., between which objects is there an interaction? when does a relation start and end?). To date, two representative approaches have been proposed to tackle Video Visual Relation Detection (VidVRD): segment-based and window-based. We first point out the limitations of these two approaches and propose the Temporal Span Proposal Network (TSPN), a novel method with two advantages in terms of efficiency and effectiveness. 1) TSPN tells what to look at: it sparsifies the relation search space by scoring the relationness (i.e., a confidence score for the existence of a relation between a pair of objects) of each object pair. 2) TSPN tells when to look: it leverages the full video context to simultaneously predict the temporal spans and categories of all relations. TSPN demonstrates its effectiveness by achieving a new state of the art by a significant margin on two VidVRD benchmarks (ImageNet-VidVRD and VidOR) while also showing lower time complexity than existing methods - in particular, twice as efficient as a popular segment-based approach.
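
The gist of these two components can be illustrated with a toy PyTorch sketch. This is not the authors' implementation; layer sizes, the keep ratio, and the head designs below are placeholders.

import torch
import torch.nn as nn

class ToyTSPNHeads(nn.Module):
    """Toy sketch of TSPN's two decisions: *what* to look at (relationness
    scoring of object pairs) and *when* to look (temporal span + predicate).
    Layer sizes and head designs are placeholders, not the paper's model."""
    def __init__(self, pair_dim=512, num_predicates=132):  # e.g., 132 predicate classes in ImageNet-VidVRD
        super().__init__()
        self.relationness = nn.Sequential(nn.Linear(pair_dim, 256), nn.ReLU(),
                                          nn.Linear(256, 1))
        self.span_head = nn.Sequential(nn.Linear(pair_dim, 256), nn.ReLU(),
                                       nn.Linear(256, 2), nn.Sigmoid())  # normalized (start, end)
        self.predicate_head = nn.Linear(pair_dim, num_predicates)

    def forward(self, pair_feats, keep_ratio=0.25):
        # pair_feats: (num_pairs, pair_dim) video-level features, one row per object pair.
        scores = self.relationness(pair_feats).squeeze(-1)     # "what to look at"
        k = max(1, int(keep_ratio * pair_feats.size(0)))
        keep = scores.topk(k).indices                          # sparsified pair set
        kept = pair_feats[keep]
        spans = self.span_head(kept)                           # "when to look"
        predicates = self.predicate_head(kept)                 # relation category logits
        return keep, scores, spans, predicates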

Video Visual Relation Detection (VidVRD) & Video Object Relation (VidOR)

  • Dataset & Annotations (VidVRD & VidOR)
  • Preprocessing (videos -> frames)
  • Annotations -> COCO format
  • Object Detection w/ Faster R-CNN
  • Multi Object Tracking w/ DeepSORT
  • Relation Detection Baseline

Dataset & Annotations

  • The VidVRD dataset can be found here.
  • The VidOR dataset can be found here.

Preprocessing videos into frames

  • For VidVRD, use vidvrd_to_image.sh (will take about 1 hour).
  • For VidOR, use vidor_to_image.sh (will take about 7-8 hours).
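
The shell scripts above do the actual extraction. For reference, a minimal Python sketch of the same idea (assuming OpenCV is installed, and hypothetical videos/ and frames/ directories) could look like:

import os
import cv2  # pip install opencv-python

def video_to_frames(video_path, out_dir):
    """Dump every frame of a video as a zero-padded JPEG (illustrative only)."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(os.path.join(out_dir, f"{idx:06d}.jpg"), frame)
        idx += 1
    cap.release()
    return idx

# Example: mirror the videos/ tree under frames/, one folder per video.
for name in os.listdir("videos"):
    if name.endswith(".mp4"):
        video_to_frames(os.path.join("videos", name),
                        os.path.join("frames", os.path.splitext(name)[0]))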

Annotation format

The VidVRD JSON file contains a dictionary structured as follows:

{
    "video_id": "ILSVRC2015_train_00010001",        # Video ID from the original ImageNet ILSVRC2016 video dataset
    "frame_count": 219,
    "fps": 30, 
    "width": 1920, 
    "height": 1080, 
    "subject/objects": [                            # List of subject/objects
        {
            "tid": 0,                               # Trajectory ID of a subject/object
            "category": "bicycle"
        }, 
        ...
    ], 
     "trajectories": [                              # List of frames
        [                                           # List of bounding boxes in each frame
            {
                "tid": 0,                       
                "bbox": {
                    "xmin": 672,                    # left
                    "ymin": 560,                    # top
                    "xmax": 781,                    # right
                    "ymax": 693                     # bottom
                }, 
                "generated": 0,                     # 0 - the bounding box is manually labeled
                                                    # 1 - the bounding box is automatically generated
            }, 
            ...
        ],
        ...
    ],
    "relation_instances": [                         # List of annotated visual relation instances
        {
            "subject_tid": 0,                       # Corresponding trajectory ID of the subject
            "object_tid": 1,                        # Corresponding trajectory ID of the object
            "predicate": "move_right", 
            "begin_fid": 0,                         # Frame index where this relation begins (inclusive)
            "end_fid": 210                          # Frame index where this relation ends (exclusive)
        }, 
        ...
    ]
}

The VidOR JSON file contains a dictionary structured as follows (the format is identical to VidVRD's except for the additional "version", "video_hash", and "video_path" fields, plus the per-box "tracker" field):

{
    "version": "VERSION 1.0",
    "video_id": "5159741010",                       # Video ID in YFCC100M collection
    "video_hash": "6c7a58bb458b271f2d9b45de63f3a2", # Video hash offically used for indexing in YFCC100M collection 
    "video_path": "1025/5159741010.mp4",            # Relative path name in this dataset
    "frame_count": 219,
    "fps": 29.97002997002997, 
    "width": 1920, 
    "height": 1080, 
    "subject/objects": [                            # List of subject/objects
        {
            "tid": 0,                               # Trajectory ID of a subject/object
            "category": "bicycle"
        }, 
        ...
    ], 
    "trajectories": [                               # List of frames
        [                                           # List of bounding boxes in each frame
            {                                       # The bounding box at the 1st frame
                "tid": 0,                           # The trajectory ID to which the bounding box belongs
                "bbox": {
                    "xmin": 672,                    # Left
                    "ymin": 560,                    # Top
                    "xmax": 781,                    # Right
                    "ymax": 693                     # Bottom
                }, 
                "generated": 0,                     # 0 - the bounding box is manually labeled
                                                    # 1 - the bounding box is automatically generated by a tracker
                "tracker": "none"                   # If generated=1, it is one of "linear", "kcf" and "mosse"
            }, 
            ...
        ],
        ...
    ],
    "relation_instances": [                         # List of annotated visual relation instances
        {
            "subject_tid": 0,                       # Corresponding trajectory ID of the subject
            "object_tid": 1,                        # Corresponding trajectory ID of the object
            "predicate": "in_front_of", 
            "begin_fid": 0,                         # Frame index where this relation begins (inclusive)
            "end_fid": 210                          # Frame index where this relation ends (exclusive)
        }, 
        ...
    ]
}
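
For illustration only (not part of this repository), the following Python snippet shows one way to read such an annotation file: it maps trajectory IDs to categories and collects the subject's boxes over each relation's temporal span. The annotation path is hypothetical.

import json

# Hypothetical path; any VidVRD/VidOR annotation file has the structure above.
with open("annotations/ILSVRC2015_train_00010001.json") as f:
    anno = json.load(f)

# Map trajectory ID -> object category.
tid_to_category = {obj["tid"]: obj["category"] for obj in anno["subject/objects"]}

for rel in anno["relation_instances"]:
    subj = tid_to_category[rel["subject_tid"]]
    obj = tid_to_category[rel["object_tid"]]
    # Collect the subject's boxes over the relation's temporal span
    # (begin_fid is inclusive, end_fid is exclusive).
    subj_boxes = []
    for fid in range(rel["begin_fid"], rel["end_fid"]):
        frame = anno["trajectories"][fid] if fid < len(anno["trajectories"]) else []
        for box in frame:
            if box["tid"] == rel["subject_tid"]:
                b = box["bbox"]
                subj_boxes.append((b["xmin"], b["ymin"], b["xmax"], b["ymax"]))
    print(f'<{subj}, {rel["predicate"]}, {obj}>: '
          f'frames [{rel["begin_fid"]}, {rel["end_fid"]}), {len(subj_boxes)} subject boxes')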

Object Detector

We use an off-the-shelf Faster R-CNN with a ResNet-101 backbone.

  • ⚠️ The VidVRD/VidOR annotations must first be converted into COCO format (see the sketch after this list)!
    • For VidVRD, use vidvrd_anno_to_coco_format.py (will take about a minute).
    • For VidOR, use vidor_anno_to_coco_format.py (will take a few minutes).
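
The sketch below illustrates the kind of flattening such a conversion performs: each annotated frame becomes a COCO image entry and each bounding box becomes a COCO annotation entry. It is only a rough sketch, not the repository's script; the frame file naming and the category_to_id mapping are assumptions.

def vid_anno_to_coco(anno, category_to_id, image_id_offset=0):
    """Flatten one video annotation (dict as above) into COCO-style lists (illustrative)."""
    images, annotations = [], []
    ann_id = 0
    tid_to_category = {o["tid"]: o["category"] for o in anno["subject/objects"]}
    for fid, frame in enumerate(anno["trajectories"]):
        image_id = image_id_offset + fid
        images.append({
            "id": image_id,
            "file_name": f"{anno['video_id']}/{fid:06d}.jpg",  # assumed frame naming
            "width": anno["width"],
            "height": anno["height"],
        })
        for box in frame:
            x, y = box["bbox"]["xmin"], box["bbox"]["ymin"]
            w = box["bbox"]["xmax"] - box["bbox"]["xmin"]
            h = box["bbox"]["ymax"] - box["bbox"]["ymin"]
            annotations.append({
                "id": ann_id,
                "image_id": image_id,
                "category_id": category_to_id[tid_to_category[box["tid"]]],
                "bbox": [x, y, w, h],           # COCO uses [x, y, width, height]
                "area": w * h,
                "iscrowd": 0,
            })
            ann_id += 1
    return images, annotations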

Multi Object Tracker

We use DeepSORT for multi-object tracking.
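
DeepSORT combines a Kalman filter with appearance embeddings for association. The toy sketch below is not DeepSORT itself; it only illustrates the general idea of linking per-frame detections into tid-indexed tracklets via greedy IoU matching.

def iou(a, b):
    """IoU of two [xmin, ymin, xmax, ymax] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def link_detections(frames, iou_thr=0.5):
    """frames: list of per-frame box lists -> dict mapping tid to a list of (fid, box)."""
    tracks, next_tid = {}, 0
    for fid, boxes in enumerate(frames):
        for box in boxes:
            # Greedily attach to the track whose most recent box overlaps best.
            best_tid, best_iou = None, iou_thr
            for tid, hist in tracks.items():
                last_fid, last_box = hist[-1]
                if last_fid == fid - 1 and iou(last_box, box) > best_iou:
                    best_tid, best_iou = tid, iou(last_box, box)
            if best_tid is None:            # no match: start a new tracklet
                best_tid, next_tid = next_tid, next_tid + 1
                tracks[best_tid] = []
            tracks[best_tid].append((fid, box))
    return tracks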

Relation Detection & Relation Tagging

  • Train: sh train.sh
  • Evaluation: sh eval.sh

Citation

@article{woo2021and,
  title={{What and When to Look?: Temporal Span Proposal Network for Video Visual Relation Detection}},
  author={Woo, Sangmin and Noh, Junhyug and Kim, Kangil},
  journal={arXiv preprint arXiv:2107.07154},
  year={2021}
}

Acknowledgement

We greatly appreciate the nicely organized code of VidVRD-helper, detectron2, and DeepSORT. Our codebase is largely built upon them.
