Spatial Temporal Transformer Network

Introduction

This repository contains the implementation of the model presented in the following papers:

Spatial Temporal Transformer Network for Skeleton-Based Action Recognition, Chiara Plizzari, Marco Cannici, Matteo Matteucci, ArXiv

Spatial Temporal Transformer Network for Skeleton-Based Action Recognition, Chiara Plizzari, Marco Cannici, Matteo Matteucci, Pattern Recognition. ICPR International Workshops and Challenges, 2021, Proceedings

Skeleton-based action recognition via spatial and temporal transformer networks, Chiara Plizzari, Marco Cannici, Matteo Matteucci, Computer Vision and Image Understanding, Volumes 208-209, 2021, 103219, ISSN 1077-3142, CVIU


Visualizations of Spatial Transformer logits

The heatmaps are 25 x 25 matrices, where each row and each column represents a body joint. An element in position (i, j) represents the correlation between joint i and joint j, resulting from self-attention.
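
As a rough illustration of where such a matrix comes from (a minimal sketch, not the repository's code; all dimensions and weights below are made up), scaled dot-product self-attention over the 25 joints produces exactly a 25 x 25 score matrix:

 import numpy as np

 rng = np.random.default_rng(0)
 num_joints, d_k = 25, 64                    # 25 NTU joints, hypothetical key size

 X = rng.standard_normal((num_joints, d_k))  # stand-in per-joint features
 Q = X @ rng.standard_normal((d_k, d_k))     # queries
 K = X @ rng.standard_normal((d_k, d_k))     # keys

 logits = Q @ K.T / np.sqrt(d_k)             # (25, 25) joint-to-joint scores
 attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
 attn /= attn.sum(axis=-1, keepdims=True)    # row-wise softmax
 # attn[i, j] plays the role of the (i, j) correlation visualized above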


Prerequisites

  • Python3
  • PyTorch
  • All the libraries in requirements.txt

Run mode

 python3 main.py 

To switch between training and testing, set the Training flag in /config/st_gcn/nturgbd/train.yaml:

  • Training: True to train
  • Training: False to test
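
For reference, the corresponding excerpt of the config file (only this flag is shown):

 # /config/st_gcn/nturgbd/train.yaml (excerpt)
 Training: True   # set to False for testing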

Data generation

We performed our experiments on three datasets: NTU-RGB+D 60, NTU-RGB+D 120 and Kinetics.

NTU-RGB+D

The data can be downloaded from the NTU-RGB+D website. You only need the 3D skeletons (5.8G for NTU-60 + 4.5G for NTU-120). Once downloaded, generate the joint data for NTU-60 with:

 python3 ntu_gendata.py 

To generate and preprocess the data in a single step, run:

 python3 preprocess.py 

To generate bone data, run:

 python3 ntu_gen_bones.py 
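
The underlying idea, sketched below, is the usual second-order representation: a bone is the vector from a joint to its neighbouring joint. The code is illustrative, not ntu_gen_bones.py line-for-line, and the pair list is a truncated, assumed subset of the NTU skeleton:

 import numpy as np

 # (child, parent) joint indices, 1-based; illustrative subset only
 pairs = [(1, 2), (2, 21), (3, 21), (4, 3)]

 def joints_to_bones(joints, pairs):
     """joints: (C, T, V, M) array of 3D coordinates -> bones of the same shape."""
     bones = np.zeros_like(joints)
     for child, parent in pairs:
         bones[:, :, child - 1] = joints[:, :, child - 1] - joints[:, :, parent - 1]
     return bones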

Joint and bone information can then be merged with:

 python3 ntu_merge_joint_bones.py 
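
Conceptually the merge is a concatenation along the channel axis, which is what produces the 6-channel input used in the "Second order information" configuration below (a sketch with hypothetical file names):

 import numpy as np

 joints = np.load('ntu_joint.npy')  # hypothetical file, shape (3, T, V, M)
 bones = np.load('ntu_bone.npy')    # hypothetical file, shape (3, T, V, M)
 joint_bone = np.concatenate([joints, bones], axis=0)  # shape (6, T, V, M)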

For NTU-120, the samples are split between training and testing differently, so you need to run:

 python3 ntu120_gendata.py 

To generate and preprocess NTU-120 data in a single step, use:

 python3 preprocess_120.py 

Kinetics

Kinetics is a dataset for video action recognition that provides raw video data only. The corresponding skeletons, extracted with OpenPose, are available for download from Google Drive (7.5G). From the raw skeletons, generate the dataset by running:

 python3 kinetics_gendata.py 

Spatial Transformer Stream

The Spatial Transformer is implemented in ST-TR/code/st_gcn/net/spatial_transformer.py. Set in /config/st_gcn/nturgbd/train.yaml:

  • attention: True
  • tcn_attention: False
  • only_attention: True
  • all_layers: False

to run the spatial transformer stream (S-TR stream).
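
The same settings as a config excerpt:

 # /config/st_gcn/nturgbd/train.yaml -- S-TR stream (excerpt)
 attention: True
 tcn_attention: False
 only_attention: True
 all_layers: False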

Temporal Transformer Stream

The Temporal Transformer is implemented in ST-TR/code/st_gcn/net/temporal_transformer.py. Set in /config/st_gcn/nturgbd/train.yaml:

  • attention: False
  • tcn_attention: True
  • only_attention: True
  • all_layers: False

to run the temporal transformer stream (T-TR stream).
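
The corresponding excerpt:

 # /config/st_gcn/nturgbd/train.yaml -- T-TR stream (excerpt)
 attention: False
 tcn_attention: True
 only_attention: True
 all_layers: False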

To merge S-TR and T-TR (ST-TR)

The scores resulting from the S-TR and T-TR streams are combined to produce the final ST-TR score by running:

  python3 ensemble.py 
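
A minimal sketch of this late fusion (the file names and the exact I/O of ensemble.py are assumptions): each stream saves per-sample class scores, which are summed before taking the argmax.

 import numpy as np

 s_tr = np.load('s_tr_scores.npy')  # hypothetical file, shape (N, num_classes)
 t_tr = np.load('t_tr_scores.npy')  # hypothetical file, shape (N, num_classes)

 st_tr = s_tr + t_tr                # combined ST-TR score
 pred = st_tr.argmax(axis=1)        # final prediction per sample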

Adaptive Configuration (AGCN)

To run the T-TR-agcn and ST-TR-agcn configurations, set agcn: True.

Different ST-TR configurations

Set in /config/st_gcn/nturgbd/train.yaml:

  • only_attention: False, to use ST-TR as an augmentation procedure to ST-GCN (refer to Sec. V(E) "Effect of Augmenting Convolution with Self-Attention")
  • all_layers: True, to apply ST-TR on all layers; otherwise it is applied from the 4th layer on (refer to Sec. V(D) "Effect of Applying Self-Attention to Feature Extraction")
  • Set both attention: True and tcn_attention: True to combine SSA and TSA in a single stream (refer to Sec. V(F) "Effect of combining SSA and TSA on one stream")
  • more_channels: True, to assign each head more channels than dk/Nh
  • n: used if more_channels is set to True, to assign each head dk*n/Nh channels

To set the block dimensions of the windowed version of the Temporal Transformer (both option groups are illustrated in the excerpt after this list):

  • dim_block1, dim_block2, dim_block3, to set the block dimension where the output channels are 64, 128 and 256, respectively.
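
An illustrative excerpt combining these options (the numeric values are placeholders, not recommended settings):

 # /config/st_gcn/nturgbd/train.yaml -- optional settings (excerpt)
 only_attention: False  # augment ST-GCN rather than replace convolution
 all_layers: True       # apply ST-TR to all layers
 more_channels: True    # give each head more than dk/Nh channels
 n: 4                   # placeholder: each head gets dk*n/Nh channels
 dim_block1: 10         # placeholder: block size where output channels are 64
 dim_block2: 30         # placeholder: ... 128
 dim_block3: 75         # placeholder: ... 256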

Second order information

Set in /config/st_gcn/nturgbd/train.yaml:

  • channels: 6, since the channel dimension holds both the joint coordinates (3) and the bone coordinates (3)
  • double_channel: True, since this configuration also doubles the channels in each layer.
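
Excerpt:

 # /config/st_gcn/nturgbd/train.yaml -- joint+bone input (excerpt)
 channels: 6           # 3 joint coordinates + 3 bone coordinates
 double_channel: True  # channels doubled in each layer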

Pre-trained Models

Pre-trained models for the configurations presented in the paper are provided in the checkpoint_ST-TR folder. Note that the *bones*.pth checkpoints correspond to models trained with joint+bone information, while the others are trained with joints only.

Citation

Please cite one of the following papers if you use this code for your research:

@article{plizzari2021skeleton,
  title={Skeleton-based action recognition via spatial and temporal transformer networks},
  author={Plizzari, Chiara and Cannici, Marco and Matteucci, Matteo},
  journal={Computer Vision and Image Understanding},
  volume={208},
  pages={103219},
  year={2021},
  publisher={Elsevier}
}
@inproceedings{plizzari2021spatial,
  title={Spatial temporal transformer network for skeleton-based action recognition},
  author={Plizzari, Chiara and Cannici, Marco and Matteucci, Matteo},
  booktitle={Pattern Recognition. ICPR International Workshops and Challenges: Virtual Event, January 10--15, 2021, Proceedings, Part III},
  pages={694--701},
  year={2021},
  organization={Springer}
}

Contact 📌

If you have any questions, do not hesitate to contact me at chiara.plizzari@mail.polimi.it. I will be glad to clarify your doubts!

Note: we include LICENSE, LICENSE_1 and LICENSE_2 in this repository since parts of the code have been derived, respectively, from https://github.com/yysijie/st-gcn, https://github.com/leaderj1001/Attention-Augmented-Conv2d and https://github.com/kenziyuliu/Unofficial-DGNN-PyTorch/blob/master/README.md
