
Record some experiment results here

Pipeline

Preprocessing

Download dataset: UCF101 http://crcv.ucf.edu/data/UCF101.php
Extract frames from each video at 5 FPS, downsample the resolution, and discard videos with too few frames
Split each video's frames into L equal-size blocks (block size = frame count / sequence length L). Randomly select one frame from each block to compose a video clip of length L
Load videos in batches and apply mean subtraction to each video
Feed the preprocessed videos into a CNN and train an LSTM-based RNN on its output
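
The block sampling and per-video mean subtraction steps above can be sketched roughly as follows. This is a minimal NumPy illustration, not the code used in these experiments: the function names, the dummy video shape, and the choice of a per-channel mean are all assumptions.

```python
import numpy as np


def sample_segment_indices(num_frames, seq_len, rng):
    """Pick one random frame index from each of seq_len equal-size blocks."""
    if num_frames < seq_len:
        raise ValueError("video has too few frames for the requested clip length")
    block = num_frames // seq_len            # block size = frame count / L
    starts = np.arange(seq_len) * block      # first frame index of each block
    offsets = rng.integers(0, block, size=seq_len)
    # Frames past block * seq_len are never sampled in this sketch.
    return starts + offsets


def subtract_video_mean(clip):
    """Subtract the clip's own mean (assumed here: one mean per channel)."""
    clip = clip.astype(np.float32)
    mean = clip.mean(axis=(0, 1, 2), keepdims=True)
    return clip - mean


rng = np.random.default_rng(0)
# Hypothetical video: 53 frames of 8x8 RGB, standing in for extracted frames.
video = rng.integers(0, 256, size=(53, 8, 8, 3), dtype=np.uint8)
idx = sample_segment_indices(len(video), seq_len=10, rng=rng)
clip = subtract_video_mean(video[idx])
```

Because the indices are redrawn on every call, the same video can yield a different clip each epoch, which also acts as a mild form of augmentation.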

Experiments

Try ConvLSTM, training from scratch. Training is too slow and seems to have no effect: acc stays around 0.01, the probability of a random guess.
Separate Inception and the LSTM, and train only the LSTM. Loss stops dropping after 100 epochs.
Train Inception on video frame data, then train only the LSTM on the output of Inception. The loss does not decrease, so try data normalization (/255). This still fails; our guess is that the CNN can only recognize the rough outline of an object and cannot tell the small differences in what that object is doing.
Use a small CNN and RNN and train them separately. To prevent overfitting, add a regularizer to the FC layer, add a dropout layer with a 0.5 drop rate, and set a checkpoint that saves the weights whenever validation accuracy reaches a new high. Training the small CNN for hundreds of epochs gives a training acc of 0.98 and a testing acc of 0.28. Still heavy overfitting. Training the RNN, training acc stops at 0.33 and validation acc stops at 0.17
Build a more complex CNN model according to (two stream ...) and try extracting all frames to get more data, but the data does not fit on disk.
Simply combining ResNet as a feature extractor with a one-layer LSTM gives a val acc of 0.67
Regenerate the data with mean subtraction and normalization. Fine-tune ResNet with one more FC layer and get a val acc of 0.59. Combining the fine-tuned ResNet with an LSTM shows no improvement. This suggests that the RNN is not very useful for action recognition here, since it only keeps incoming information in its state and does not perform explicit operations on the incoming sequence. In other words, action recognition may not depend much on long-term dependencies; instead, it needs more explicit information such as optical flow
Try a ConvNet with optical flow as input and get a val acc of 0.21. Using continuous sequential data, a higher dropout rate (0.5 to 0.9), and recalculating the optical flow data when val loss stops dropping gives a higher val acc of 0.42
Fine-tune ResNet50 on all video frames and get a val acc of 0.65
Fine-tune ResNet50 on 3-channel optical flow and get a val acc of 0.55
Run two-stream using the spatial and temporal fine-tuned ResNets and get a val acc of 0.59
Run two stream using spatial ResNet and temporal CNN and get val acc of
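
These notes do not record how the spatial and temporal streams were combined. One common option is late fusion, where the class probabilities of the two streams are averaged. The sketch below assumes that approach; the function names, the `temporal_weight` parameter, and the random logits standing in for real network outputs are all hypothetical.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def fuse_two_stream(spatial_logits, temporal_logits, temporal_weight=0.5):
    """Weighted average of the per-stream class probabilities (late fusion)."""
    p_spatial = softmax(spatial_logits)
    p_temporal = softmax(temporal_logits)
    return (1.0 - temporal_weight) * p_spatial + temporal_weight * p_temporal


rng = np.random.default_rng(0)
# Placeholder outputs for 4 clips over the 101 UCF101 classes.
spatial = rng.normal(size=(4, 101))
temporal = rng.normal(size=(4, 101))
fused = fuse_two_stream(spatial, temporal)
pred = fused.argmax(axis=-1)  # predicted class index per clip
```

If one stream is clearly weaker on validation data, `temporal_weight` can be tuned on the validation set instead of fixed at 0.5.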

Methods to try:

Data augmentation (random sampling, flipping, and jittering); generate differently preprocessed data (done)
Using continuous frames as input (done)
Multi-task learning: combine different datasets using two separate softmax layers and shared weights
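
As an illustration of the flipping and jittering augmentation listed above, here is a minimal sketch that applies the same random crop and horizontal flip to every frame of a clip. Reading "jittering" as a random spatial crop is an assumption, and `augment_clip`, the crop size, and the dummy clip are hypothetical.

```python
import numpy as np


def augment_clip(clip, crop_size, rng):
    """Apply one random crop and one random horizontal flip to a whole clip.

    clip has shape (frames, height, width, channels). The same crop offset and
    flip decision are used for all frames so the motion stays consistent.
    """
    _, h, w, _ = clip.shape
    top = rng.integers(0, h - crop_size + 1)
    left = rng.integers(0, w - crop_size + 1)
    out = clip[:, top:top + crop_size, left:left + crop_size, :]
    if rng.random() < 0.5:
        out = out[:, :, ::-1, :]  # reverse the width axis
    return out


rng = np.random.default_rng(0)
clip = rng.random((10, 32, 32, 3))  # dummy clip: 10 frames of 32x32 RGB
augmented = augment_clip(clip, crop_size=28, rng=rng)
```

When the input is optical flow rather than RGB frames, a horizontal flip also requires negating the horizontal flow component, which this sketch does not do.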