This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Need Help, very bad performance with my own dataset #130

Open
thunder95 opened this issue Mar 8, 2021 · 8 comments

@thunder95

Hello, and many thanks in advance for this great project.

When I tried this project with my own dataset, it showed very bad performance. The task is to classify body rotation (at least 3 turns to the left or right) [A] versus any other body pose [B].

I followed the tutorial in the README step by step, and the accuracy jumped to 100% on both train and validation almost immediately. When I checked with run_custom_classifier.py, the model output A whenever my body made even a very tiny movement, so B appeared quite rarely.

The data samples for both labels are almost balanced. The average length of the video clips is around 10 seconds.

What could be the root of my problem, and how can I fix it? Which parameters need to be modified in this case?

Thank you guys.

@thunder95
Author

Possibly this work is not suited to long-range video classification. Each video clip consists of only 4 consecutive frames, which cannot resolve a long-range action task. If I am misunderstanding something, please give me more hints.

@corneliusboehm
Contributor

Hi @thunder95, thanks a lot for your interest in our project!

What we usually do is add a neutral background class, which in your case would be "no rotation". This can be especially helpful when testing the model live, because not every input will clearly fall into one of your two pre-defined classes. Let me know if that helps.
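
For illustration, a dataset layout with such a background class could look like this (the folder names here are hypothetical):

videos_train
├── no_rotation    <- neutral background class
├── turn_left
└── turn_right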

Concerning the length of the video clips: our model has a temporal field of a few seconds, so it can detect actions that span some time. It often also needs a few frames to make an accurate prediction. What exactly do you mean by "The video clip consists of only 4 consecutive frames"? Your training clips were 10 seconds long, right?

@thunder95
Author

Hello @corneliusboehm, no rotation (any other body pose) has already been used as this "neutral background class"; perhaps my problem description confused you. This approach did not help. Here is my project layout:
.
├── checkpoints
│   ├── best_classifier.checkpoint
│   ├── confusion_matrix.npy
│   ├── confusion_matrix.png
│   ├── label2int.json
│   └── last_classifier.checkpoint
├── features_train_num_layers_to_finetune=9
│   ├── norm
│   └── turn_chair
├── features_valid_num_layers_to_finetune=9
│   ├── norm
│   └── turn_chair
├── project_config.json
├── videos_train
│   ├── norm
│   └── turn_chair
└── videos_valid
    ├── norm
    └── turn_chair

With "The video clip consist of only 4 consecutive frames", I mean the step_size or model_fps, which is the frame amount feed into the classifier. That also means at each time only 4 consecutive frames determines the classification result, so I dont think in this way long range video task could be handled.

I also checked that even when the number of extracted features (the first dimension of the .npy data) was more than 50, only 5 of them were used as the input for training the classifier.
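
To make my point concrete, here is a sketch of what I believe happens (the file name and the contiguous-window assumption are mine, not taken from the project code):

import numpy as np

# One clip's extracted features, shape (N, C) with N > 50 in my case.
# The file name below is made up for illustration.
features = np.load("features_train_num_layers_to_finetune=9/turn_chair/clip_0001.npy")

num_timesteps = 5  # value from the training configuration
# Apparently only a window of num_timesteps rows is used per sample,
# so the remaining N - num_timesteps feature rows are effectively discarded.
start = np.random.randint(0, features.shape[0] - num_timesteps + 1)
window = features[start:start + num_timesteps]  # shape (5, C)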

At this moment I have not tried temporal annotations yet.

@thunder95
Author

No better results when I test it live with my USB camera.

@corneliusboehm
Contributor

Thanks for the clarification on the problem description. Could you maybe share an example clip of what you're trying to classify, so we can get an even better idea?

Our model consumes frames at 16 fps while outputting predictions at 4 fps, which means that a new prediction is made after every four input frames. However, previous frames (or their features) are kept in memory for some time and used for future predictions, so the model is very much able to classify actions that span a larger number of input frames. How long do you want your "long range video task" to be?
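
For intuition, here is the back-of-the-envelope arithmetic behind those numbers (the constants come from this comment; everything else is illustrative):

# 16 fps in, predictions out at 4 fps -> one new prediction every 4 input frames.
INPUT_FPS = 16
PREDICTION_FPS = 4

frames_per_prediction = INPUT_FPS // PREDICTION_FPS  # 4 frames
seconds_per_prediction = 1 / PREDICTION_FPS          # 0.25 s

# A new prediction every 0.25 s does not mean the model only "sees" 0.25 s:
# features of earlier frames stay in memory and feed into later predictions.
print(frames_per_prediction, seconds_per_prediction)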

@thunder95
Author

@corneliusboehm Hi, thanks for your kind and fast reply.

By "long-range video task", I mean 10 seconds at 30 fps, i.e. about 300 frames. After sampling the video uniformly, I got 160 frames. In the training phase, only 5 frames (or features) are fed into training, because num_timesteps equals 5. So, if I understand correctly, the other frames or features are discarded and only 5 are kept. Should I enlarge num_timesteps?
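
Here is the rough arithmetic behind my question, using the 16 fps / 4 predictions-per-second figures you quoted (this ignores the backbone's own temporal receptive field):

# Each feature timestep corresponds to 4 new input frames at 16 fps.
SECONDS_PER_TIMESTEP = 4 / 16  # 0.25 s

num_timesteps = 5
print(num_timesteps * SECONDS_PER_TIMESTEP)  # 1.25 s of new frames per training sample

target_seconds = 10  # length of my "rotate chair" clips
print(target_seconds / SECONDS_PER_TIMESTEP)  # 40 timesteps needed to span 10 s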

Regarding "previous frames (or their features) will still be kept in memory for some time": I only saw a prediction cache in the buffer of the PostProcesser. If I set smoothing=4, then the buffer size should be 4. What am I missing there?
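
For reference, this is how I understand that buffer to behave (a minimal sketch in my own words, not the project's actual PostProcesser code; all names are illustrative):

from collections import deque

import numpy as np

class SmoothingBuffer:
    # Keep only the last `smoothing` prediction vectors and average them.
    def __init__(self, smoothing=4):
        self.buffer = deque(maxlen=smoothing)  # only 4 predictions retained

    def update(self, probs):
        self.buffer.append(np.asarray(probs, dtype=float))
        return np.mean(self.buffer, axis=0)  # smoothed class probabilities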

My current task is to detect a forbidden behavior in the office, namely "rotating a chair at least 3 times". I will also work on some similar tasks soon (such as writing and violence detection). The sample videos are attached below.

So the video for this specific task spans 8-10 seconds. The training dataloader takes only num_timesteps = 5 steps at a time, so I think it won't detect long-range sequences (such as 80 frames, i.e. 5 seconds). The inference clip takes only 4 frames, and even with an additional buffer of 4, that still cannot handle this task.

sample_video.zip

corneliusboehm self-assigned this Mar 10, 2021
@atheeraa

This is unrelated, but I'm trying to train a model with my own dataset and I'm unable to get the pretrained weights from Jester because the link is down. Is there any way you could please share the pretrained weights with me?

@linnaok

linnaok commented Nov 2, 2022

Hi, I'm new to this project. I do not have the weights and I am unable to visit the site provided for downloading them. Could you please share the weights with me? Thank you very much!
