Dimensions of extracted features #53

Open
kaiqiangh opened this issue Nov 14, 2019 · 2 comments

Comments

@kaiqiangh

Hi, thanks for sharing the code.
I have two questions:

  1. I extracted video features with the pre-trained model resnet-34-kinetics-cpu.pth and found that the extracted feature for each segment (16 frames) has 512 dimensions. However, according to your paper, this model should give 512/2 = 256 dimensions after global average pooling. Please correct me if I am wrong.

  2. The pre-trained models you provide include "resnet-34-kinetics-cpu.pth" and "resnext-101-kinetics.pth". Why is the latter file smaller than the former? To my understanding, the latter model should have more trainable parameters (more filters/feature channels).

Looking forward to your reply. Thanks in advance.
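
A rough sketch that may help with the second question: it counts the weights of a single block of each type. The channel widths come from this thread (512 for ResNet-34; F = 1024 and 2F = 2048 for ResNeXt-101). The helper `conv3d_weight_count` is a hypothetical name, the 2F-channel block input is an assumption, and bias and BN parameters are ignored. It compares one block only and does not account for the full checkpoint sizes.

```python
def conv3d_weight_count(in_channels, out_channels, kernel_size, groups=1):
    """Weights in a 3D convolution with a cubic kernel (bias ignored)."""
    if in_channels % groups or out_channels % groups:
        raise ValueError("channel counts must be divisible by groups")
    # Each output channel only sees in_channels / groups input channels.
    return out_channels * (in_channels // groups) * kernel_size ** 3


# Basic block at width 512: two dense 3x3x3 convolutions.
basic_block = 2 * conv3d_weight_count(512, 512, 3)

# ResNeXt bottleneck with F = 1024; a 2F-channel input is assumed.
F = 1024
resnext_block = (
    conv3d_weight_count(2 * F, F, 1)             # 1x1x1, reduce to F
    + conv3d_weight_count(F, F, 3, groups=32)    # 3x3x3, grouped
    + conv3d_weight_count(F, 2 * F, 1)           # 1x1x1, expand to 2F
)

print(basic_block)    # 14155776
print(resnext_block)  # 5079040
```

Under these assumptions the grouped 3x3x3 convolution is 32 times cheaper than a dense one of the same width, so a wider bottleneck block can still hold fewer weights than a basic block built from dense 3x3x3 convolutions.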

@kaiqiangh

I also checked "resnext-101-kinetics.pth": the output feature dimension is 2048. Is that measured after global average pooling or before it?

@kaiqiangh

kaiqiangh commented Nov 18, 2019

I found something. In the ResNeXt-101 model, each block seems to be (conv, 1x1x1, F) --> (conv, 3x3x3, F, group=32) --> (conv, 1x1x1, 2F), ignoring BN and ReLU. The output of the downsample block is 1024-D (F=1024). However, it is followed by one more conv layer that doubles this dimension, so the final output before the global average pooling (GAP) layer has 2F (2*1024) channels in this case. GAP does not change the feature dimension.

So, when extracting video features with the ResNeXt-101 model, the output after GAP is 2048-D.
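
To illustrate why GAP leaves the feature dimension unchanged, here is a minimal pure-Python sketch. The nested-list layout `[C][T][H][W]`, the helper names, and the 1x4x4 map size are assumptions for illustration and are not taken from the repository.

```python
def global_average_pool(feature_map):
    """Average each channel of a [C][T][H][W] nested list to one number."""
    pooled = []
    for channel in feature_map:
        values = [v for frame in channel for row in frame for v in row]
        pooled.append(sum(values) / len(values))
    return pooled


def make_feature_map(channels, t, h, w):
    """Dummy feature map in which channel c is filled with the value c."""
    return [
        [[[float(c)] * w for _ in range(h)] for _ in range(t)]
        for c in range(channels)
    ]


F = 1024
fmap = make_feature_map(2 * F, 1, 4, 4)  # 2F channels before GAP
features = global_average_pool(fmap)

print(len(fmap))      # 2048 channels before pooling
print(len(features))  # 2048 values after pooling
```

Only the temporal and spatial axes are averaged away, so the length of the pooled vector equals the channel count of the last conv layer.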

One more thing: the author's README says, "In the feature mode, this code outputs features of 512 dims (after global average pooling) for each 16 frames." This applies to the basic ResNet version, regardless of the number of layers.
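
Put as a small sketch: the feature dimension is the width of the last stage times the expansion factor of the block type. The table `LAST_STAGE` and the function `feature_dim` are hypothetical names, and the values are the ones discussed in this thread (basic blocks keep the width, the ResNeXt block doubles F = 1024).

```python
# Hypothetical lookup: (last-stage width, block expansion factor).
LAST_STAGE = {
    "resnet-18": (512, 1),     # basic block, no expansion
    "resnet-34": (512, 1),     # basic block, no expansion
    "resnext-101": (1024, 2),  # bottleneck ending in a 1x1x1 conv to 2F
}


def feature_dim(model_name):
    """Channels entering GAP, which is also the extracted feature size."""
    width, expansion = LAST_STAGE[model_name]
    return width * expansion


for name in LAST_STAGE:
    print(name, feature_dim(name))
```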

Please correct me if I am wrong. Thanks.
