Video Transformer Network (https://arxiv.org/abs/2102.00719) #388

bomri · 2021-03-24T09:38:55Z

VTN model setup
add the ability to return the entire video
add support to return the frames index
update defaults
VIT_B_VTN.yaml
adjusting the if-else in pack_pathway_output
VTN README.md + update main README.md + update MODEL_ZOO

bomri · 2021-03-31T17:31:49Z

Hi @feichtenhofer, we recently published our work on video action recognition using Transformers (https://arxiv.org/abs/2102.00719). As PySlowFast aims to provide novel research implementations in this domain, we modified our codebase and models to make them available via this repository. We'd appreciate it if you could consider merging our pull request; we think it would be great to share it here with the community.

devksingh4 · 2021-03-31T17:34:29Z

+1, we would also appreciate the inclusion of this model in PySlowFast.

Isminoula · 2021-03-31T17:52:23Z

+1, it would be great to have this model as a backbone for experiments. Thank you!

feichtenhofer · 2021-04-16T00:35:16Z

Hi @bomri, thanks for this pull request; glad PySlowFast is helping your research. We would need to do a careful review before merging this, because it adds nontrivial overhead to the main logic, especially as it adds several functionalities and configurations to the core PySF code.

Generally, we would prefer that you use a fork, which we can then link to, similar to how external projects are linked in detectron2: https://github.com/facebookresearch/detectron2/tree/master/projects#external-projects.

Related to this, we will be updating the codebase with some ViT baselines from a concurrent work around next week, which should hopefully provide one more base for future work on video transformers.

I'm adding @haooooooqi here for further help on this pull request.

bomri · 2021-04-25T06:45:08Z

Thank you @feichtenhofer for your response.
We tried to keep the changes to the minimum needed to support our approach and only added missing functionality, such as processing the full video at inference and fetching the relevant frame indices for the positional embedding.
If we can make any adjustments, please let me know.
If you prefer using external projects, could you please link our fork at https://github.com/bomri/SlowFast?
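
To illustrate the two additions described above (returning the entire video and returning the frame indices), here is a minimal sketch. It is not the code from this pull request: `sample_frame_indices`, `to_position_ids`, and the `max_positions` parameter are hypothetical names chosen for illustration, and clamping out-of-range indices is an assumption rather than the VTN behavior.

```python
from typing import List, Optional


def sample_frame_indices(total_frames: int,
                         num_samples: Optional[int] = None) -> List[int]:
    """Return the original indices of the frames to feed the model.

    Hypothetical helper: num_samples=None means "use the entire video",
    otherwise frames are spread uniformly from the first to the last frame.
    """
    if total_frames <= 0:
        raise ValueError("total_frames must be positive")
    if num_samples is None or num_samples >= total_frames:
        return list(range(total_frames))
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    if num_samples == 1:
        return [0]
    step = (total_frames - 1) / (num_samples - 1)
    return [round(i * step) for i in range(num_samples)]


def to_position_ids(frame_indices: List[int], max_positions: int) -> List[int]:
    """Map frame indices to positional-embedding ids.

    Assumption: indices beyond the embedding table are clamped to its last row.
    """
    return [min(i, max_positions - 1) for i in frame_indices]


# Training-style sparse sampling versus full-video inference.
sparse = sample_frame_indices(total_frames=10, num_samples=4)
full = sample_frame_indices(total_frames=5)
ids = to_position_ids(sparse, max_positions=8)
```

Keeping the original frame index alongside each sampled frame lets a temporal positional embedding reflect where the frame sits in the video, rather than its position in the sampled batch.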

facebook-github-bot added the CLA Signed label · Mar 24, 2021
