
cyclic window shifting in the (256,256) tensor #47

Open
tsw123tsw opened this issue Sep 13, 2023 · 1 comment
tsw123tsw commented Sep 13, 2023

Hi,
Awesome repo. I have a question about token interaction in the architecture. Don't you think the way HTS-AT builds the (256, 256) tensor from the (1024, 64) spectrogram causes problematic token interaction during cyclic window shifting?
As I understand it, the (1024, 64) spectrogram is cut into 4 PIECES of (256, 64) along dim=0, and these 4 pieces are then concatenated along dim=1 to give the final (256, 256) tensor. When you apply a cyclic window shift to this tensor, a window can end up containing tokens from two different PIECES, i.e. some from the high-frequency region of PIECE 1 and some from the low-frequency region of PIECE 2.
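The reshape described above can be sketched in a few lines (a minimal NumPy illustration, not the repo's actual code; the array values and the shift size of 4 are made up for the example):

```python
import numpy as np

# Sketch of the reshape described above: cut the (1024, 64) spectrogram
# into 4 pieces of (256, 64) along the time axis (dim=0), then
# concatenate them along the frequency axis (dim=1) to get a square
# (256, 256) tensor.
spec = np.arange(1024 * 64).reshape(1024, 64)
pieces = np.split(spec, 4, axis=0)          # 4 pieces, each (256, 64)
square = np.concatenate(pieces, axis=1)     # (256, 256)

# A cyclic window shift (as in Swin) rolls the tensor. Piece boundaries
# sit at columns 64, 128, 192; after the roll, a window straddling such
# a boundary mixes the last columns of one piece with the first columns
# of the next -- the cross-piece interaction raised above.
shifted = np.roll(square, shift=(-4, -4), axis=(0, 1))
print(square.shape, shifted.shape)
```

Here aligned (unshifted) windows whose size divides 64 never cross a piece boundary; only the shifted partition does, which is exactly the case in question.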

@RetroCirce
Owner

Hi,

Sorry for the late reply, I was busy with other stuff this quarter.
You raise a good question: does cyclic window shifting cause overlapping or wrong mixing of features across pieces during the downsampling process?

My answer is no. The reason is that we downsample by 2 at each stage, three stages in total, for an overall rate of 2 x 2 x 2 = 8. At this rate, features "at the edge" of each (256, 64) piece never share wrong information with another piece, because a downsample rate of 8 is nowhere near large enough to compress (256, 64) down to (1, 1). I took this into account when I designed the project, which is why I don't think it is a problem.
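The arithmetic behind that answer can be checked directly (a toy sketch; the three factor-of-2 patch-merging stages are taken from the description above):

```python
# Each patch-merging stage halves both spatial dimensions; with three
# stages the overall downsample rate is 2 * 2 * 2 = 8 per axis.
h, w = 256, 64              # token grid of one spectrogram piece
for stage in range(3):
    h, w = h // 2, w // 2   # one downsampling stage
print(h, w)                 # 32 8 -- still a large grid, far from (1, 1)
```

Since each piece still spans a 32 x 8 token grid after all downsampling, boundary effects stay local to the windows at the seams rather than collapsing whole pieces together.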

But if you increase the downsample rate, it could become a problem. In that case, I think one option is to give up the (1024, 64) to (256, 256) conversion and process the (1024, 64) shape directly. The only reason I do the conversion at all is that we need to use the Swin Transformer checkpoint.
