
Learnable Align Attention Implementation #294

Open

OrcunCanDeniz opened this issue Jun 28, 2022 · 1 comment
The DeepFusion paper says:

For each query (i.e., voxel cell), we conduct inner product between the query and the keys to obtain the attention affinity matrix that contains 1 × N correlations between the voxel and all its corresponding N camera features.

So I think this should lead to V x N correlations for V voxel cells, and B x V x N if we consider batches. However, the implementation's affinity = tf.einsum('bnc,bnc->bn', q, k) produces a B x N shaped tensor. I feel like this should be affinity = tf.einsum("bij,bkl->bik", q, k). I couldn't manage to wrap my head around this; what am I missing?
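
For reference, a minimal shape check of the einsum from the implementation (the concrete sizes below are made up; the einsum string itself forces q and k to share the shape [B, N, C]). If the b axis already enumerates the voxel queries, e.g. because the batch and voxel dimensions are flattened together before the attention, then the [B, N] output would be exactly one 1 x N affinity row per voxel, but I may be misreading the code:

```python
import tensorflow as tf

# Made-up sizes for illustration only.
B, N, C = 4, 6, 16   # B voxel/pillar queries, N camera features each, C channels
q = tf.random.normal([B, N, C])
k = tf.random.normal([B, N, C])

# The einsum from the implementation: a per-(b, n) inner product over channels.
affinity = tf.einsum('bnc,bnc->bn', q, k)
print(affinity.shape)  # (4, 6) -> one 1 x N affinity row per query
```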

Finally, thanks to the team for this great work.
@LiYingwei

zlenyk commented Jul 1, 2022

It sounds like the voxels they are talking about are in fact pillars, with one per BEV grid cell, but I'm not 100% sure.
Another interesting question is the definition of the "corresponding N camera features": do you know which camera points are considered for a given lidar feature?
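
For what it's worth, a generic way to set up that correspondence (not necessarily what DeepFusion does) is to project each lidar point into the image with the camera matrix and gather the image feature at the projected pixel; the N features per pillar would then presumably come from the lidar points that fall inside that pillar. A rough sketch with a hypothetical helper (names and shapes are assumptions, not from the repo):

```python
import tensorflow as tf

def gather_camera_features(points_xyz, cam_intrinsics, image_feats):
    """Illustrative point-to-pixel gathering, assuming static shapes.

    points_xyz:     [P, 3] lidar points already in the camera frame (z > 0).
    cam_intrinsics: [3, 3] camera intrinsic matrix.
    image_feats:    [H, W, C] feature map from the image backbone.
    Returns:        [P, C] one camera feature per lidar point.
    """
    # Pinhole projection: (u, v, 1) ~ K @ (x, y, z).
    proj = tf.matmul(points_xyz, cam_intrinsics, transpose_b=True)  # [P, 3]
    uv = proj[:, :2] / proj[:, 2:3]                                 # [P, 2]
    uv = tf.cast(tf.round(uv), tf.int32)

    # Clamp to the feature map bounds and do a nearest-neighbour lookup.
    h, w = image_feats.shape[0], image_feats.shape[1]
    u = tf.clip_by_value(uv[:, 0], 0, w - 1)
    v = tf.clip_by_value(uv[:, 1], 0, h - 1)
    return tf.gather_nd(image_feats, tf.stack([v, u], axis=-1))     # [P, C]
```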
