
Learnable Align Attention Implementation #294

Open

OrcunCanDeniz opened this issue Jun 28, 2022 · 1 comment
The DeepFusion paper says:

For each query (i.e., voxel cell), we conduct inner product between the query and the keys to obtain the attention affinity matrix that contains 1 × N correlations between the voxel and all its corresponding N camera features.

So I think this should lead to V x N correlations for V voxel cells, and B x V x N if we consider batches. However, the implementation's affinity = tf.einsum('bnc,bnc->bn', q, k) produces a B x N shaped tensor. I feel like this should be affinity = tf.einsum("bij,bkl->bik", q, k). I couldn't manage to wrap my head around this; what am I missing?
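
For reference, a minimal shape check of the einsum from the implementation (the concrete sizes below are made up; the einsum string itself forces q and k to share the shape [B, N, C]). If the b axis already enumerates the voxel queries, e.g. because the batch and voxel dimensions are flattened together before the attention, then the [B, N] output would be exactly one 1 x N affinity row per voxel, but I may be misreading the code:

```python
import tensorflow as tf

# Made-up sizes for illustration only.
B, N, C = 4, 6, 16   # B voxel/pillar queries, N camera features each, C channels
q = tf.random.normal([B, N, C])
k = tf.random.normal([B, N, C])

# The einsum from the implementation: a per-(b, n) inner product over channels.
affinity = tf.einsum('bnc,bnc->bn', q, k)
print(affinity.shape)  # (4, 6) -> one 1 x N affinity row per query
```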

Finally, thanks to the team for this great work.
@LiYingwei

zlenyk commented Jul 1, 2022

It sounds like the voxels they are talking about are in fact pillars, with one per BEV grid cell, but I'm not 100% sure.
Another interesting question is the definition of the "corresponding N camera features": do you know which camera points are considered for a given lidar feature?
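
For what it's worth, a generic way to set up that correspondence (not necessarily what DeepFusion does) is to project each lidar point into the image with the camera matrix and gather the image feature at the projected pixel; the N features per pillar would then presumably come from the lidar points that fall inside that pillar. A rough sketch with a hypothetical helper (names and shapes are assumptions, not from the repo):

```python
import tensorflow as tf

def gather_camera_features(points_xyz, cam_intrinsics, image_feats):
    """Illustrative point-to-pixel gathering, assuming static shapes.

    points_xyz:     [P, 3] lidar points already in the camera frame (z > 0).
    cam_intrinsics: [3, 3] camera intrinsic matrix.
    image_feats:    [H, W, C] feature map from the image backbone.
    Returns:        [P, C] one camera feature per lidar point.
    """
    # Pinhole projection: (u, v, 1) ~ K @ (x, y, z).
    proj = tf.matmul(points_xyz, cam_intrinsics, transpose_b=True)  # [P, 3]
    uv = proj[:, :2] / proj[:, 2:3]                                 # [P, 2]
    uv = tf.cast(tf.round(uv), tf.int32)

    # Clamp to the feature map bounds and do a nearest-neighbour lookup.
    h, w = image_feats.shape[0], image_feats.shape[1]
    u = tf.clip_by_value(uv[:, 0], 0, w - 1)
    v = tf.clip_by_value(uv[:, 1], 0, h - 1)
    return tf.gather_nd(image_feats, tf.stack([v, u], axis=-1))     # [P, C]
```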
