
September 2021

tl;dr: Transformers to lift image to BEV.

Overall impression

This paper uses a cross-attention transformer structure (although the authors do not spell this out explicitly) to lift image features to BEV, then performs road layout and vehicle segmentation on the BEV features.

It is difficult for a CNN to fit a view projection model due to the locally confined receptive fields of convolutional layers. Transformers are better suited to this job thanks to their global attention mechanism.

Road layout provides crucial context information for inferring the position and orientation of vehicles. The paper introduces a context-aware discriminator loss to refine the results.

Key ideas

  • CVP (cycled view projection); see the first sketch after this list.
    • A 2-layer MLP projects the image feature X to the BEV feature X', following VPN.
    • A cycle-consistency loss (X' is projected back to X'', which should reconstruct X) ensures that X' retains most of the information in X.
  • CVT (cross-view transformer); see the first sketch after this list.
    • X' serves as the query; X and X'' serve as key and value.
  • Context-aware discriminator; see the second sketch after this list. This follows MonoLayout but takes it one step further.
    • It distinguishes predicted and gt vehicles.
    • It distinguishes predicted and gt correlation between vehicles and the road.
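
Below is a minimal PyTorch sketch of the CVP + CVT pipeline as I read it. The tensor shapes, the L1 form of the cycle loss, and the residual connection are my assumptions for illustration, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CycledViewProjection(nn.Module):
    """CVP: 2-layer MLPs over the flattened spatial dimension project the
    image feature X to the BEV feature X', then cycle it back to X''."""
    def __init__(self, hw):
        super().__init__()
        self.fwd = nn.Sequential(nn.Linear(hw, hw), nn.ReLU(), nn.Linear(hw, hw))
        self.bwd = nn.Sequential(nn.Linear(hw, hw), nn.ReLU(), nn.Linear(hw, hw))

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        flat = x.flatten(2)                    # (B, C, H*W)
        x_bev = self.fwd(flat)                 # X'
        x_cyc = self.bwd(x_bev)                # X''
        # cycle-consistency loss: X'' should reconstruct X, which forces
        # X' to retain most of the information in X (L1 is an assumption)
        cycle_loss = F.l1_loss(x_cyc, flat)
        return (x_bev.reshape(b, c, h, w),
                x_cyc.reshape(b, c, h, w), cycle_loss)

class CrossViewTransformer(nn.Module):
    """CVT: single-head cross attention with X' as query and, per the
    note above, X as key and X'' as value."""
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Conv2d(dim, dim, 1)
        self.to_k = nn.Conv2d(dim, dim, 1)
        self.to_v = nn.Conv2d(dim, dim, 1)

    def forward(self, x_bev, x_img, x_cyc):
        b, c, h, w = x_bev.shape
        q = self.to_q(x_bev).flatten(2).transpose(1, 2)   # (B, HW, C)
        k = self.to_k(x_img).flatten(2)                   # (B, C, HW)
        v = self.to_v(x_cyc).flatten(2).transpose(1, 2)   # (B, HW, C)
        attn = (q @ k / c ** 0.5).softmax(dim=-1)         # (B, HW, HW)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return out + x_bev                                # residual (assumption)

x = torch.randn(2, 64, 16, 16)            # front-view image feature X
cvp = CycledViewProjection(hw=16 * 16)
cvt = CrossViewTransformer(dim=64)
x_bev, x_cyc, cycle_loss = cvp(x)
bev = cvt(x_bev, x, x_cyc)                # would feed the BEV decoder
```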
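
The context-aware discriminator can be sketched as two small patch discriminators: one judging vehicle masks alone, and one judging the vehicle mask concatenated with the road layout, so that the vehicle-road correlation is also adversarially supervised. The PatchGAN-style architecture and the vanilla BCE GAN loss here are assumptions for illustration.

```python
import torch
import torch.nn as nn

def patch_disc(in_ch):
    # small conv discriminator mapping a BEV mask to patch-wise real/fake logits
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 4, 2, 1), nn.LeakyReLU(0.2),
        nn.Conv2d(32, 64, 4, 2, 1), nn.LeakyReLU(0.2),
        nn.Conv2d(64, 1, 4, 1, 1),
    )

d_vehicle = patch_disc(1)   # predicted vs gt vehicle masks
d_context = patch_disc(2)   # predicted vs gt vehicle-road pairs
bce = nn.BCEWithLogitsLoss()

def d_loss(disc, fake, real):
    f, r = disc(fake.detach()), disc(real)
    return bce(f, torch.zeros_like(f)) + bce(r, torch.ones_like(r))

pred_veh = torch.sigmoid(torch.randn(2, 1, 64, 64))   # predicted vehicle mask
gt_veh = torch.randint(0, 2, (2, 1, 64, 64)).float()  # gt vehicle mask
road = torch.randint(0, 2, (2, 1, 64, 64)).float()    # road layout as context

loss_d = (d_loss(d_vehicle, pred_veh, gt_veh)
          + d_loss(d_context, torch.cat([pred_veh, road], 1),
                              torch.cat([gt_veh, road], 1)))
```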

Technical details

  • Summary of technical details

Notes