depth_coeff.md

December 2019

tl;dr: Encoding depth with a simplified one-hot encoding (depth coefficients, DC) and training with a cross-entropy loss reduces over-smoothing in depth estimation.

Overall impression

Similar in spirit to SMWA, which addresses the "long tail" problem. This problem is also referred to as edge bleeding, over-smoothing, or mixed depth. It shows up as spurious depth estimates in mid-air and as surfaces connecting separate objects.

DC focuses on depth completion while SMWA focuses on depth estimation from stereo pairs.

It also acknowledges that depth estimation is a multi-modal problem, and that regressing with an L1 or L2 loss leads to spurious estimates in between modes. --> The same idea is used in generalized focal loss to model the multi-modal distribution of the edges of heavily occluded objects.
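
A toy illustration of the in-between-modes effect, using made-up numbers rather than anything from the paper: a boundary pixel whose plausible depth is 5 m (foreground, 60% of the mass) or 20 m (background, 40%). The variable names and the 1 m bin layout are assumptions for this sketch.

```python
import numpy as np

# Assumed toy target for one boundary pixel: 6 samples at 5 m, 4 samples at 20 m.
samples = np.array([5.0] * 6 + [20.0] * 4)

# L2 regression: the scalar minimizing mean squared error is the sample mean,
# which lands in mid-air between the two surfaces.
candidates = np.linspace(0.0, 25.0, 251)  # 0.1 m grid of candidate depths
mse = ((samples[None, :] - candidates[:, None]) ** 2).mean(axis=1)
l2_estimate = candidates[np.argmin(mse)]

# Cross entropy over depth bins: the categorical distribution minimizing
# cross entropy against the samples is their normalized histogram,
# so both modes survive in the prediction.
bin_edges = np.arange(-0.5, 26.0, 1.0)  # 1 m bins centered on 0, 1, ..., 25
bin_centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])
hist, _ = np.histogram(samples, bins=bin_edges)
probs = hist / hist.sum()
ce_mode = bin_centers[np.argmax(probs)]

print(f"L2 estimate: {l2_estimate:.1f} m, strongest cross-entropy mode: {ce_mode:.1f} m")
```

The L2 estimate sits on neither surface, while the binned distribution keeps mass only at the two real depths.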

The idea of using an N-channel but 3-hot depth encoding is similar to the soft one-hot encoding used in SMWA, which also uses cross entropy for regression. DC gives a better explanation of why cross entropy is a better loss than L1 or L2.

The input and loss modifications are based on sparse-to-dense and are easy to implement.

Key ideas

  • One-hot encoding of depth combined with a cross-entropy loss solves the mixed-depth problem.
    • Direct one-hot encoding may lead to depth samples that are too sparse, so intentional information leaking by (Gaussian) blurring along the depth direction increases the number of samples available to the convolutions.
  • Cross-entropy loss over depth bins j for pixel i. For each pixel i, only 3 bins have non-zero $c_{ij}$. This is similar to the idea of the NLL loss used in depth from one line. $$L^{ce}(c_{ij}) = -\sum_{j=1}^N c_{ij}\log\tilde{c}_{ij}$$
  • RMSE favors over-smoothed depth estimation and thus is not a reliable metric.
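
A minimal sketch of an N-channel, 3-hot encoding and the cross-entropy loss above. The specific weighting (1/3 on the nearest bin, with the two neighbors adjusted so the coefficients sum to 1 and their weighted sum reproduces the depth) is an assumption chosen for illustration and may differ from the paper's exact formula. `encode_3hot`, `num_bins`, and `bin_width` are hypothetical names.

```python
import numpy as np

def encode_3hot(depth, num_bins=80, bin_width=1.0):
    """Encode depths as 3-hot coefficient vectors (assumed scheme, not the paper's).

    Bin centers are D_j = (j + 0.5) * bin_width. Depths are assumed to lie in
    [bin_width, (num_bins - 1) * bin_width) so both neighbor bins exist and
    all three coefficients stay positive.
    """
    depth = np.atleast_1d(np.asarray(depth, dtype=np.float64)).ravel()
    centers = (np.arange(num_bins) + 0.5) * bin_width
    # Index of the bin containing each depth, kept away from the two end bins.
    k = np.clip(np.floor(depth / bin_width).astype(int), 1, num_bins - 2)
    offset = (depth - centers[k]) / bin_width  # in [-0.5, 0.5) for valid depths
    coeffs = np.zeros((depth.size, num_bins))
    rows = np.arange(depth.size)
    coeffs[rows, k - 1] = 1.0 / 3.0 - offset / 2.0
    coeffs[rows, k] = 1.0 / 3.0
    coeffs[rows, k + 1] = 1.0 / 3.0 + offset / 2.0
    return coeffs, centers

def cross_entropy(c, c_tilde, eps=1e-12):
    """Per-pixel loss -sum_j c_ij * log(c_tilde_ij); eps guards against log(0)."""
    return -np.sum(c * np.log(c_tilde + eps), axis=-1)

depths = np.array([10.2, 37.9])
c, centers = encode_3hot(depths)
uniform = np.full_like(c, 1.0 / c.shape[1])
loss_match = cross_entropy(c, c)          # prediction equals the target
loss_uniform = cross_entropy(c, uniform)  # uninformative prediction
```

With this weighting, `c @ centers` recovers the original depths, and a prediction that matches the target coefficients scores a lower loss than a uniform one.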

Technical details

  • Depth reconstruction: either a weighted average over all bins, or a weighted average restricted to a single mode (eq 7). --> However, the paper does not go into detail on this.
  • The output dense depth leads to improved lidar performance. --> This is to be compared with pseudo lidar e2e, which suffers from the long tail problem.
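
Since the note above says the paper gives few details on reconstruction, the following is only a sketch of the two options under assumptions: the single-mode variant is taken to mean a weighted average over a small window around the strongest bin. `reconstruct_mean`, `reconstruct_single_mode`, and `radius` are hypothetical names, and the predicted distribution is made up.

```python
import numpy as np

def reconstruct_mean(probs, centers):
    """Weighted average over all bins; mixes modes if the prediction is multi-modal."""
    return float(probs @ centers)

def reconstruct_single_mode(probs, centers, radius=1):
    """Weighted average over bins within `radius` of the strongest bin only."""
    k = int(np.argmax(probs))
    lo, hi = max(k - radius, 0), min(k + radius + 1, len(probs))
    weights = probs[lo:hi]
    return float(weights @ centers[lo:hi] / weights.sum())

# Assumed bimodal prediction: a strong mode around bin 5, a weaker one around bin 20.
centers = np.arange(30) + 0.5  # 1 m bins with centers 0.5, 1.5, ..., 29.5
probs = np.zeros(30)
probs[4:7] = [0.1, 0.4, 0.1]
probs[19:22] = [0.1, 0.2, 0.1]

depth_all = reconstruct_mean(probs, centers)            # falls between the modes
depth_mode = reconstruct_single_mode(probs, centers)    # stays on the stronger mode
```

On this input the all-bin average lands between the two surfaces, while the single-mode average stays at the center of the dominant mode.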

Notes

  • But after thinking about this again: how does changing the one-hot encoding into a soft one-hot encoding help alleviate the problem? How does cross entropy come to the rescue when N degenerates to 1? Then it becomes softmax loss. --> Does cross entropy enable multi-modal predictions?