ArpitaSTugave/Depth-Estimation-using-CNN

Depth Estimation using Data Driven Approaches

Introduction

     Time of Flight, Structured Light, and Stereo technologies have been used widely for depth map estimation. Each of these comes with its own pros and cons in terms of speed of image capture, structural description, and ambient-light performance. Monocular cues such as texture and gradient variation, shading, color/haze, and defocus aid in accurate depth estimation, but the statistical models built on them are complex and susceptible to noise. Recently, data-driven approaches such as deep learning have been employed for depth estimation. These approaches are less prone to noise, provided enough data is available to learn both coarse and fine details.

Convolution Neural Networks - CNN

     In deep learning, CNNs are widely used for image-processing applications. Convolution layers are the basic building block of a CNN, typically combined with pooling and ReLU activation layers. The kernels in each layer are learned through backpropagation: the CNN learns features from the input images by applying varied filters across the image, generating feature maps at each layer. As we go deeper into the network, the feature maps come to represent increasingly complex features and objects. ConvNets have been very successful for image classification, and more recently have been used for image prediction and other applications. The addition of upscaling and deconvolution layers makes it possible to expand the compressed feature maps back to full resolution, so the network can predict dense data (such as a depth map) rather than a class label.
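The conv → ReLU → pool pipeline described above can be sketched in a few lines of NumPy. This is a minimal single-channel illustration; the kernel values and sizes are arbitrary, not taken from the project:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution of a single-channel image with one kernel."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def relu(x):
    """Activation: keep positive responses, zero out the rest."""
    return np.maximum(x, 0.0)

def max_pool(x, size=2):
    """Downsample by taking the max over non-overlapping size x size blocks."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

# One feature map: convolve, apply ReLU, then downsample by pooling.
image = np.random.rand(8, 8)
kernel = np.array([[1., 0., -1.]] * 3)   # simple vertical-edge detector
fmap = max_pool(relu(conv2d(image, kernel)))
print(fmap.shape)  # (3, 3)
```

Stacking several such stages is what lets deeper layers respond to progressively larger and more complex structures in the input.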


Related Work

     Deep3D [1] is a fully automatic 2D-to-3D conversion algorithm that takes 2D images or video frames as input and outputs 3D stereo image pairs. David Eigen of NYU proposed a single-monocular-image architecture, the Multi-Scale Network [2], that employs two deep network stacks: one that makes a coarse global prediction based on the entire image, and another that refines this prediction locally; it is trained on a real-world dataset. “FlowNet: Learning Optical Flow with Convolutional Networks” [3] uses synthetically generated video to teach the network motion parameters and thereby extract optical flow. “Stereo Matching by Training a Convolutional Neural Network to Compare Image Patches” [4] presents a method for extracting depth information from stereo data and their respective patches. Similarly to [4], “Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs” [5] uses image patches at different scales to extract depth information.

[Figure: Multi-Scale Network [2]]

[Figure: FlowNet [3]]

Methods

A.   Stereo ConvNet Architecture

     The images and ground-truth depth maps used for training, validation, and testing are produced by varying the orientation of a 3D model generated with the Blender software tool. As our first step, we use StereoConvNet [6]; the first half of the network is shown below. The second half of the network mirrors the first, replacing convolution with deconvolution and pooling with upscaling. Although the input consists of a concatenated left and right image pair, the network treats it as two separate input images. The reference output label is the ground-truth depth map generated using Blender's "Mist" function.

[Figure: Stereo ConvNet architecture (first half)]
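The mirror structure can be sketched in NumPy: 2x2 max pooling halves the feature-map resolution on the encoder side, and a nearest-neighbour upscale (a simple stand-in for the learned deconvolution/upscaling layers) restores it on the decoder side. This is an illustrative sketch of the shape bookkeeping, not the repository's actual layer code:

```python
import numpy as np

def max_pool(x, size=2):
    """Encoder side: 2x2 max pooling halves the spatial resolution."""
    h, w = x.shape[0] // size, x.shape[1] // size
    return x[:h*size, :w*size].reshape(h, size, w, size).max(axis=(1, 3))

def upscale(x, factor=2):
    """Decoder side: nearest-neighbour upscaling, the mirror of pooling."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)

# A feature map pooled then upscaled returns to its original resolution,
# which is why the mirrored second half can output a full-size depth map.
fmap = np.random.rand(32, 32)
restored = upscale(max_pool(fmap))
print(restored.shape)  # (32, 32)
```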

B.   Deeper Stereo ConvNet Architecture

     In Deeper Stereo ConvNet, the input remains the same, but the architecture is extended with an extra convolution and deconvolution layer. The depth of the filters is also increased, following [3], in order to capture more detail.
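To see why wider filters and an extra layer increase capacity (and, as the results below show, test time), consider the parameter count of a single convolution layer; the channel counts here are illustrative, not the project's actual configuration:

```python
def conv_params(kh, kw, cin, cout):
    """Parameters of one conv layer: one kh x kw x cin kernel per output
    channel, plus one bias per output channel."""
    return kh * kw * cin * cout + cout

# Doubling the filter depth roughly quadruples the middle-layer parameters.
print(conv_params(3, 3, 32, 32))  # 9248
print(conv_params(3, 3, 64, 64))  # 36928
```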

C.   Patched Deeper Stereo ConvNet Architecture

     Following [4] and [5], the number of input streams is increased to six for Patched Deeper Stereo ConvNet by decomposing the left image into four scaled parts. As in the referenced papers, a more accurate depth map is therefore expected.

[Figure: Patched Deeper Stereo ConvNet architecture]
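The four-scale decomposition of the left image can be sketched as an image pyramid. The exact scale factors and resizing method used in the project are not specified, so the ones below are assumptions:

```python
import numpy as np

def downscale(img, factor):
    """Block-average downscaling (a simple stand-in for proper image resizing)."""
    h, w = img.shape[0] // factor, img.shape[1] // factor
    return img[:h*factor, :w*factor].reshape(h, factor, w, factor).mean(axis=(1, 3))

# Six input streams: the stereo pair plus four scaled copies of the left image.
left = np.random.rand(64, 64)
right = np.random.rand(64, 64)
pyramid = [downscale(left, 2 ** i) for i in range(1, 5)]  # assumed factors 2, 4, 8, 16
streams = [left, right] + pyramid
print(len(streams))       # 6
print(pyramid[-1].shape)  # (4, 4)
```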

Results

      Stereo ConvNet Architecture
          + Smooth, without holes
          + Coarse structure preserved
          - Blurred at edges
          - Sharp structures lost
          - Fine objects smeared or lost
          Time to test = 20 s

      Deeper Stereo ConvNet Architecture
          + Smooth, without holes
          + Coarse structure preserved
          + Sharper edges
          - Still noisy at the edges
          - Fine details/objects smeared or lost
          Note: the increased depth of the network learns more detail about the scene.
          Time to test = 70 s

      Patched Deeper Stereo ConvNet Architecture
          + Smooth, without holes
          + Fine structure preserved
          + Depth predicted with less noise
          - Time to train and test increases
          Note: the increased depth and input resolution of the network capture more detail about the scene.
          Time to test = 145 s

Stereo ConvNet Architecture:

[Figure: predicted depth maps]

Deeper Stereo ConvNet Architecture:

[Figure: predicted depth maps]

Patched Deeper Stereo ConvNet Architecture:

[Figure: predicted depth maps]

3D modeling for Patched Deeper Stereo ConvNet Architecture:

[Figure grid: input image | expected output | derived output]

Conclusion

      Data-driven depth estimation approaches are effective when a sufficiently large, descriptive, labelled dataset is available. Patched Deeper Stereo ConvNet predicts depth maps very similar to the ground truth. The time to train the network is directly proportional to the depth and complexity of the CNN architecture. In further work, we plan to combine the architecture of our Patched Deeper Stereo ConvNet with the Multi-Scale Deep Network [2] and observe the results on real-world images.

References

[1]     “Deep3D: Fully Automatic 2D-to-3D Video Conversion with Deep Convolutional Neural Networks,” Junyuan Xie, Ross Girshick, Ali Farhadi, University of Washington.

[2]     “Depth Map Prediction from a Single Image using a Multi-Scale Deep Network,” David Eigen, Christian Puhrsch, Rob Fergus, Dept. of Computer Science, Courant Institute, New York University.

[3]     “FlowNet: Learning Optical Flow with Convolutional Networks,” A. Dosovitskiy, P. Fischer, et al., ICCV, 2015.

[4]     “Stereo Matching by Training a Convolutional Neural Network to Compare Image Patches,” Jure Žbontar, University of Ljubljana, Yann LeCun, Journal of Machine Learning Research 17 (2016).

[5]     “Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs,” Bo Li, Chunhua Shen, Yuchao Dai, Anton van den Hengel, Mingyi He, IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2015).

[6]     https://github.com/LouisFoucard/StereoConvNet
