
Dataset and benchmark for indoor monocular depth estimation with a focus on indoor space types


InSpaceType: Reconsider Space Type in Indoor Monocular Depth Estimation

This repository includes code for reproducing the benchmark results in the paper.

This work was presented at the CoRL 2023 OOD Workshop.

[Long version paper] [Supplementary Material] [Workshop Page] [Project Page]

Data

Sample data: This contains 167 MB of sample data.

InSpaceType eval set: This contains 1,260 RGBD pairs for evaluation, about 11.5 GB.

For evaluation, please download the eval set, extract it under this root folder, and rename it to 'InSpaceType'.

Specifically, the data structure should be

InSpaceType
        |---- 0001.pfm
        |---- 0001_L.jpg
        |---- 0002.pfm
        |---- 0002_L.jpg
        |---- 0003.pfm
        |---- 0003_L.jpg
        |---- ...
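For a quick sanity check, here is a minimal Python sketch for loading one RGB-D pair from the eval set. The PFM reader below is a generic parser; the exact byte order and depth scale of the dataset's .pfm files are assumptions, so verify against the loaders in each method's subfolder.

```python
import numpy as np
from PIL import Image

def read_pfm(path):
    """Read a PFM file into a float32 numpy array of shape (H, W) or (H, W, 3)."""
    with open(path, "rb") as f:
        header = f.readline().decode().rstrip()      # "Pf" = grayscale, "PF" = color
        assert header in ("Pf", "PF"), "not a PFM file"
        width, height = map(int, f.readline().decode().split())
        scale = float(f.readline().decode().rstrip())
        endian = "<" if scale < 0 else ">"           # negative scale => little-endian
        data = np.fromfile(f, dtype=endian + "f")
        shape = (height, width) if header == "Pf" else (height, width, 3)
        return np.flipud(data.reshape(shape))        # PFM stores rows bottom-up

rgb = np.asarray(Image.open("InSpaceType/0001_L.jpg"))   # (H, W, 3) uint8
depth = read_pfm("InSpaceType/0001.pfm")                 # (H, W) float32; metric depth assumed
print(rgb.shape, depth.shape, float(depth.min()), float(depth.max()))
```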

SOTA update

| UniDepth | AbsRel | SqRel | RMSE | delta_1 | delta_2 | delta_3 |
|---|---|---|---|---|---|---|
| Private room | 0.0609 | 0.0163 | 0.1805 | 96.94 | 99.50 | 99.89 |
| Office | 0.0788 | 0.0322 | 0.2608 | 93.25 | 98.44 | 99.64 |
| Hallway | 0.0871 | 0.0566 | 0.4608 | 90.83 | 97.14 | 99.10 |
| Lounge | 0.0860 | 0.0663 | 0.5312 | 93.24 | 98.39 | 99.41 |
| Meeting Room | 0.0764 | 0.0330 | 0.2992 | 94.66 | 99.12 | 99.79 |
| Large Room | 0.0871 | 0.0688 | 0.5370 | 94.10 | 98.46 | 99.26 |
| Classroom | 0.0601 | 0.0233 | 0.2589 | 96.67 | 99.52 | 99.93 |
| Library | 0.0800 | 0.0498 | 0.4505 | 93.63 | 98.27 | 99.40 |
| Kitchen | 0.0662 | 0.0235 | 0.2027 | 95.42 | 98.80 | 99.34 |
| Playroom | 0.0544 | 0.0182 | 0.1983 | 95.60 | 99.30 | 99.89 |
| Living Room | 0.0670 | 0.0312 | 0.2758 | 94.87 | 98.54 | 99.58 |
| Bathroom | 0.0765 | 0.0252 | 0.1600 | 96.40 | 98.00 | 98.52 |
| Overall | 0.0718 | 0.0349 | 0.3132 | 94.76 | 98.75 | 99.62 |

Summary: UniDepth performs remarkably well, with much higher performance than the other methods on this zero-shot benchmark.

In short, Depth-Anything outperforms ZoeDepth and VPD in the error metrics and in delta_3. Depth-Anything is also surprisingly robust across space types: its errors and delta accuracies show little variation from type to type. ZoeDepth, by contrast, performs much worse (in errors and delta_1) on large rooms, kitchens, and hallways, where Depth-Anything remains steady.

| Depth-Anything | AbsRel | SqRel | RMSE | delta_1 | delta_2 | delta_3 |
|---|---|---|---|---|---|---|
| Private room | 0.0836 | 0.0297 | 0.2500 | 91.87 | 98.78 | 99.80 |
| Office | 0.0990 | 0.0454 | 0.3130 | 88.12 | 97.61 | 99.45 |
| Hallway | 0.1074 | 0.0683 | 0.4726 | 87.02 | 96.55 | 99.05 |
| Lounge | 0.1053 | 0.0920 | 0.6310 | 88.34 | 97.69 | 99.33 |
| Meeting Room | 0.0899 | 0.0462 | 0.3591 | 90.28 | 98.52 | 99.76 |
| Large Room | 0.1051 | 0.0974 | 0.6860 | 88.61 | 97.91 | 99.27 |
| Classroom | 0.0870 | 0.0426 | 0.3465 | 90.24 | 98.61 | 99.81 |
| Library | 0.0947 | 0.0641 | 0.5150 | 90.08 | 98.09 | 99.43 |
| Kitchen | 0.0863 | 0.0390 | 0.2480 | 91.97 | 97.99 | 99.13 |
| Playroom | 0.0871 | 0.0416 | 0.3453 | 91.23 | 98.38 | 99.62 |
| Living Room | 0.0921 | 0.0472 | 0.3448 | 90.26 | 97.80 | 99.47 |
| Bathroom | 0.0860 | 0.0337 | 0.1965 | 92.71 | 99.72 | 98.76 |
| Overall | 0.0928 | 0.0506 | 0.3806 | 90.01 | 98.09 | 99.54 |

| ZoeDepth (N) | AbsRel | SqRel | RMSE | delta_1 | delta_2 | delta_3 |
|---|---|---|---|---|---|---|
| Private room | 0.0798 | 0.0253 | 0.2218 | 93.86 | 99.44 | 99.90 |
| Office | 0.0978 | 0.0447 | 0.3102 | 90.07 | 97.80 | 99.52 |
| Hallway | 0.1193 | 0.0741 | 0.5271 | 85.84 | 96.95 | 99.10 |
| Lounge | 0.1172 | 0.1110 | 0.7033 | 86.61 | 97.02 | 99.11 |
| Meeting Room | 0.0887 | 0.0391 | 0.3379 | 93.55 | 99.05 | 99.79 |
| Large Room | 0.1565 | 0.1701 | 0.9157 | 77.19 | 94.95 | 99.33 |
| Classroom | 0.0719 | 0.0282 | 0.2873 | 95.37 | 99.43 | 99.91 |
| Library | 0.1163 | 0.0875 | 0.6274 | 87.34 | 96.93 | 98.99 |
| Kitchen | 0.1256 | 0.0589 | 0.2825 | 87.10 | 96.70 | 98.00 |
| Playroom | 0.0790 | 0.0300 | 0.2508 | 94.21 | 98.09 | 99.88 |
| Living Room | 0.0862 | 0.0390 | 0.3036 | 91.72 | 98.31 | 99.58 |
| Bathroom | 0.1065 | 0.0358 | 0.1880 | 93.64 | 97.64 | 98.17 |
| Overall | 0.0962 | 0.0533 | 0.3890 | 90.81 | 98.22 | 99.52 |

| VPD | AbsRel | SqRel | RMSE | delta_1 | delta_2 | delta_3 |
|---|---|---|---|---|---|---|
| Private room | 0.0768 | 0.0325 | 0.2428 | 93.76 | 98.70 | 99.60 |
| Office | 0.1019 | 0.0548 | 0.3374 | 89.52 | 97.52 | 99.26 |
| Hallway | 0.1174 | 0.0828 | 0.5233 | 86.66 | 96.40 | 98.72 |
| Lounge | 0.1313 | 0.1286 | 0.7306 | 83.06 | 95.90 | 98.83 |
| Meeting Room | 0.0992 | 0.0568 | 0.3742 | 91.51 | 98.26 | 99.43 |
| Large Room | 0.1292 | 0.1460 | 0.7903 | 83.90 | 95.87 | 98.89 |
| Classroom | 0.0744 | 0.0344 | 0.3034 | 94.73 | 98.94 | 99.72 |
| Library | 0.1258 | 0.1311 | 0.6885 | 85.22 | 95.34 | 98.62 |
| Kitchen | 0.0958 | 0.0538 | 0.2469 | 92.56 | 96.36 | 98.64 |
| Playroom | 0.0735 | 0.0313 | 0.2574 | 95.17 | 98.38 | 99.83 |
| Living Room | 0.0910 | 0.0510 | 0.3490 | 91.45 | 97.59 | 99.25 |
| Bathroom | 0.0955 | 0.0472 | 0.2279 | 92.95 | 97.04 | 97.98 |
| Overall | 0.0983 | 0.0661 | 0.1412 | 90.23 | 97.48 | 99.22 |

Analysis tool

Go to each method's subfolder (last updated: 12 recent SOTA methods; the list keeps growing with current trends) and see its README for evaluation instructions. Each run produces a set of *.txt files: *-all gives the overall performance, H0-H2 denote the level of hierarchy, and H1_xx denotes the scene space type number. See space_type_def.yml for the space type number definitions.
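As a convenience, here is a hedged sketch for collecting the per-type breakdown from these files. The exact *.txt layout and the schema of space_type_def.yml are assumptions (only the *-all / H0-H2 / H1_xx naming is documented above), so adapt the parsing to what the eval scripts actually emit.

```python
import glob
import yaml  # PyYAML

# Assumed: space_type_def.yml maps space type numbers to type names.
with open("space_type_def.yml") as f:
    type_names = yaml.safe_load(f)

# Assumed: one H1_<type_number>.txt file per space type in a results folder.
for path in sorted(glob.glob("results/H1_*.txt")):
    type_num = int(path.split("H1_")[-1].split(".")[0])
    with open(path) as f:
        print(type_names.get(type_num, type_num), "->", f.read().strip())
```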

InSpaceType all data: This contains the whole InSpaceType dataset, 40K RGBD pairs, about 500 GB. The data is split into 8 chunks; please download all chunks in the folder and extract them under the root folder. All filepath reads are written relative to the root path.

The data is indexed by <seq_num> folders; each folder contains images and depth maps.
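A short illustrative walk over the full dataset follows; the per-folder file naming is extrapolated from the eval-set layout (0001.pfm / 0001_L.jpg) and is an assumption, so check a downloaded chunk before relying on it.

```python
from pathlib import Path

root = Path(".")  # repo root, with all 8 chunks extracted here
for seq_dir in sorted(d for d in root.iterdir() if d.is_dir() and d.name.isdigit()):
    # Pair each depth map with its RGB image (naming assumed to match the eval set).
    pairs = [(p, p.with_name(p.stem + "_L.jpg")) for p in sorted(seq_dir.glob("*.pfm"))]
    print(seq_dir.name, len(pairs), "RGB-D pairs")
```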

Analysis in a nutshell

Analysis I-II [Benchmark on overall performance and space type breakdown]:
The table shows the challenging InSpaceType benchmark for overall performance, following the standard monocular depth estimation protocol and metrics. We adopt the following publicly released models trained on NYUv2 for evaluation. Recent work using larger-scale backbone models obtains lower errors and higher accuracy.
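For reference, the metrics reported in the tables above are the standard monocular depth metrics. Below is a minimal NumPy sketch of how they are commonly computed, assuming pred and gt are aligned metric depth maps (any evaluation crop or scaling already applied).

```python
import numpy as np

def depth_metrics(pred, gt):
    """Standard depth metrics; pred and gt are metric depth maps of equal shape."""
    valid = gt > 0                                    # ignore invalid ground-truth pixels
    pred, gt = pred[valid], gt[valid]
    thresh = np.maximum(gt / pred, pred / gt)
    return {
        "AbsRel":  np.mean(np.abs(pred - gt) / gt),
        "SqRel":   np.mean((pred - gt) ** 2 / gt),
        "RMSE":    np.sqrt(np.mean((pred - gt) ** 2)),
        "delta_1": np.mean(thresh < 1.25) * 100,      # reported as percentages
        "delta_2": np.mean(thresh < 1.25 ** 2) * 100,
        "delta_3": np.mean(thresh < 1.25 ** 3) * 100,
    }
```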

A sample space type breakdown is shown as follows. Unlike conventional monocular depth estimation protocols, our work pioneers the study of a per-space-type breakdown for each method. This provides a hint of how a method performs across different space types. From the following table, one can observe that state-of-the-art models suffer from a major performance imbalance issue. Both methods have similar easy and hard types, which are potentially inherited from NYUv2. Directly deploying those SOTA models in in-the-wild applications raises robustness concerns.

Analysis III [Training datasets]:
In addition to NYUv2, we analyze other popular training datasets for indoor monocular depth estimation: Hypersim, SimSIN, and UniSIN. Models trained on each of these datasets also suffer from imbalance between space types, revealing their underlying biases. We also find that kitchen is a special type with lower RMSE but very low accuracy. We assume this is because kitchens contain many cluttered small objects in the near field, such as bottles, kitchenware, and utensils. Current synthetic datasets (SimSIN, Hypersim) may not attain the same level of simulation complexity, so training on them may not match real-world complexity.

Analysis IV-V [Dataset fitting and bias mitigation]:
We first create a larger InSpaceType training set and study a dataset fitting problem. This aims to show how well each space type fits when training on all types together, and attempts to observe concordance between types. From the table, large room and lounge are large spaces and naturally result in slightly higher RMSE. It is worth noting an apparent trend: for errors, larger rooms and longer ranges tend to have higher estimation error; for accuracy, arbitrarily arranged small objects in the near field are challenging, a frequent scenario for kitchens.

We then study three strategies to alleviate type imbalance: meta-learning (ML), class reweighting (CR), and a class-balanced sampler (CBS). One can find that CBS and ML are the better strategies, attaining a lower standard deviation across types (t-STD) and better overall performance. Though CR also attains a lower t-STD, its overall performance drops as well, because CR can harm head-class performance, as observed in the literature.
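As an illustration of the CBS idea (not the paper's exact implementation), here is a minimal PyTorch sketch: each training sample is drawn with probability inversely proportional to its space type's frequency, so rare types are seen about as often as common ones. The space_type_of labels are a hypothetical stand-in for however the dataset exposes per-sample types.

```python
from collections import Counter
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_type_balanced_loader(dataset, space_type_of, batch_size=8):
    """Oversample rare space types so each type is drawn roughly equally often."""
    counts = Counter(space_type_of)                     # samples per space type
    weights = [1.0 / counts[t] for t in space_type_of]  # rarer type => higher weight
    sampler = WeightedRandomSampler(weights, num_samples=len(space_type_of), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```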

Analysis VI [Intra-group generalization]:
We next investigate generalization to unseen types. We divide the whole InSpaceType training set into splits, train on each division, and then evaluate on the InSpaceType eval split. The training set is divided into three groups based on similarity between types; this models a situation where one collects training data only for the functionality that matches the primary application scenario, without considering other user scenarios. The left half shows generalization to other types, and the right half shows evaluation on different depth ranges. Training on a specific group produces good performance on its dedicated types. However, training on only some types runs into severe generalization issues on unseen types, which further reveals the high variation between indoor environments: knowledge pretrained on some types may not easily transfer to others.

Conclusion:
Unlike previous methods that focus on algorithmic development, ours is the first work to consider space types in indoor monocular depth estimation for robustness and practicality in deployment. We point out a limitation of previous evaluations, where performance variance across types is overlooked, and present a novel dataset, InSpaceType, along with a hierarchical space type definition to facilitate our study. We give thorough studies to analyze and benchmark performance based on space types. Ten high-performing methods are examined, and we find they suffer from severe performance imbalance between space types. We analyze a total of 4 training datasets and enumerate their strong and weak space types. 3 popular strategies, namely class reweighting, a type-balanced sampler, and meta-learning, are studied to mitigate the imbalance. Further, we find generalization to unseen space types challenging due to the high diversity of objects and mismatched scales across types. Overall, this work pursues a practical purpose and emphasizes the importance of a usually overlooked factor: space type in indoor environments. We call attention to the safety concerns of deploying models without considering performance variance across space types.

Sample hierarchy labeling and breakdown

Please refer to the paper and the supplementary for the full results.

Bibtex

If you find our work useful, please consider citing our work:

@article{wu2023inspacetype,
  title={InSpaceType: Reconsider Space Type in Indoor Monocular Depth Estimation},
  author={Wu, Cho-Ying and Gao, Quankai and Hsu, Chin-Cheng and Wu, Te-Lin and Chen, Jing-Wen and Neumann, Ulrich},
  journal={arXiv preprint arXiv:2309.13516},
  year={2023}
}

License

The dataset is CC BY-SA 4.0 licensed.