
MMSNet: Multi-Modal scene recognition using multi-scale encoded features

This repository provides the implementation of the following paper:

MMSNet: Multi-Modal scene recognition using multi-scale encoded features
Ali Caglayan *, Nevrez Imamoglu *, Ryosuke Nakamura
[Paper]


Graphical abstract

Requirements

Before starting, install the following libraries. Note that the package versions might need to be changed depending on your system:

conda create -n mmsnet python=3.7
conda activate mmsnet
conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch
pip install -U scikit-learn
pip install opencv-python
pip install psutil
pip install h5py

Also, the source code path might need to be added to PYTHONPATH (e.g. export PYTHONPATH=$PYTHONPATH:/path_to_project/MMSNet/src/utils).
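
As a quick sanity check after installation, the following minimal snippet (not part of the repository) verifies that the main dependencies import and that CUDA is visible to PyTorch:

import torch
import torchvision
import cv2
import h5py
import sklearn
import psutil

# Report versions and GPU visibility for the environment created above.
print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("torchvision:", torchvision.__version__)
print("OpenCV:", cv2.__version__, "| h5py:", h5py.__version__)
print("scikit-learn:", sklearn.__version__, "| psutil:", psutil.__version__)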

Data Preparation

SUN RGB-D Scene

The SUN RGB-D Scene dataset is currently the largest real-world RGB-D indoor dataset. Download the dataset from here and keep the file structure as is after extracting the files. In addition, the allsplit.mat and SUNRGBDMeta.mat files need to be downloaded from the SUN RGB-D toolbox: allsplit.mat is under SUNRGBDtoolbox/traintestSUNRGBD and SUNRGBDMeta.mat is under SUNRGBDtoolbox/Metadata. Both files should be placed under the root folder of the SUN RGB-D dataset, e.g.:

SUNRGBD ROOT PATH
├── SUNRGBD
│   ├── kv1 ...
│   ├── kv2 ...
│   ├── realsense ...
│   ├── xtion ...
├── allsplit.mat
├── SUNRGBDMeta.mat
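
As an optional sanity check that both toolbox files are in place and readable, the short sketch below lists their top-level variable names using scipy (installed as a scikit-learn dependency). The path is a placeholder, and the assumption that both files load with scipy.io.loadmat is ours:

import os
import scipy.io as sio

sunrgbd_root = "/path/to/SUNRGBD_ROOT_PATH"  # placeholder; replace with your own path

for name in ("allsplit.mat", "SUNRGBDMeta.mat"):
    mat_path = os.path.join(sunrgbd_root, name)
    assert os.path.isfile(mat_path), f"missing {name} under the dataset root"
    data = sio.loadmat(mat_path)
    # Print the non-internal variable names stored in the .mat file.
    print(name, [k for k in data if not k.startswith("__")])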

The dataset is distributed in a complex hierarchy. Therefore, it is reorganized on the local system as follows:

python utils/organize_sunrgb_scene.py --dataset-path <SUNRGBD ROOT PATH>

This creates the train/eval splits and copies the RGB and depth files, together with the camera calibration parameter files for the depth data, into the corresponding split structure. Then, depth colorization is applied as below, which takes a couple of hours.

python utils/depth_colorize.py --dataset "sunrgbd" --dataset-path <SUNRGBD ROOT PATH>
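
For intuition only, the sketch below shows the general idea behind depth colorization: a single-channel depth map is normalized and mapped to a three-channel image with an OpenCV colormap. This is a simplified illustration with a hypothetical file name, not the exact scheme implemented in utils/depth_colorize.py:

import cv2
import numpy as np

# Read a 16-bit depth image as-is (hypothetical file name).
depth = cv2.imread("depth_sample.png", cv2.IMREAD_UNCHANGED).astype(np.float32)

# Scale valid (non-zero) depth values to 0-255.
valid = depth > 0
depth_u8 = np.zeros(depth.shape, dtype=np.uint8)
if valid.any():
    d = depth[valid]
    depth_u8[valid] = (255.0 * (d - d.min()) / max(float(d.max() - d.min()), 1e-6)).astype(np.uint8)

# Map the single channel to a 3-channel color image.
colorized = cv2.applyColorMap(depth_u8, cv2.COLORMAP_JET)
cv2.imwrite("depth_sample_colorized.png", colorized)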

NYUV2 RGB-D Scene

The NYUV2 RGB-D Scene dataset is available here. In addition, the splits.mat file needs to be downloaded from here, together with sceneTypes.txt from here. The dataset structure should look like this:

NYUV2 ROOT PATH
├── nyu_depth_v2_labeled.mat
├── splits.mat
├── sceneTypes.txt

Unlike the other datasets, NYUV2 is provided as a single MATLAB .mat file, nyu_depth_v2_labeled.mat. This work uses the provided in-painted depth maps and RGB images. To prepare the depth data offline, depth colorization can be applied as follows:

python utils/depth_colorize.py --dataset "nyuv2" --dataset-path <NYUV2 ROOT PATH>
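
Because nyu_depth_v2_labeled.mat is a MATLAB v7.3 (HDF5) file, it can be inspected directly with h5py. A minimal sketch, assuming the standard images and depths fields of the labeled NYUV2 release (the path is a placeholder):

import h5py
import numpy as np

with h5py.File("/path/to/NYUV2_ROOT_PATH/nyu_depth_v2_labeled.mat", "r") as f:
    print(list(f.keys()))              # e.g. 'images', 'depths', 'labels', ...
    rgb = np.array(f["images"][0])     # first RGB image
    depth = np.array(f["depths"][0])   # corresponding in-painted depth map
    # MATLAB stores arrays column-major, so axes may need transposing for HxWxC use.
    print(rgb.shape, depth.shape)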

Fukuoka RGB-D Scene

This work is the first in the literature to use the Fukuoka RGB-D Indoor Scene dataset for benchmarking. There are 6 categories: corridor, kitchen, lab, office, study room, and toilet (see the download links below). The files should be extracted into a parent folder (e.g. fukuoka). The dataset structure should look like this:

Fukuoka ROOT PATH
├── fukuoka
│   ├── corridors ...
│   ├── kitchens ...
│   ├── labs ...
│   ├── offices ...
│   ├── studyrooms ...
│   ├── toilets ...

The dataset is organized using the following command, which creates an eval-set under the root path:

python utils/organize_fukuoka_scene.py --dataset-path <Fukuoka ROOT PATH> 

Then, depth colorization is applied in the same way as for the other datasets.

python utils/depth_colorize.py --dataset "fukuoka" --dataset-path <Fukuoka ROOT PATH>

Evaluation

Trained Models

The trained models that produce the results in the paper are provided in the tree hierarchy below. Download the models to run the evaluation code. Note that we share the random weights used in the paper here. However, it is possible to generate new random weights using the parameter --reuse-randoms 0 (default 1); the results might then change slightly (higher or lower). We discuss the effect of randomness in our previous paper here. Note that this random modeling should be done during the training process, not only for evaluation, as a new random set naturally creates a new distribution.

ROOT PATH TO MODELS
├── models
│   ├── resnet101_sun_rgb_best_checkpoint.pth
│   ├── resnet101_sun_depth_best_checkpoint.pth
│   ├── sunrgbd_mms_best_checkpoint.pth
│   ├── nyuv2_mms_best_checkpoint.pth
│   ├── fukuoka_mms_best_checkpoint.pth
├── random_weights
│   ├── resnet101_reduction_random_weights.pkl
│   ├── resnet101_rnn_random_weights.pkl
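
The shared random-weight files use the .pkl extension, which suggests plain pickle files; the loading sketch below rests on that assumption (if they were instead written with torch.save, torch.load would be the appropriate call), and their internal structure is not documented here:

import os
import pickle

models_root = "/path/to/ROOT_PATH_TO_MODELS"  # placeholder; replace with your own path

for name in ("resnet101_reduction_random_weights.pkl", "resnet101_rnn_random_weights.pkl"):
    with open(os.path.join(models_root, "random_weights", name), "rb") as f:
        weights = pickle.load(f)
    # Printing the type (and keys, if it is a dict) is enough to confirm the file is intact.
    print(name, type(weights), list(weights.keys()) if isinstance(weights, dict) else "")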

Evaluation

After preparing the data and downloading the models, run the following commands to evaluate the models on SUN RGB-D, NYUV2, and Fukuoka RGB-D:

python eval_models.py --dataset "sunrgbd" --dataset-path <SUNRGBD ROOT PATH> --models-path <ROOT PATH TO MODELS>
python eval_models.py --dataset "nyuv2" --dataset-path <NYUV2 ROOT PATH> --models-path <ROOT PATH TO MODELS>
python eval_models.py --dataset "fukuoka" --dataset-path <Fukuoka ROOT PATH> --models-path <ROOT PATH TO MODELS>

Results

Multi-modal performance comparison of this work (MMSNet) with related methods on the SUN RGB-D, NYUV2 RGB-D, and Fukuoka RGB-D Scene datasets in terms of accuracy (%). * indicates additional use of large-scale data with multi-task training.

Method | Paper | SUN RGB-D | NYUV2 RGB-D | Fukuoka RGB-D
Places CNN-RBF SVM | NeurIPS’14 | 39.0 | - | -
SS-CNN-R6 | ICRA’16 | 41.3 | - | -
DMFF | CVPR’16 | 41.5 | - | -
Places CNN-RCNN | CVPR’16 | 48.1 | 63.9 | -
MSMM | IJCAI’17 | 52.3 | 66.7 | -
RGB-D-CNN | AAAI’17 | 52.4 | 65.8 | -
D-BCNN | RAS’17 | 55.5 | 64.1 | -
MDSI-CNN | TPAMI’18 | 45.2 | 50.1 | -
DF2Net | AAAI’18 | 54.6 | 65.4 | -
HP-CNN-T | Auton.’19 | 42.2 | - | -
LM-CNN | Cogn. Comput.’19 | 48.7 | - | -
RGB-D-OB | TIP’19 | 53.8 | 67.5 | -
Cross-Modal Graph | AAAI’19 | 55.1 | 67.4 | -
RAGC | ICCVW’19 | 42.1 | - | -
MAPNet | PR’19 | 56.2 | 67.7 | -
TRecgNet Aug | CVPR’19 | 56.7 | 69.2 | -
G-L-SOOR | TIP’20 | 55.5 | 67.4 | -
MSN | Neurocomp.’20 | 56.2 | 68.1 | -
CBCL | BMVC’20 | 59.5 | 70.9 | -
ASK | TIP’21 | 57.3 | 69.3 | -
2D-3D FusionNet | Inf. Fusion’21 | 58.6 | 75.1 | -
TRecgNet Aug | IJCV’21 | 59.8 | 71.8 | -
CNN-randRNN | CVIU’22 | 60.7 | 69.1 | 78.3
MMSNet | This work | 62.0 | 72.2 | 81.7
Omnivore * | CVPR’22 | 67.1 | 79.8 | -

We also share our LaTeX comparison tables together with the BibTeX file for SUN RGB-D and NYUV2 benchmarking (see the LaTeX directory). Feel free to use them.

Citation

If you find this work useful in your research, please cite the following papers:

@article{Caglayan2022MMSNet,
    title={MMSNet: Multi-Modal Scene Recognition Using Multi-Scale Encoded Features},
    journal = {Image and Vision Computing},
    volume = {122},
    pages = {104453},
    author={Ali Caglayan and Nevrez Imamoglu and Ryosuke Nakamura},
    doi = {https://doi.org/10.1016/j.imavis.2022.104453},
    year={2022}
}

@article{Caglayan2022CNNrandRNN,
    title={When CNNs meet random RNNs: Towards multi-level analysis for RGB-D object and scene recognition},
    journal = {Computer Vision and Image Understanding},
    author={Ali Caglayan and Nevrez Imamoglu and Ahmet Burak Can and Ryosuke Nakamura},
    volume = {217},
    pages = {103373},
    issn = {1077-3142},
    doi = {https://doi.org/10.1016/j.cviu.2022.103373},
    year={2022}
}

License

This project is released under the MIT License (see the LICENSE file for details).

Acknowledgment

This paper is based on the results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO).