
πŸ† πŸ† Top-1 Submission to CORSMAL Challenge 2020 (at ICPR). The winning solution for the CORSMAL Challenge (on Intelligent Sensing Summer School 2020)


The CORSMAL Challenge 2020 Submission

We also wrote a Tech Report πŸ“–. Feel free to check it out to learn more about the implementation details.

Also, here is the video presentation from the conference on YouTube πŸŽ₯.

Overview πŸ”Ž

The design of an individual model for each sub-task: container filling level classification (left), filling type classification (middle), and container capacity estimation (right).

CORSMAL Challenge 2020 Submission (Because It's Tactile)

For filling level estimation we rely on both modalities: RGB (R(2+1)D features) and audio ("classical" and VGGish features). In particular, we encode the streams from the cameras placed at different viewpoints with individual GRU models, sum their logits, and apply softmax at the output. Next, VGGish features extracted from the audio stream are processed by another GRU model, which outputs class probabilities. The "classical" audio features are classified with the random forest algorithm. Finally, the probabilities from all three models are averaged to produce the final prediction for the filling level.
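As a minimal sketch of this late fusion step (the function name, tensor shapes, and the use of PyTorch here are our illustrative assumptions, not the repo's exact code):

# Minimal late-fusion sketch for the filling level task
# (illustrative only; names and shapes are assumptions, not this repo's code).
import torch

def fuse_filling_level(rgb_logits_per_cam, vggish_probs, rf_probs):
    # rgb_logits_per_cam: list of (num_classes,) logit tensors, one per camera view
    # vggish_probs, rf_probs: (num_classes,) probability tensors from the other two models
    rgb_probs = torch.softmax(sum(rgb_logits_per_cam), dim=0)  # sum logits over views, then softmax
    return (rgb_probs + vggish_probs + rf_probs) / 3           # average the three predictions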

The procedure for filling type classification resembles the one for filling level, except that the RGB stream is not used.

The capacity estimation pipeline starts with the extraction of two frames from two camera views (left and right): the first frame and the 20th frame from the end. Both frames are passed through Mask R-CNN to detect the container. The detected bounding boxes are used to crop both frames, and the crops are sent to the LoDE model, which outputs an estimate of the container capacity. If the container is not detected in one of the frames, we fall back to the training prior.
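Schematically, the fallback logic looks like this (detect_container, lode_capacity, and the prior value are hypothetical stand-ins, not the repo's actual API):

# Sketch of the capacity pipeline with its training-prior fallback
# (function names and the prior value are illustrative stand-ins).
TRAIN_PRIOR_ML = 1000.0  # placeholder; the real prior comes from the training set

def detect_container(frame):
    """Stub: run Mask R-CNN and return a bounding box (x0, y0, x1, y1), or None."""
    raise NotImplementedError

def lode_capacity(crop_left, crop_right):
    """Stub: run LoDE on the two cropped views and return the capacity estimate."""
    raise NotImplementedError

def estimate_capacity(frame_left, frame_right):
    boxes = [detect_container(f) for f in (frame_left, frame_right)]
    if any(b is None for b in boxes):  # container missed in at least one view
        return TRAIN_PRIOR_ML          # fall back to the training prior
    crops = [f[b[1]:b[3], b[0]:b[2]]   # crop each frame to its detected box
             for f, b in zip((frame_left, frame_right), boxes)]
    return lode_capacity(*crops)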

Hungry for more details? Check out our Tech Report πŸ“–.

Team πŸ‘‹

Vladimir Iashin, Francesca Palermo, Gökhan Solak, and Claudio Coppola

Task βš”οΈ

The CORSMAL challenge focuses on estimating the weight of containers, which depends on the presence, amount, and type of filling, in addition to the container capacity. Participants must determine the physical properties of a container while it is manipulated by a human, when neither the containers nor the fillings are known a priori.

Technically, the main task is to estimate the overall filling mass. This quantity can be estimated by solving three sub-tasks (a sketch of how they combine follows the list):

  • Container capacity estimation (any positive number)
  • Filling type classification (boxes: pasta, rice; glasses/cups: water, pasta, rice; or nothing (empty))
  • Filling level classification (0, 50, or 90%)
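These three outputs combine into the filling mass roughly as follows (a sketch under our assumptions; the density values are illustrative placeholders, not the official annotations):

# How the three sub-task outputs combine into the filling mass
# (density values below are illustrative, not the official ones).
DENSITY_G_PER_ML = {'water': 1.0, 'rice': 0.85, 'pasta': 0.41}

def filling_mass(capacity_ml, filling_type, filling_level_pct):
    if filling_type is None:  # empty container
        return 0.0
    # mass [g] = capacity [mL] * level [fraction] * density [g/mL]
    return capacity_ml * (filling_level_pct / 100.0) * DENSITY_G_PER_ML[filling_type]

print(filling_mass(500.0, 'rice', 50))  # e.g. a 500 mL container half-filled with rice -> 212.5 g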

Dataset πŸ₯€πŸ“˜πŸ₯›

  • Download the CORSMAL dataset here
  • The dataset consists of 15 containers: 5 drinking cups, 5 glasses, and 5 food boxes. These containers are made of different materials, such as plastic, glass, and paper.
  • A container can be filled with water (glasses and cups only), rice, or pasta at 3 different levels: 0, 50, or 90% of the capacity of the container.
  • All combinations of containers and fillings are executed by a different subject (12) for each background (2) and illumination condition (2). The total number of configurations is 1140 (one possible accounting is sketched after this list).
  • Each event in the dataset is acquired with several sensors, making the CORSMAL dataset multi-modal:
    • 4 cameras (1280x720@30Hz), each providing:
      • RGB,
      • narrow-baseline stereo infrared,
      • depth images,
      • inertial measurements (IMU),
      • calibration;
    • 8-channel 44.1 kHz audio.
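One accounting that reproduces the 1140 figure referenced in the list above (our reading of the combinatorics; verify against the official dataset description):

# One possible accounting of the 1140 configurations (our reading).
cups_glasses = 10 * (3 * 2 + 1)   # 10 containers x (3 filling types x 2 non-empty levels + empty) = 70
boxes = 5 * (2 * 2 + 1)           # 5 boxes x (2 filling types x 2 non-empty levels + empty) = 25
scenarios = cups_glasses + boxes  # 95 container-filling scenarios
print(scenarios * 12)             # x 12 executions (subjects/conditions) -> 1140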

Code organization πŸ“

Each folder in this repo corresponds to a dedicated task. Filling level and filling type each have three different approaches, and, hence, each approach has an individual folder.

  • /capacity
  • /filling_level
    • /CORSMAL-pyAudioAnalysis
    • /r21d_rgb
    • /vggish
    • README.md
  • /filling_type
    • /CORSMAL-pyAudioAnalysis
    • /r21d_rgb
    • /vggish
    • README.md

How to run evaluation ❔

We recommend using our Docker image to run the evaluation script. (To experiment with different approaches, inspect the individual README files inside the task folders.) Alternatively, to run it without Docker, install the environment using the commands from the Dockerfile (ignore the first 11 lines) and execute the run_eval.sh script.

Running the script takes under 4 hours on a 10-core i9-7900X, 4x1080Ti GPUs (one GPU is enough), 64 GB RAM (20 GB at minimum; the more GPUs you use, the more RAM is allocated), and 40 GB of extra disk space (besides the dataset).

Install Docker (19.03.13) and run our script:

# pull our image from the docker hub
docker pull iashin/corsmal:latest

# source: the path to the dir with CORSMAL on the host; destination: the path where the source folder will be mounted
# by default, the container runs the evaluation script;
# if you would like to attach a shell (and debug), just append `/bin/bash` to the command below
# also, the script works even with just one GPU, but if you would like to speed up the 1st part, try more
docker run \
    --mount type=bind,source=/path/to/corsmal/,destination=/home/ubuntu/CORSMAL/dataset/ \
    -it --gpus '"device=0,1,2,3"' \
    iashin/corsmal:latest

# copy submission files from the container once it finishes running the script
docker cp container_id:/home/ubuntu/CORSMAL/submission_public_test.csv .
docker cp container_id:/home/ubuntu/CORSMAL/submission_private_test.csv .

The expected structure of the DATA_ROOT folder (note: the folder structure of the dataset changed in later editions of the challenge); a hedged filename-parsing sketch follows the tree:

DATA_ROOT
β”œβ”€β”€ [1-9]
β”‚   β”œβ”€β”€ audios
β”‚   β”‚   └── sS_fiI_fuU_bB_lL_audio.wav
β”‚   β”œβ”€β”€ calib
β”‚   β”‚   └── sS_fiI_fuU_bB_lL_cC_calib.pickle
β”‚   β”œβ”€β”€ depth
β”‚   β”‚   └── sS_fiI_fuU_bB_lL
β”‚   β”œβ”€β”€ ir
β”‚   β”‚   └── sS_fiI_fuU_bB_lL_cC_irR.mp4
β”‚   └── rgb
β”‚       └── sS_fiI_fuU_bB_lL_cC.mp4
└── [10-15]
    β”œβ”€β”€ audio
    β”‚   └── XXXX
    β”œβ”€β”€ calib
    β”‚   └── XXXX_cC_calib.pickle
    β”œβ”€β”€ depth
    β”‚   └── XXXX
    β”œβ”€β”€ ir
    β”‚   └── XXXX_cC_irR.mp4
    └── rgb
        └── XXXX_cC.mp4
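For reference, here is our reading of the filename pattern for containers 1-9 (the field meanings are our interpretation: s=subject, fi=filling type, fu=fullness/level, b=background, l=lighting, c=camera; verify against the official dataset docs):

# Hedged parse of the sS_fiI_fuU_bB_lL_cC.mp4 pattern (containers 1-9).
# Field meanings are our interpretation, not an official specification.
import re

RGB_PATTERN = re.compile(r's(\d+)_fi(\d+)_fu(\d+)_b(\d+)_l(\d+)_c(\d+)\.mp4')

def parse_rgb_name(name):
    match = RGB_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f'unexpected filename: {name}')
    keys = ('subject', 'filling_type', 'fullness', 'background', 'lighting', 'camera')
    return dict(zip(keys, map(int, match.groups())))

print(parse_rgb_name('s1_fi2_fu1_b0_l1_c1.mp4'))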

Please note: we took great care to make our results reproducible by fixing the seeds and sharing the pre-trained models and package versions. However, training on different hardware might give you slightly different results; we observed that the change is 🀏 <0.01.
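Seed fixing along these lines is standard practice (a generic sketch, not the repo's exact code):

# Generic seed-fixing sketch (not the repo's exact code).
import random
import numpy as np
import torch

def fix_seeds(seed=1337):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # full determinism also depends on cuDNN settings and the hardware itself
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False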

How to train a new model? πŸš€

Please follow the instructions provided in filling_level and filling_type. Note that we do not use any training for the capacity sub-task, but you can still adapt the code to apply it to your dataset.

License πŸ‘©β€βš–οΈ

We distribute our code under the MIT license. However, our code relies on libraries that have different licenses:

  • LoDE (container capacity): Creative Commons Attribution-NonCommercial 4.0
  • pyAudioAnalysis (filling level and type): Apache-2.0 License
  • video_features (filling level and type): check here

BibTeX πŸ™‚

@misc{CORSMAL_Iashin_2020,
      title={Top-1 CORSMAL Challenge 2020 Submission: Filling Mass Estimation Using Multi-modal Observations of Human-robot Handovers},
      author={Vladimir Iashin and Francesca Palermo and G\"okhan Solak and Claudio Coppola},
      year={2020},
      eprint={2012.01311},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
