This repository contains an implementation of our proposed algorithm for grasp detection in dense clutter. The algorithm consists of three steps: instance segmentation, view-based experience transfer and optimal grasp determination.
- Instance Segmentation - Mask R-CNN is adopted to segment easy-to-grasp objects from a cluttered scene.
- View-based Experience Transfer - A Denoise Autoencoder is used to estimate the corresponding view of each segmented object, so that grasp experiences can be transferred onto the cluttered scene.
- Optimal Grasp Determination - Collision-free grasp candidates are evaluated and the optimal grasp is selected for execution.
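As a rough illustration, the three steps above can be organized as in the Python sketch below; the callables passed in are hypothetical stand-ins, not the actual routines in detection_algorithm.py.

```python
# Hedged sketch of the three-step pipeline. All callables passed in are
# hypothetical placeholders for the real routines in detection_algorithm.py.
def detect_grasps(rgb, depth, segment_instances, estimate_view,
                  transfer_grasp_experiences, select_optimal_grasp):
    masks = segment_instances(rgb, depth)              # 1. Mask R-CNN instance segmentation
    candidates = []
    for mask in masks:                                 # 2. view estimation + experience transfer
        view = estimate_view(rgb, depth, mask)
        candidates.extend(transfer_grasp_experiences(view, mask))
    return select_optimal_grasp(candidates, depth)     # 3. optimal grasp determination
```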
The system, consisting of a six-axis robot arm with a two-jaw parallel gripper and a Kinect V2 RGB-D camera, is used to evaluate the success rate of grasping in dense clutter. Grasping experiments on cluttered metal parts show a success rate of about 94%.
Demonstration of the hand-eye system and the algorithm
Demonstration of two types of grasping methods
For more information about our approach, please check out our summary video and our paper:
Jen-Wei Wang and Jyh-Jone Lee
If you have any questions, please email Jen-Wei Wang
To run this code, please navigate to the algorithm directory:
cd algorithm
This code was developed with Python 3.5 on Ubuntu 16.04 with an NVIDIA GTX 1080 Ti GPU. Python requirements can be installed by:
pip install -r requirements.txt
There are two pre-trained models:
- The Mask R-CNN model can be downloaded here
- The Denoise Autoencoder model is included as three checkpoint files named chkpt-80000 (see the loading sketch below).
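The three files correspond to a standard TensorFlow checkpoint. A minimal loading sketch, assuming a TensorFlow 1.x environment and that the files sit in the working directory:

```python
import tensorflow as tf  # assumes TensorFlow 1.x, matching the original Autoencoder code

# Rebuild the graph from chkpt-80000.meta and restore the weights stored in
# the chkpt-80000 index/data files; the path used here is an assumption.
saver = tf.train.import_meta_graph("chkpt-80000.meta")
with tf.Session() as sess:
    saver.restore(sess, "chkpt-80000")
```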
Testing images are provided in test_images. Run our code on the testing images:
python detection_algorithm.py --rgb=./test_images/rgb.png --depth=./test_images/depth.png
Testing results will be saved in test_images.
Clutter Scene | Segmentation | Collision-Free Grasps | Optimal Grasps |
---|---|---|---|
config.yaml contains some parameters that can be adjusted.
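For reference, the file can be read with PyYAML as sketched below; the actual parameter names depend on config.yaml and are not listed here.

```python
import yaml

# Minimal example of reading the adjustable parameters from config.yaml;
# the key names it contains are not reproduced here.
with open("config.yaml") as f:
    params = yaml.safe_load(f)
print(params)
```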
The code for Mask R-CNN is based on the repository implemented by matterport.
The annotated dataset is provided here. We improved the data-collection and labeling process; the details are shown in the following figure and in our paper.
To prevent over-fitting, the revisions are (a minimal sketch follows this list):
- Use a pre-trained ResNet-50 as the backbone
- Fine-tune only the parameters outside the backbone
- Reduce the number of anchor types from 5 to 3
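The sketch below shows how the first two revisions can be expressed with the Matterport Mask R-CNN API (the third revision, reducing the anchor types, touches the model code itself and is not shown); the class name, class count, weight file, dataset objects, and epoch count are assumptions for illustration, not the exact settings from our paper.

```python
from mrcnn.config import Config
import mrcnn.model as modellib

# Hedged sketch using the Matterport Mask R-CNN API; the names and values
# below are illustrative assumptions, not our exact training settings.
class PartsConfig(Config):
    NAME = "metal_parts"
    NUM_CLASSES = 1 + 1        # background + graspable object (assumed)
    BACKBONE = "resnet50"      # revision 1: pre-trained ResNet-50 backbone

config = PartsConfig()
model = modellib.MaskRCNN(mode="training", config=config, model_dir="./logs")
model.load_weights("mask_rcnn_coco.h5", by_name=True,
                   exclude=["mrcnn_class_logits", "mrcnn_bbox_fc",
                            "mrcnn_bbox", "mrcnn_mask"])

# revision 2: layers="heads" fine-tunes all layers except the backbone
train_set = val_set = None  # placeholders: replace with prepared mrcnn Dataset objects
model.train(train_set, val_set, learning_rate=config.LEARNING_RATE,
            epochs=30, layers="heads")
```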
The mAP results for RGB and RGB-D inputs are 0.901 and 0.924, respectively.
The code for the Denoise Autoencoder is based on the repository implemented by DLR-RM. To estimate views more accurately, we redefine the loss function as an L2 loss plus a perceptual loss. The details are shown in the following figure and in our paper.
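A minimal sketch of this combined objective is shown below, assuming a TensorFlow 1.x graph; `vgg16_features` stands in for a feature extractor built from the pre-trained VGG-16 model mentioned in the steps below, and the weight `lambda_p` is an illustrative assumption.

```python
import tensorflow as tf

# Hedged sketch of the combined objective: an L2 reconstruction loss plus a
# perceptual loss computed on feature maps from a pre-trained VGG-16.
# `vgg16_features` is a stand-in feature extractor and `lambda_p` an
# illustrative weighting factor, both assumptions for this sketch.
def reconstruction_loss(reconstruction, target, vgg16_features, lambda_p=1.0):
    l2_loss = tf.reduce_mean(tf.square(reconstruction - target))
    perceptual_loss = tf.reduce_mean(
        tf.square(vgg16_features(reconstruction) - vgg16_features(target)))
    return l2_loss + lambda_p * perceptual_loss
```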
The object views and their corresponding grasp experiences are provided here.
- Download the pre-trained VGG-16 model here and put the model in the directory.
- Put our provided files from the denoise_ae folder into the same directory.
- Start training with the same process explained in the repository.
The recall of pose estimation on the T-LESS dataset is about 50.31.