
hand-gesture-recognition

This is a multi-task neural network that uses a GELAN backbone and a ViT encoder to perform gesture classification and hand pose estimation jointly, maximizing multi-class classification performance. The model is used in the hand gesture recognition system on the MeCO robot at the Interactive Robotics and Vision Laboratory, University of Minnesota.

[demo]

Introduction

[model architecture]

The hand gesture recognition system consists of two main parts: hand detection and gesture classification. First, the YOLOv7-tiny detector locates the regions where hands are present in the input image. These regions are then cropped from the image and classified by a multi-task network, whose structure is shown above.

Our multi-task network leverages enriched features to attain high classification performance with fewer model parameters. We achieve this by training the network jointly on classification and pose estimation, with pose estimation serving as an auxiliary task that is not used in the deployed application. The network first generates dense features from the input image using a CNN backbone. These features are combined with a learnable class embedding and fed into a ViT encoder. The class embedding is then separated from the encoded features and classified by a linear layer. To further enhance the features learned by the transformer, the remaining features are decoded by a simple decoder to generate hand poses.
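
To make the data flow concrete, here is a minimal PyTorch sketch of this design. It is a hypothetical illustration, not the actual implementation: the backbone, token dimension, encoder depth, decoder layers, and the omission of positional embeddings are all simplifying assumptions.

    import torch
    import torch.nn as nn

    class MultiTaskGestureNet(nn.Module):
        """Sketch: CNN backbone -> ViT encoder with a class token -> linear
        classifier, plus a small decoder that predicts pose heatmaps."""

        def __init__(self, backbone, embed_dim=256, num_classes=19, num_keypoints=21):
            super().__init__()
            self.backbone = backbone                     # CNN (e.g. GELAN-style) returning (B, C, H, W)
            self.proj = nn.LazyConv2d(embed_dim, 1)      # project backbone channels to the token dimension
            self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
            layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=4)
            self.cls_head = nn.Linear(embed_dim, num_classes)
            self.pose_decoder = nn.Sequential(           # "simple decoder": deconvs -> one heatmap per keypoint
                nn.ConvTranspose2d(embed_dim, 128, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(128, num_keypoints, 4, stride=2, padding=1),
            )

        def forward(self, x):
            feat = self.proj(self.backbone(x))           # dense features (B, D, H, W)
            b, d, h, w = feat.shape
            tokens = feat.flatten(2).transpose(1, 2)     # (B, H*W, D)
            tokens = torch.cat([self.cls_token.expand(b, -1, -1), tokens], dim=1)
            tokens = self.encoder(tokens)
            logits = self.cls_head(tokens[:, 0])         # class token -> gesture logits
            dense = tokens[:, 1:].transpose(1, 2).reshape(b, d, h, w)
            heatmaps = self.pose_decoder(dense)          # auxiliary hand pose heatmaps
            return logits, heatmaps

During training, a classification loss on the logits and a heatmap loss on the pose output would be optimized jointly; at deployment only the logits are used.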

Dataset

We trained our model on the large-scale hand gesture dataset HaGRID - HAnd Gesture Recognition Image Dataset. We included the no_gesture class, resulting in 19 classes in total.

Results

[confusion matrix]

These results were evaluated on the HaGRID test set.

Usage

  1. Environment

    Build the Docker image

    docker build -t hand-gesture:latest docker/

    Run container

    docker run -it --rm --gpus all --ipc=host --ulimit memlock=-1 --network="host" \
        -e DISPLAY=$DISPLAY \
        -v /tmp/.X11-unix:/tmp/.X11-unix \
        -v $HOME/.Xauthority:/root/.Xauthority \
        -v <PATH_OF_THE_REPOSITORY>:/workspace \
        hand-gesture:latest
  2. Dataset
    Download images and annotations from HaGRID - HAnd Gesture Recognition Image Dataset.

    We use MediaPipe to generate hand pose annotations as ground truth for supervision (a minimal sketch of this step is shown after this list). In addition, to reduce training time and save CPU resources, we crop hand regions from the original images. Use the following command to extract hand images and annotations for training:

    python extract_data.py --root_dir <DOWNLOADED_HAGRID_DATA>

    (Optional) Run display_data.py to check if data are loaded correctly.

    python display_data.py
  3. Train

    python train.py \
        --data_config configs/hagrid.yaml \
        --suffix best \
        --batch_size 32 \
        --num_workers 8 \
        --epochs 40 \
        --lr 0.0001 \
        --lr_step 30 \
        --image_size 192 192
  4. Export
    Export PyTorch model to ONNX

    python export.py \
        --data_config configs/hagrid.yaml \
        --image_size 192 192 \
        --weight_path <YOUR_MODEL_WEIGHT_PATH>

    You can also download the model we trained for the demo.

  5. Inference
    For inference, we use YOLOv7-tiny to detect hand regions in the full image. The detector is trained on data collected by the Interactive Robotics and Vision Laboratory at the University of Minnesota. After hand regions are extracted from the image, they are classified by the multi-task model (a minimal ONNX Runtime sketch of the classification step is shown after this list). The inference data path should point to either a video file or a folder containing image files.

    python detect.py \
        --data_config configs/hagrid.yaml \
        --cls_weight gesture-classifier.onnx \
        --det_weight yolov7-tiny-diver.onnx \
        --data_path <YOUR_TEST_DATA>
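
Below is a minimal sketch of how MediaPipe Hands can produce the hand pose annotations mentioned in step 2. It is a hypothetical illustration, not the contents of extract_data.py; the actual script may preprocess and store the landmarks differently.

    import cv2
    import mediapipe as mp

    # Static-image mode, at most one hand per crop (assumptions for this sketch).
    hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=1)

    def hand_landmarks(image_path):
        """Return 21 (x, y) landmarks normalized to [0, 1], or None if no hand is found."""
        image = cv2.imread(image_path)
        result = hands.process(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
        if not result.multi_hand_landmarks:
            return None
        return [(p.x, p.y) for p in result.multi_hand_landmarks[0].landmark]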
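
And here is a minimal sketch of the classification step from step 5 using ONNX Runtime. The input size, normalization, and output layout are assumptions; detect.py implements the full detect-crop-classify pipeline.

    import cv2
    import numpy as np
    import onnxruntime as ort

    session = ort.InferenceSession("gesture-classifier.onnx")
    input_name = session.get_inputs()[0].name

    def classify_crop(bgr_crop, size=(192, 192)):
        """Classify one cropped hand region; returns the predicted class index."""
        img = cv2.resize(bgr_crop, size)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
        img = np.transpose(img, (2, 0, 1))[None]   # NCHW, batch of 1
        outputs = session.run(None, {input_name: img})
        return int(np.argmax(outputs[0], axis=1)[0])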

References

  • Y. Xu, J. Zhang, Q. Zhang, and D. Tao, ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation, arXiv preprint arXiv:2204.12484, 2022.
  • A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR, 2021.
  • A. Kapitanov, K. Kvanchiani, A. Nagaev, R. Kraynov, and A. Makhliarchuk, HaGRID - HAnd Gesture Recognition Image Dataset, in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 4572–4581, January 2024.
  • B. Xiao, H. Wu, and Y. Wei, Simple Baselines for Human Pose Estimation and Tracking, in European Conference on Computer Vision (ECCV), 2018.
