Scuba-diving-gesture-recognition

Scuba diving gesture recognition using Mediapipe, cv2 and PyTorch

Inspiration

I have always been a huge fan of Minority Report and awaited the day when we could use gestures in our day-to-day lives. Then came Pranav Mistry with his SixthSense technology, which blew my mind. However, it was too hardware-focused.

Google came out with Mediapipe in 2019. I had just completed my Open Water and Advanced Open Water scuba certifications when I came across some cool animations on Facebook built with Mediapipe. A quick search led me to Nicholas Renotte's famous Sign Language video. I was super impressed by the processing of webcam images using cv2, so I implemented the approach for scuba diving signals using PyTorch.

Mediapipe:


Objective:

To train a model that captures a simple video feed from the webcam and categorizes the gestures shown by the user into one of five actions:

  1. Ok
  2. Stop
  3. Descend
  4. Not Ok
  5. Ascend
Contents:

I. Data capture

• Test the camera and the Mediapipe library (to check that the lighting / setup is adequate and to get the camera's fps for calculating the sequence length)
• Capture data from the webcam for each action, i.e. for 5 actions, gather 20 samples, each of which is a 1-second video (30 frames); see the capture sketch below
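
A minimal sketch of such a capture loop, assuming the Mediapipe Hands solution (21 landmarks x 3 coordinates = 63 values per frame); the sequence length constant, window name and extract_keypoints helper are illustrative rather than the repository's exact code:

```python
import cv2
import numpy as np
import mediapipe as mp

mp_hands = mp.solutions.hands
SEQUENCE_LENGTH = 30          # frames per sample (roughly 1 second at 30 fps)

def extract_keypoints(results):
    """Flatten the first detected hand's 21 landmarks into a 63-value vector."""
    if results.multi_hand_landmarks:
        hand = results.multi_hand_landmarks[0]
        return np.array([[lm.x, lm.y, lm.z] for lm in hand.landmark]).flatten()
    return np.zeros(21 * 3)

cap = cv2.VideoCapture(0)
frames = []
with mp_hands.Hands(min_detection_confidence=0.5,
                    min_tracking_confidence=0.5) as hands:
    while len(frames) < SEQUENCE_LENGTH:
        ok, frame = cap.read()
        if not ok:
            break
        # Mediapipe expects RGB; OpenCV delivers BGR
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        frames.append(extract_keypoints(results))
        cv2.imshow('capture', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break
cap.release()
cv2.destroyAllWindows()

# One numpy file per frame, e.g. np.save(f'data/ok/0/{i}.npy', frames[i])
```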

II. Data Processing

• Convert the 1500 numpy files of gestures, each of 63 points, into a 150 x 30 x 63 tensor (see the sketch below)
• One-hot encode the labels and save both files
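
A minimal sketch of this consolidation step, assuming one .npy file per frame laid out as data/<action>/<sample>/<frame>.npy; the directory layout and the output file names X.pt / y.pt are assumptions for illustration:

```python
import os
import numpy as np
import torch

ACTIONS = ['ok', 'stop', 'descend', 'not_ok', 'ascend']
SEQUENCE_LENGTH = 30
DATA_DIR = 'data'

sequences, labels = [], []
for label_idx, action in enumerate(ACTIONS):
    action_dir = os.path.join(DATA_DIR, action)
    for sample in sorted(os.listdir(action_dir)):
        window = [np.load(os.path.join(action_dir, sample, f'{frame}.npy'))
                  for frame in range(SEQUENCE_LENGTH)]
        sequences.append(window)
        labels.append(label_idx)

X = torch.tensor(np.array(sequences), dtype=torch.float32)   # (samples, 30, 63)
y = torch.nn.functional.one_hot(torch.tensor(labels),
                                num_classes=len(ACTIONS)).float()

# Sanity check: which index each action maps to in the one-hot encoding
print({action: idx for idx, action in enumerate(ACTIONS)})

torch.save(X, 'X.pt')
torch.save(y, 'y.pt')
```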

III. Model training

• Import the model architecture and train the model on a single batch of 142 samples (after the train/test split); a training sketch follows below
• Validate the model on the test clips
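
The repository imports its own model architecture; the LSTM classifier below is only an assumed stand-in that matches the input shape (batch, 30 frames, 63 keypoints) and the 5 output classes, trained as a single batch as described above. The tensor and checkpoint file names carry over from the previous sketch and are likewise assumptions:

```python
import torch
import torch.nn as nn

class GestureLSTM(nn.Module):
    def __init__(self, n_features=63, hidden=64, n_classes=5):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, n_classes)

    def forward(self, x):                  # x: (batch, 30, 63)
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :])      # logits from the last time step

X, y = torch.load('X.pt'), torch.load('y.pt')

# Simple split: holding out 8 clips leaves 142 training samples out of 150
perm = torch.randperm(len(X))
test_idx, train_idx = perm[:8], perm[8:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

model = GestureLSTM()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for epoch in range(200):
    optimizer.zero_grad()
    loss = criterion(model(X_train), y_train.argmax(dim=1))   # one full batch per step
    loss.backward()
    optimizer.step()

model.eval()
with torch.no_grad():
    acc = (model(X_test).argmax(1) == y_test.argmax(1)).float().mean()
print(f'test accuracy: {acc:.2f}')
torch.save(model.state_dict(), 'gesture_lstm.pt')
```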

IV. Testing on live data

• Test the model on the live webcam feed (see the inference sketch below)
• Process and store the rendered output
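
A minimal sketch of the live-inference loop, reusing extract_keypoints() and GestureLSTM from the sketches above; the checkpoint name, output video name and overlay styling are assumptions:

```python
from collections import deque

import cv2
import numpy as np
import torch
import mediapipe as mp

ACTIONS = ['Ok', 'Stop', 'Descend', 'Not Ok', 'Ascend']

model = GestureLSTM()
model.load_state_dict(torch.load('gesture_lstm.pt'))
model.eval()

window = deque(maxlen=30)          # rolling 1-second window of keypoint vectors
cap = cv2.VideoCapture(0)
size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
        int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
writer = cv2.VideoWriter('render.mp4', cv2.VideoWriter_fourcc(*'mp4v'), 30, size)

with mp.solutions.hands.Hands() as hands:
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        window.append(extract_keypoints(results))
        if len(window) == 30:
            with torch.no_grad():
                x = torch.tensor(np.array(window), dtype=torch.float32).unsqueeze(0)
                label = ACTIONS[model(x).argmax(1).item()]
            cv2.putText(frame, label, (20, 40),
                        cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2)
        writer.write(frame)            # store the render
        cv2.imshow('inference', frame)
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

cap.release()
writer.release()
cv2.destroyAllWindows()
```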

Key learnings:

• Mediapipe landmarks can be significantly impacted by lighting, both at the time of data collection and at the time of inference.
• If you're not careful while consolidating the various frames for your input dataset, the order of labels can get scrambled. After completing your one-hot encoding, run a sample check for each class to confirm its index in the encoding.
• More samples! I could record only 150 samples across 5 different action classes.
• Stability over precision. Video processing has the annoying property that the predicted class changes rapidly from frame to frame. To avoid this, the model takes about 3-4 frames to stabilize its prediction class, so you may see a bit of jitter in the displayed result before it quickly settles (a sketch of this idea follows below). I had the same issue in my [object classification project](https://github.com/SwamiKannan/Formula1-car-detection).
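
A minimal sketch of that stabilization idea: only switch the displayed class once the same prediction has been seen on a few consecutive frames. The class name and the 4-frame threshold are assumptions:

```python
from collections import deque

class StablePrediction:
    """Hold the displayed class until a new prediction repeats for `patience` frames."""

    def __init__(self, patience=4):
        self.recent = deque(maxlen=patience)
        self.current = None

    def update(self, predicted_class):
        self.recent.append(predicted_class)
        if len(self.recent) == self.recent.maxlen and len(set(self.recent)) == 1:
            self.current = predicted_class
        return self.current

# Usage inside the inference loop:
#   stable = StablePrediction()
#   label = stable.update(ACTIONS[model(x).argmax(1).item()])
```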

Sample video


Gestures will take a second to align to the correct label.

Image credit for cover image: Rooster Teeth
