
Sign Language Recognition

  • This prototype "understands" the sign language of deaf people
  • Includes all code to prepare the data (e.g. from the ChaLearn dataset), extract features, train the neural network, and predict signs during a live demo
  • Based on deep learning techniques, in particular convolutional neural networks (including a state-of-the-art 3D model) and recurrent neural networks (LSTM)
  • Built with Python, Keras + TensorFlow, and OpenCV (for video capture and manipulation)

For a 10-slide presentation and a 1-minute demo video, see here.

Requirements

This code requires at least:

  • python 3.6.5
  • tensorflow 1.8.0
  • keras 2.2.0
  • opencv-python 3.4.1.15

Training the neural networks requires a GPU (e.g. an AWS p2.xlarge instance). The live demo runs on an ordinary laptop without a GPU, e.g. a MacBook Pro (i5, 8 GB).

Get the video data

See here for an overview of suitable sign-language datasets for deaf people: https://docs.google.com/presentation/d/1KSgJM4jUusDoBsyTuJzTsLIoxWyv6fbBzojI38xYXsc/edit#slide=id.g3d447e7409_0_0

Download the ChaLearn Isolated Gesture Recognition dataset here: http://chalearnlap.cvc.uab.es/dataset/21/description/ (registration required).

The ChaLearn video descriptions and labels (for the train, validation, and test sets) can be found in data_set/chalearn.

prepare_chalearn.py is used to unzip the videos and sort them into the folder structure expected by Keras (best practice: 1 folder = 1 label).
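
A rough sketch of this sorting step (hypothetical: `sort_videos_by_label`, the CSV layout, and all paths are illustrative assumptions, not the repo's actual code):

```python
import os
import shutil

def sort_videos_by_label(video_dir, labels_csv, out_dir):
    # Move each video into one sub-folder per label (Keras convention:
    # 1 folder = 1 label). Assumes a CSV with "video_filename,label" rows,
    # similar to the ChaLearn description files.
    with open(labels_csv) as f:
        for line in f:
            filename, label = line.strip().split(",")[:2]
            label_dir = os.path.join(out_dir, label)
            os.makedirs(label_dir, exist_ok=True)
            shutil.move(os.path.join(video_dir, filename),
                        os.path.join(label_dir, filename))
```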

Prepare the video data

Extract image frames from videos

frame.py extracts image frames from each video (using OpenCV) and stores them on disk; a minimal sketch of this step follows the parameter list below.

See pipeline_i3d.py for the parameters used for the ChaLearn dataset:

  • 40 frames per training/test video (average duration approx. 5 seconds, i.e. about 8 frames per second)
  • Frames are resized/cropped to 240x320 pixels
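
A minimal sketch of such a frame-extraction step with OpenCV, assuming the parameters above (the function name and the even-sampling strategy are illustrative, not necessarily what frame.py does):

```python
import cv2
import numpy as np

def extract_frames(video_path, num_frames=40, height=240, width=320):
    # Decode the whole video with OpenCV, then sample `num_frames`
    # evenly spaced frames, each resized to height x width pixels.
    capture = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (width, height)))  # cv2 wants (w, h)
    capture.release()
    if not frames:
        raise ValueError("could not read any frames from %s" % video_path)
    indices = np.linspace(0, len(frames) - 1, num_frames).astype(int)
    return np.stack([frames[i] for i in indices])  # (num_frames, h, w, 3)
```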

Calculate optical flow

opticalflow.py calculates the optical flow from the image frames of a video and stores it on disk. See pipeline_i3d.py for usage.

Optical flow is very effective for this type of video classification, but it is also computationally expensive; see here.
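
A minimal sketch of dense optical-flow computation with OpenCV's Farneback algorithm (the parameter values are common illustrative defaults; opticalflow.py may use a different algorithm or settings):

```python
import cv2
import numpy as np

def frames_to_flow(frames):
    # Dense optical flow (Farneback) between consecutive grayscale frames.
    # Returns a (len(frames) - 1, height, width, 2) float32 array holding
    # the x/y displacement of every pixel.
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames]
    flows = []
    for prev, curr in zip(grays[:-1], grays[1:]):
        # positional args: flow=None, pyr_scale, levels, winsize,
        # iterations, poly_n, poly_sigma, flags
        flows.append(cv2.calcOpticalFlowFarneback(
            prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0))
    return np.stack(flows).astype(np.float32)
```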

Train the neural network

train_i3d.py trains the neural network. First, only the randomly initialized top layers are trained; then the entire pre-trained network is fine-tuned.
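
A minimal sketch of this two-phase scheme in Keras (hypothetical: `num_frozen`, the learning rates, and the epoch counts are illustrative assumptions, not the values used by train_i3d.py):

```python
from keras.optimizers import SGD

def train_two_phase(model, train_gen, val_gen, num_frozen=100):
    # Phase 1: freeze the pre-trained base, train only the new top layers.
    for layer in model.layers[:num_frozen]:
        layer.trainable = False
    model.compile(optimizer=SGD(lr=1e-2, momentum=0.9),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    model.fit_generator(train_gen, validation_data=val_gen, epochs=5)

    # Phase 2: unfreeze everything and fine-tune at a lower learning rate.
    for layer in model.layers:
        layer.trainable = True
    model.compile(optimizer=SGD(lr=1e-4, momentum=0.9),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    model.fit_generator(train_gen, validation_data=val_gen, epochs=15)
```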

A pre-trained 3D convolutional neural network, I3D, developed by DeepMind in 2017, is used; see here and model_i3d.py.

Training requires a GPU and is performed through a generator, which is provided in datagenerator.py.
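
A minimal sketch of such a batch generator using keras.utils.Sequence (hypothetical: `FrameSequence` and the one-.npy-file-per-video layout are assumptions, not necessarily how datagenerator.py works):

```python
import numpy as np
import keras

class FrameSequence(keras.utils.Sequence):
    # Loads one batch of pre-extracted frame arrays (one .npy file per
    # video) at a time, so the full dataset never has to fit in memory.
    def __init__(self, npy_paths, labels, num_classes, batch_size=16):
        self.paths, self.labels = npy_paths, labels
        self.num_classes, self.batch_size = num_classes, batch_size

    def __len__(self):
        return int(np.ceil(len(self.paths) / float(self.batch_size)))

    def __getitem__(self, idx):
        batch = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        x = np.stack([np.load(p) for p in self.paths[batch]])
        y = keras.utils.to_categorical(self.labels[batch], self.num_classes)
        return x, y  # x: (batch, frames, height, width, channels)
```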

Note: the code files whose names contain "_mobile_lstm" implement an alternative NN architecture; see here.

Predict during live demo

livedemo.py launches the webcam and

  • waits for the start signal from the user,
  • captures 5 seconds of video (using videocapture.py),
  • extracts frames from the video,
  • calculates and displays the optical flow,
  • and uses the neural network to predict the sign language gesture (a condensed sketch follows this list).
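
A condensed, hypothetical sketch of this loop, assuming a trained Keras `model` and a `class_labels` list (the pixel scaling to [-1, 1] and all parameter values are assumptions, not the repo's exact preprocessing):

```python
import cv2
import numpy as np

def predict_from_webcam(model, class_labels, seconds=5, num_frames=40):
    # Grab roughly `seconds` of video from the default webcam, sample
    # `num_frames` evenly spaced frames, and classify the clip.
    capture = cv2.VideoCapture(0)
    fps = capture.get(cv2.CAP_PROP_FPS) or 25   # fall back if FPS unknown
    frames = []
    for _ in range(int(seconds * fps)):
        ok, frame = capture.read()
        if not ok:
            break
        frames.append(cv2.resize(frame, (320, 240)))
    capture.release()
    indices = np.linspace(0, len(frames) - 1, num_frames).astype(int)
    clip = np.stack([frames[i] for i in indices]).astype(np.float32)
    clip = clip / 127.5 - 1.0                   # scale pixels to [-1, 1]
    probs = model.predict(clip[np.newaxis])[0]  # batch of one clip
    top = int(np.argmax(probs))
    return class_labels[top], float(probs[top])
```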

The neural network model is not included in this GitHub repo (it is too large) but can be downloaded here (150 MB).

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments