Play Drums in your Browser.

Drums-app allows you to simulate any percussion instrument in your browser, using only your webcam. All machine learning models run locally, so no user information is sent to any server.

Check the demo at drums-app.com

Quick Start

Simply serve src/index.html from a local web server, or visit drums-app.com.
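
As a minimal sketch (assuming Python is installed; any static file server works just as well), you can serve the repository root locally and then open src/index.html in your browser:

```python
# Minimal static server sketch using only the Python standard library.
# Run it from the repository root, then open http://localhost:8000/src/index.html
from http.server import HTTPServer, SimpleHTTPRequestHandler

HTTPServer(("localhost", 8000), SimpleHTTPRequestHandler).serve_forever()
```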

Select Set Template to build your own drum template by uploading some images and attaching your sounds to them.

Set Template

Turn on your webcam and enjoy!

Play!

*No cats were harmed during this recording

Implementation Details

This web application is built with MediaPipe and TensorFlow.js.
The pipeline uses two Machine Learning models.

  • Hands Model: The Computer Vision model provided by MediaPipe, which detects 21 landmarks (x, y, z) for each hand.
  • HitNet: An LSTM model developed in Keras for this application and then converted to TensorFlow.js. It takes the last N positions of a hand and predicts the probability that the sequence corresponds to a hit (see the sketch after this list).
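
The following sketch outlines how the two models fit together per frame. The real app implements this in JavaScript with TensorFlow.js; the Python names here (hitnet, get_hand_landmarks, play_drum_sound) are hypothetical stand-ins, and the sequence length of 4 and threshold of 0.5 come from the architecture and results sections below.

```python
# Hedged sketch of the per-frame pipeline: buffer the last detections of a hand,
# ask HitNet for a hit probability, and clear the buffer whenever a hit fires.
from collections import deque
import numpy as np

SEQ_LEN = 4          # HitNet looks at the last 4 detections of a hand
HIT_THRESHOLD = 0.5  # confidence threshold used in the confusion matrices below

buffer = deque(maxlen=SEQ_LEN)

def on_frame(frame):
    landmarks = get_hand_landmarks(frame)         # 21 landmarks, each (x, y, z)
    if landmarks is None:
        return
    buffer.append(np.asarray(landmarks).ravel())  # flatten to a 63-value vector
    if len(buffer) == SEQ_LEN:
        sequence = np.stack(buffer)[None]         # shape (1, 4, 63)
        hit_probability = float(hitnet.predict(sequence, verbose=0)[0, 0])
        if hit_probability > HIT_THRESHOLD:
            play_drum_sound()                     # hypothetical sound trigger
            buffer.clear()                        # empty the buffer after each hit
```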

HitNet Details

Building the Dataset

The dataset used for training was built in the following way:

  1. A representative landmark (Index Finger Dip [Y]) of each detected hand is plotted on an interactive chart, using Chart.js.
  2. Every time a key is pressed, a grey mark is plotted on the same chart.
  3. I start playing drums with one hand while pressing a key on the keyboard (with the other hand) every time I hit an imaginary drum. [Gif Left]
  4. I use the mouse to select in the chart those points that should be considered as hits. [Gif Right]
  5. When the "Save Dataset" button is clicked, all hand positions together with their corresponding labels (1 if the frame was considered a hit, 0 otherwise) are downloaded as a JSON file (see the windowing sketch below).

Dataset Generation

Data Tagging
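
The sketch below shows one way such a file could be turned into fixed-length training windows. The exact JSON layout is not part of the README, so the field names ("landmarks", "is_hit"), the file name, and the convention that each window is labeled by its last frame are all assumptions.

```python
# Hedged sketch: convert recorded hand positions + hit labels into HitNet windows.
import json
import numpy as np

SEQ_LEN = 4  # HitNet consumes windows of the last 4 detections

with open("dataset.json") as f:      # hypothetical file name
    frames = json.load(f)

positions = np.array([frame["landmarks"] for frame in frames])  # (T, 63)
labels = np.array([frame["is_hit"] for frame in frames])        # (T,)

# Each training sample is a window of SEQ_LEN consecutive frames, labeled with
# whether its last frame was tagged as a hit (assumed convention).
X = np.stack([positions[i:i + SEQ_LEN]
              for i in range(len(positions) - SEQ_LEN + 1)])    # (T-3, 4, 63)
y = labels[SEQ_LEN - 1:]                                        # (T-3,)
```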

Defining the Architecture

HitNet has been built in Python, using Keras, and then exported to TensorFlow.js. To avoid any dissonance between the hit on the drum and the produced sound, HitNet must run as fast as possible; for this reason it implements an extremely simple architecture.

HitNet Architecture

It takes as input the last 4 detections of a hand (the flattened version of its 21 landmarks (x, y, z)) and outputs the probability that the sequence corresponds to a hit. It is composed only of an LSTM layer followed by a ReLU activation (with dropout, p = 0.25) and a Dense output layer with a single unit, followed by a sigmoid activation.
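
A hedged Keras sketch of that description follows. The number of LSTM units is not stated in the README, so 32 is only a placeholder, and folding the ReLU and dropout into the LSTM layer is one possible reading of the text.

```python
# Hedged sketch of the HitNet architecture described above (unit count assumed).
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(4, 63)),               # last 4 detections x 21 landmarks x (x, y, z)
    layers.LSTM(32, activation="relu", dropout=0.25),
    layers.Dense(1, activation="sigmoid"),     # probability that the sequence is a hit
])
model.summary()
```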

Training the model

HitNet has been trained in Keras, using the following parameterization:

  • Epochs: 3000.
  • Optimizer: Adam.
  • Loss: Weighted Binary Cross Entropy*.
  • Training/Val Split: 0.85-0.15.
  • Data Augmentation (see the sketch after this list):
    • Mirroring: X axis.
    • Shift: Shift applied in block for the whole sequence.
      • X Shift: ±0.3.
      • Y Shift: ±0.3.
      • Z Shift: ±0.5.
    • Interframe Noise: Small shift applied independently to each frame of the sequence.
      • Interframe Noise X: ±0.01.
      • Interframe Noise Y: ±0.01.
      • Interframe Noise Z: ±0.0025.
    • Intraframe Noise: Extremely small shift applied independently to each single part of a hand.
      • Intraframe Noise X: ±0.0025.
      • Intraframe Noise Y: ±0.0025.
      • Intraframe Noise Z: ±0.0001.
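
A rough numpy sketch of this augmentation is shown below. The coordinate layout (frames, landmarks, xyz), the mirroring convention (x -> 1 - x, assuming normalized coordinates), and applying the mirror 50% of the time are assumptions; the repository's actual augmentation code may differ.

```python
# Hedged numpy sketch of the data augmentation described above.
import numpy as np

def augment(sequence, rng=np.random.default_rng()):
    """sequence: array of shape (4, 21, 3) holding (x, y, z) landmarks."""
    seq = sequence.copy()

    # Mirroring over the X axis (applied half of the time, assumed).
    if rng.random() < 0.5:
        seq[..., 0] = 1.0 - seq[..., 0]

    # Shift applied in block to the whole sequence.
    seq += rng.uniform([-0.3, -0.3, -0.5], [0.3, 0.3, 0.5])

    # Interframe noise: one small shift per frame.
    seq += rng.uniform([-0.01, -0.01, -0.0025], [0.01, 0.01, 0.0025],
                       size=(seq.shape[0], 1, 3))

    # Intraframe noise: an extremely small shift per individual landmark.
    seq += rng.uniform([-0.0025, -0.0025, -0.0001], [0.0025, 0.0025, 0.0001],
                       size=seq.shape)
    return seq
```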

The weights exported to TensorFlow.js are not those of the last epoch, but those that minimized the validation loss at any intermediate epoch.

*Loss is weighted since the positive class is extremely underrepresented in the training set.
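
A hedged training sketch matching this parameterization is shown below; it reuses the `model`, `X`, and `y` placeholders from the earlier sketches. The positive-class weight (10.0 here) is an assumption, since the README only states that the loss is weighted because hits are heavily underrepresented.

```python
# Hedged Keras training sketch: Adam, weighted binary cross entropy (via class
# weights), a 0.85/0.15 split, 3000 epochs, and keeping the weights with the
# lowest validation loss.
from tensorflow import keras

checkpoint = keras.callbacks.ModelCheckpoint(
    "hitnet_best.h5", monitor="val_loss", save_best_only=True)

model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(
    X, y,
    validation_split=0.15,            # 0.85/0.15 train/validation split
    epochs=3000,
    class_weight={0: 1.0, 1: 10.0},   # assumed weighting ratio
    callbacks=[checkpoint],
)
```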

Analyzing Results

The confusion matrices show that results are strong for both classes with the confidence threshold set at 0.5.

Train Confusion Matrix

Validation Confusion Matrix
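
For reference, a matrix like the ones above could be computed as follows; the README does not specify the tooling, so scikit-learn is only one option, and X_val / y_val stand for the 0.15 validation split.

```python
# Hedged sketch of computing a confusion matrix at the 0.5 confidence threshold.
from sklearn.metrics import confusion_matrix

probabilities = model.predict(X_val).ravel()
predictions = (probabilities > 0.5).astype(int)
print(confusion_matrix(y_val, predictions))
```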

Although these False Positives and False Negatives could worsen the user experience in a network that runs several times per second, they do not really affect playtime in a real situation. This is due to three factors:

  1. Most False Positives come from the frames immediately before or after a hit. In practice, this is solved by emptying the sequence buffers every time a hit is detected.
  2. The small number of False Negatives in the training set comes from Data Augmentation, or from the hit being detected on the previous or the following frame. In real cases, these displacements do not affect the experience.
  3. The remaining False Positives rarely appear in real cases since, during playtime, only the sequences whose detections enter the predefined drum regions are analyzed. In practice this works as a double check for positive cases.

The evolution of the train/validation loss during training confirms that there was no overfitting.

Loss