This was an experiment to see how well Mel Frequency Cepstral Coefficients (MFCCs) and chroma analysis extract features from audio signals. The original goal was to detect whether a given song is by Chet Baker or Beyonce, two clearly very different genres. That turned out to be far harder than expected, so I moved on to simpler data and used raw audiobooks to detect whether a voice is male or female. I then transformed the preprocessed audio snippets and fed them into a neural network to classify the voices by pitch.
- see the `soundTransformation` directory for my implementation of the Mel Frequency Cepstral Coefficients (MFCCs); for most of the transformation I used the Python librosa library (a minimal MFCC sketch follows this list)
- see all preprocessing in the `music` directory
- run `chromogram.py` to get the cleaned-up sound input from `sound_input.wav` -> this produces a data array of 1-second CQT-transformed clips (a CQT sketch also follows this list)
- run the network with `test.py`
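My step-by-step MFCC implementation lives in `soundTransformation`; the librosa-based path it mirrors boils down to roughly the following sketch (the choice of `n_mfcc=13` is my assumption here, not necessarily what the repo uses):

```python
import librosa

# load the input file referenced in the run instructions,
# at librosa's default sample rate of 22050 Hz
y, sr = librosa.load("sound_input.wav")

# librosa builds the mel filterbank and applies the DCT internally;
# 13 coefficients per frame is a common (assumed) choice
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
print(mfccs.shape)  # (13, n_frames)
```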
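The clip array that `chromogram.py` produces can be sketched like this, assuming default librosa CQT parameters (the actual script may differ):

```python
import librosa
import numpy as np

y, sr = librosa.load("sound_input.wav")

# cut the signal into non-overlapping 1-second clips and
# CQT-transform each one, keeping only the magnitudes
clip_len = sr
n_clips = len(y) // clip_len
clips = np.array([
    np.abs(librosa.cqt(y[i * clip_len:(i + 1) * clip_len], sr=sr))
    for i in range(n_clips)
])
print(clips.shape)  # (n_clips, n_bins, frames_per_clip)
```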
- cleaned up and processed two audiobooks with male & female voices (concatenated the two files, trimmed them to equal lengths, removed silence below a 20 dB threshold); see the preprocessing sketch after this list
- implemented the sound transformation with both MFCCs and CQT
- placed 1-second clips of the audio data into a NumPy array
- network all set up -> 99% accuracy when testing on speakers from the training sample (a stand-in network sketch follows this list)
- 84% accuracy when testing on speakers not in the training sample
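A sketch of the audiobook cleanup and clip extraction, under hypothetical file names (the actual files are not named here), using librosa's silence splitting with the 20 dB threshold:

```python
import librosa
import numpy as np

# hypothetical file names for the two audiobooks
male, sr = librosa.load("male_audiobook.wav")
female, _ = librosa.load("female_audiobook.wav", sr=sr)

# trim both recordings to equal lengths so the classes stay balanced
n = min(len(male), len(female))
male, female = male[:n], female[:n]

def remove_silence(y, top_db=20):
    """Keep only intervals louder than top_db below the signal's peak."""
    intervals = librosa.effects.split(y, top_db=top_db)
    return np.concatenate([y[start:end] for start, end in intervals])

male, female = remove_silence(male), remove_silence(female)

def to_clips(y, clip_len):
    """Stack non-overlapping clips of clip_len samples into one NumPy array."""
    n_clips = len(y) // clip_len
    return np.array([y[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)])

# 1-second clips plus binary labels (0 = male, 1 = female)
male_clips, female_clips = to_clips(male, sr), to_clips(female, sr)
X = np.concatenate([male_clips, female_clips])
labels = np.concatenate([np.zeros(len(male_clips)), np.ones(len(female_clips))])
```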
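As a stand-in for the network run by `test.py`, here is a minimal dense classifier over flattened CQT clips in TensorFlow/Keras; the architecture and hyperparameters are assumptions, not the repo's exact setup:

```python
import numpy as np
import tensorflow as tf

def build_model(input_dim):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(input_dim,)),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # P(voice is female)
    ])
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

# placeholder data shaped like flattened 1-second CQT clips
# (84 CQT bins x 44 frames at librosa's default hop length)
X = np.random.rand(200, 84 * 44).astype("float32")
y = np.random.randint(0, 2, size=200)

model = build_model(X.shape[1])
model.fit(X, y, epochs=5, validation_split=0.1)
```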
For robustness, the next step would be to train on a new data array with a bigger variety of female/male voices -> problematic, as there is no such data available. In terms of trying different models: for more complex tasks a recurrent neural network would be more appropriate, so an idea could be to try that on a larger dataset (sketched below) -> again, tough to get a larger sample.
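For reference, the recurrent idea could look like this in Keras (purely a sketch with assumed layer sizes; each clip would need transposing so that time is the first axis):

```python
import tensorflow as tf

# treat each 1-second clip as a sequence of CQT frames (time steps)
# rather than one flat vector; 64 units is an arbitrary assumed size
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, input_shape=(44, 84)),  # (frames_per_clip, n_bins)
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```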