Deep Learning and Digital Signal Processing for Environmental Sound Classification


Introduction


Automatic environmental sound classification (ESC) based on the ESC-50 dataset (and its ESC-10 subset) built by Karol Piczak and described in the following article:

"Karol J. Piczak. 2015. "ESC: Dataset for Environmental Sound Classification." In Proceedings of the 23rd ACM international conference on Multimedia (MM '15). Association for Computing Machinery, New York, NY, USA, 1015–1018. https://doi.org/10.1145/2733373.2806390".

The ESC-50 dataset is available from Dr. Piczak's GitHub: https://github.com/karoldvl/ESC-50/ The following recent article is a descriptive survey of environmental sound classification (ESC) detailing datasets, preprocessing techniques, features, classifiers, and their reported accuracy.

Anam Bansal, Naresh Kumar Garg, "Environmental Sound Classification: A descriptive review of the literature," Intelligent Systems with Applications, Volume 16, 2022, 200115, ISSN 2667-3053, https://doi.org/10.1016/j.iswa.2022.200115.

Dr. Piczak maintains a table of the best published results on his GitHub, listing authors, publications, and the methods used. We reproduce the top of the table for supervised classification here.

| Title | Notes | Accuracy | Paper |
| --- | --- | --- | --- |
| BEATs: Audio Pre-Training with Acoustic Tokenizers | Transformer model pretrained with acoustic tokenizers | 98.10% | chen2022 |
| HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection | Transformer model with hierarchical structure and token-semantic modules | 97.00% | chen2022 |
| CLAP: Learning Audio Concepts From Natural Language Supervision | CNN model pretrained by natural language supervision | 96.70% | elizalde2022 |
| AST: Audio Spectrogram Transformer | Pure attention model pretrained on AudioSet | 95.70% | gong2021 |
| Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer | Transformer model pretrained with visual image supervision | 95.70% | zhao2022 |

We develop our own pre-processing techniques to achieve the best possible accuracy, guided by Dr. Piczak's table and the Bansal et al. review.
At this stage, and before working on more advanced techniques:

  • we work with the ESC-10 sub-dataset.
  • we test mel-spectrograms and wavelet transforms.

We train a Convolutional Neural Network (CNN) with grayscale spectrograms and scalograms, targeting an accuracy well above 90%.
Once tests with the most effective CNN implementation are complete, we will run predictions on various audio clips downloaded from YouTube and, if necessary, update the CNN hyperparameters.

ESC-10 Types of sounds/noises


The ESC-10 dataset contains 400 five-second Ogg Vorbis audio clips (sampling frequency 44.1 kHz, 32-bit float) organized into 10 classes,
with 40 audio clips per class.
The 10 Sound/Noise classes are:

  • Class = 01-Dogbark, Label = 0
  • Class = 02-Rain, Label = 1
  • Class = 03-Seawaves, Label = 2
  • Class = 04-Babycry, Label = 3
  • Class = 05-Clocktick, Label = 4
  • Class = 06-Personsneeze, Label = 5
  • Class = 07-Helicopter, Label = 6
  • Class = 08-Chainsaw, Label = 7
  • Class = 09-Rooster, Label = 8
  • Class = 10-Firecrackling, Label = 9

Quick analysis of the type of sound/noise:

  • Dog bark, baby cry, person sneeze and rooster involve non-linear vibration and resonance of the vocal (or nasal) tract and cords, a bit like speech, and are considered non-stationary.
  • Rain and sea waves are somewhat stationary; rain sounds a bit like white noise. They are pseudo-stationary because other noises occur at times in various audio clips.
  • Helicopter and chainsaw: pseudo-stationary. If the engine rpm does not change within a time frame, the process is stationary, with harmonics linked to the engine rpm, the number of cylinders, and the number of rotor blades (helicopter).
  • Fire crackling: impulsive noise, but with a pseudo-stationary background noise.
  • Clock tick: it depends. Impulsive every second (frequency = 1 Hz), but in some audio clips there are several "pulsations" within a one-second time frame, and the ticks have the signature of a non-linear mechanical vibration that radiates sound, with harmonics.

Quick Literature review

Methodology

  • In an effort to reduce the size of the problem and the computation time, while retaining the relevant information, we:
    • reduce the audio sampling frequency from 44.1 kHz to 22.05 kHz.
    • reduce the length of the audio clips to 1.25 s, based on signal power considerations. Many audio clips contain repeated occurrences of the same sound phenomenon (dog barking or baby crying, for example), and most of the signal is "silence".
  • Normalize the audio signal amplitude to 1 (0 dBFS); these first steps are sketched in the code below this list.
  • Compute mel-spectrograms or wavelet transforms for the 10 classes. We empirically optimized the wavelet selection and the wavelet transform parameters.
  • Reduce the size of the scalograms in the time domain (some details are lost).
  • Train a CNN on 256x256 grayscale mel-spectrograms, or on two series of 128x128 grayscale scalograms: magnitude and phase. Train/test split: 80%/20%.
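A minimal sketch of these preprocessing steps with librosa and NumPy follows; the helper name preprocess_clip and the energy-based window selection are illustrative assumptions, not the exact notebook code.

```python
# Hedged sketch of the preprocessing above (assumed helper, not the notebook code):
# resample to 22.05 kHz, keep the highest-energy 1.25 s window, peak-normalize to 0 dBFS.
import numpy as np
import librosa

def preprocess_clip(path, sr=22050, win_s=1.25):
    y, _ = librosa.load(path, sr=sr, mono=True)         # resample 44.1 kHz -> 22.05 kHz
    win = int(win_s * sr)
    if len(y) <= win:
        y = np.pad(y, (0, win - len(y)))                 # pad short clips to 1.25 s
    else:
        power = np.convolve(y ** 2, np.ones(win), mode="valid")
        start = int(np.argmax(power))                    # window with the most signal power
        y = y[start:start + win]
    return y / (np.max(np.abs(y)) + 1e-12)               # normalize peak amplitude to 1 (0 dBFS)
```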

We tested three methods:

  • Mel-spectrograms.
  • Complex Continuous Wavelet Transforms (complex CWT).
  • Fusion mel-spectrograms + complex CWT.

After an 80%/20% train/test split, we train a Convolutional Neural Network with 32-64-128-256 neuron hidden layers. Parameters are detailed in the CNN section of the notebooks.
Note: Although mel-spectrograms and wavelet transforms are shown in color, the CNN is trained with grayscale images.
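For reference, a minimal Keras sketch of such a CNN on 128x128 grayscale inputs is shown below; the kernel sizes, pooling, dropout and dense head are assumptions, the exact parameters being those given in the notebooks' CNN section.

```python
# Minimal Keras sketch of a CNN with 32-64-128-256 convolutional blocks (assumed layout).
from tensorflow.keras import layers, models

def build_cnn(input_shape=(128, 128, 1), n_classes=10):
    model = models.Sequential([layers.Input(shape=input_shape)])
    for filters in (32, 64, 128, 256):
        model.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
        model.add(layers.MaxPooling2D())
    model.add(layers.Flatten())
    model.add(layers.Dense(128, activation="relu"))      # dense head size is an assumption
    model.add(layers.Dropout(0.3))
    model.add(layers.Dense(n_classes, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```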

ESC-10 Results Synthesis

The best accuracies obtained with the three methods are summarized in the table below.

| Method | Accuracy |
| --- | --- |
| 256x256 Mel-spectrograms | 92.5% |
| 128x128 Complex CWT Scalograms (Magnitude + Phase) | 94% |
| 128x128 Fusion Complex CWT + Mel-Spectrograms | 99% |

Details of the best result with the "Fusion" method:

Classification report

Confusion matrix



Jupyter Notebooks

All Jupyter notebooks share the same structure; they are identical except for the sections implementing the wavelet transforms or the mel-spectrogram transforms.

Reduction of the audio clip length and optimization of the mel-spectrogram parameters for the best discrimination of sound categories. We train the CNN with 256x256 grayscale images. Accuracy: ~92.5%.
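A hedged sketch of the mel-spectrogram computation with librosa is given below; the FFT size and hop length are illustrative choices, not the tuned notebook parameters.

```python
# Sketch: 256-band mel-spectrogram in dB, rescaled to a 256x256 grayscale image.
import numpy as np
import librosa

def mel_image(y, sr=22050, n_mels=256):
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                       hop_length=max(1, len(y) // 256),
                                       n_mels=n_mels)
    S_db = librosa.power_to_db(S, ref=np.max)                # power -> dB
    img = (S_db - S_db.min()) / (S_db.max() - S_db.min())    # scale to [0, 1] grayscale
    return img[:, :256]                                       # keep a 256x256 image
```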

Mel-spectrograms (dB)


Optimization of the wavelet selection and parameters for the best discrimination of sound classes.
Wavelet selection: the difficulty here is selecting the right wavelet for the full range of noise types: pseudo-stationary, non-stationary, transient/impulsive.
Applying different wavelets to each type of sound significantly improves classification accuracy. We train the CNN with two 128x128 grayscale images per audio clip: scalogram magnitude and phase. Accuracy: ~94%.
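A possible implementation of the complex CWT scalograms with PyWavelets is sketched below; the complex Morlet wavelet name, the scale range and the decimation to 128 time columns are assumptions (the notebooks tune the wavelet per sound class).

```python
# Sketch: complex CWT scalogram magnitude (dB) and phase (rad), reduced to 128x128.
import numpy as np
import pywt

def cwt_scalograms(y, sr=22050, n_scales=128):
    scales = np.geomspace(2, 512, num=n_scales)                 # log-spaced scales (assumed range)
    coeffs, _ = pywt.cwt(y, scales, "cmor1.5-1.0", sampling_period=1.0 / sr)
    mag_db = 20 * np.log10(np.abs(coeffs) + 1e-12)               # magnitude in dB
    phase = np.angle(coeffs)                                     # phase in radians
    step = max(1, coeffs.shape[1] // 128)                        # coarse time-axis decimation
    return mag_db[:, ::step][:, :128], phase[:, ::step][:, :128]
```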

Scalograms magnitude (dB)

Scalograms phase (rad)


Combining mel-spectrograms (Part I) with complex wavelet transforms (Part II) enhances accuracy on features that are difficult to discriminate. We train the CNN with three 128x128 grayscale images per audio clip. Accuracy: ~99%.
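One plausible way to fuse the features, sketched below, is to stack the scalogram magnitude, the scalogram phase and the mel-spectrogram as three 128x128 grayscale planes of a single CNN input; the exact fusion layout used in the notebook may differ.

```python
# Sketch: stack the three 128x128 grayscale feature maps into one (128, 128, 3) input tensor.
import numpy as np

def fuse_features(scalo_mag, scalo_phase, mel_img):
    # each input is a 128x128 array scaled to [0, 1]
    return np.stack([scalo_mag, scalo_phase, mel_img], axis=-1)
```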

Rooster: Scalogram Magnitude (dB), Phase (rad) + Mel-spectrogram (dB)

License

ESC-50: Dataset for Environmental Sound Classification
https://github.com/karoldvl/ESC-50/
https://dx.doi.org/10.7910/DVN/YDEPUT

Dataset license

The dataset as a whole is available under the terms of the Creative Commons Attribution-NonCommercial license (http://creativecommons.org/licenses/by-nc/3.0/).

The ESC-10 subset is licensed as a Creative Commons Attribution 3.0 Unported
(https://creativecommons.org/licenses/by/3.0/) dataset.

Licensing/attribution details for individual audio clips are available in file:

License