
🎸 Lumen Data Science 2023 – Audio Classification (2nd place)




Presentation | Technical Documentation | Project Documentation | Experiments



🏆 Fast and Fourier team


Vinko DraguΕ‘ica

Filip Mirković


Ivan Rep


Matej Ciglenečki

Setup

Python Virtual Environment

Create and populate the virtual environment. Simply put, the virtual environment allows you to install Python packages for this project only (which you can easily delete later). This way, we won't clutter your global Python packages.

Step 1: Execute the following commands:

python3 -m venv venv
source venv/bin/activate
sleep 1
pip install -r requirements.txt
pip install -r requirements-dev.txt

Step 2: Install the current directory as an editable Python module:

pip install -e .

(optional) Step 3: Activate pre-commit hook

pre-commit install

Pre-commit, defined in .pre-commit-config.yaml, will fix your imports and make sure the code follows Python standards.

To remove pre-commit, run: rm -rf .git/hooks

📁 Directory structure

| Directory | Description |
| --- | --- |
| data | datasets |
| docs | documentation |
| figures | figures |
| models | model checkpoints, model metadata, training reports |
| references | research papers and competition guidelines |
| src | Python source code |

Tasks

  • create eval script which will calculate ALL metrics for the whole dataset (see the metrics sketch after this list)
  • y_true, y_pred
    • confusion matrix
    • distribution of prediction metrics (Hamming score, F1, accuracy)
    • plot per instrument for each metric
    • roc curve
  • multiple dataset distribution plotting
  • instrument count histogram plot
  • create backend API/inference
    • load model in inference, calculate metrics for the whole IRMAS test dataset (analytics)
      • should reuse the train.py script, just use different modes?
    • http server with some loaded model which returns responses
  • technical documentation
  • visualize embedded features: for each model with tensorboard embedder https://projector.tensorflow.org/
  • add a feature that uses different features per channel - convolutional models expect a 3-channel tensor, so let's make full use of those 3 channels
  • add fluffy support for all models
  • try out focal loss and label smoothing: https://pytorch.org/vision/main/_modules/torchvision/ops/focal_loss.html
  • convert all augmentations so they happen on the GPU
    • remove audio transform from the dataset class
    • implement on_after_batch_transfer
    • both whole audio transform (along with augmentations) in the Model itself
    • the model then calls on_after_batch_transfer automatically and performs the augmentations
    • run experiments in both cases
  • make sure augmentations happen in batches
  • Add ArcFace module in codebase
  • Rep vs IRMAS: perform validation on Rep's corrected dataset to check how many labels are correctly marked in the original dataset
    • check if all instruments are correct
    • check if at least one instrument is correct
    • hamming distance between Rep's and original
    • how dirty is the training set in terms of including non-predominant instruments
  • train with relabeled data (cleanlab): (@matej has to provide csv) Include train override csv. No augmentations. Compare both models' metrics.
  • Inference analysis: run inference on single audio with multiple different durations (run on 10, 20, ..., 590, 600 seconds)
  • Train Wav2Vec2 CNN: IRMAS only no aug
  • Fluffy: Directly compare Fluffy Deep head CNN to standard Deep head CNN
  • Add Focal Loss, InstrumentFamilyLoss to src/model/loss_functions.py and add SupportedLosses
  • check what's up with pretrained weights (crop and resize) -> everything is fine
    • turns out that the models use average pooling over the height and width which means that the final representation only has dimension (B, C)
    • the model silently fails instead of breaking, so keep an eye out in case something doesn't work
  • train with relabeled data (rep): Include Ivan's relabeled data and retrain some model to check the performance boost (make sure to pick a model which already works)
  • Train EfficientNet, IRMAS only, no aug, with small batch size=4
  • train ResNeXt 50_32x4d on MelSpectrogram
    • Compare how augmentations affect the final metrics:
      • with no augmentations
      • with augmentations
  • train ResNeXt 50_32x4d on MFCC
    • Compare how augmentations affect the final metrics:
      • with no augmentations
      • with augmentations
  • OpenMIC guitars: use cleanlab and K-means to find guitars. OpenMIC has 1 guitar label. Take a pretrained AST and do feature extraction on IRMAS train, only on electric and acoustic guitar examples. Create a script which takes the AST features and fits K-means between the two classes. Cluster OpenMIC guitars, take the most confident examples and save them (and the new labels).
  • ⚠️ create a CSV which splits IRMAS validation to train and validation. First, group the .wav examples by the same song and find union of labels. Apply http://scikit.ml/stratification.html Multi-label data stratification to split the data.
  • use validation examples in train (without data leakage), check what's the total time of audio in train and val
  • augmentations: time shift, pitch shift, sox
    • add normalization after augmentations
  • add gradient/activation visualization for a predicted image
  • write summary of Wavelet transform and how it affects the results
  • Wav2Vec results, and train
  • write summary of LSTM results
  • implement an argument which accepts a list of numbers [1000, 500, 4] and will create the appropriate deep CNN
    • use a module called deep head and pass it as an argument
  • finish experiments and interpretation of the wavelet transformation
  • implement spectrogram cropping and zero padding instead of resizing
  • implement an SVM model which uses classical audio features for multilabel classification
    • research if SVM can perform multilabel classification or use 11 SVMs
  • add more augmentations
  • check if wavelet works
  • implement chunking of the audio in inference and perform multiple forward passes
  • implement saving the embeddings of each model for visualizations using dimensionality reduction
  • think about and research what happens with a variable sampling rate and how we can avoid issues with time length change; solution: chunking
  • add explained variance percentage in PCA
  • Create a script/notebook for plotting SVM results. There should be a total of 22 plots. You can reduce dimensionality with t-SNE and PCA from sklearn. Save the plots to .png so we can easily include them in the documentation
  • find features which show the highest amount of variance!
    • iterate through the whole dataset, calculate the features and save them. Then calculate the variance of each feature over the whole dataset
  • cleanup audio transform for spectrograms (remove repeat)
    • you still need to resize because the height isn't 224 (it's 128) but make sure the width is the same as the pretrained model image width
  • use caculate_spectrogram_duration_in_seconds to dynamically determine the audio length.
  • implement spectrogram normalization (mean, std) and use those parameters to preprocess the image before training.
  • implement Fluffy nn.Module
  • use Fluffy on Torch CNN, multi-head
  • train some model Fluffy
  • Wav2Vec2 feature extractor only
  • move spectrogram chunking to collate
  • prototype pretraining phase:
    • Shuffle parts of the spectrogram in the following way: (16x16 grid)
      • shuffle 15% of patches
      • ELECTRA-style objective: is the patch shuffled?
  • ESC50: download non-instrument audio files and write a data loader for them (@matej). This might not be important since the model usually gives [0,0,0,0,0] anyway.
  • any dataset/csv loader
  • ⚠️ download the whole IRMAS dataset
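
A minimal sketch of what the eval-script metrics could look like, assuming y_true/y_pred are multi-hot numpy arrays of shape (n_examples, n_instruments) and y_score holds per-instrument probabilities; the function name and dictionary keys are illustrative, not part of the codebase.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    hamming_loss,
    multilabel_confusion_matrix,
    roc_auc_score,
)

def evaluate(y_true: np.ndarray, y_pred: np.ndarray, y_score: np.ndarray) -> dict:
    # Per-instrument F1 supports the "plot per instrument for each metric" item.
    per_instrument_f1 = f1_score(y_true, y_pred, average=None, zero_division=0)
    return {
        "accuracy": accuracy_score(y_true, y_pred),            # exact-match accuracy
        "hamming_score": 1.0 - hamming_loss(y_true, y_pred),   # fraction of correct labels
        "f1_macro": f1_score(y_true, y_pred, average="macro", zero_division=0),
        "f1_per_instrument": per_instrument_f1,
        "confusion_matrices": multilabel_confusion_matrix(y_true, y_pred),
        "roc_auc_macro": roc_auc_score(y_true, y_score, average="macro"),  # for the ROC curve
    }
```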

📋 Notes

General links:

IRMAS dataset issues

Use cleanlab to find bad labels: https://docs.cleanlab.ai/stable/tutorials/audio.html?highlight=encoderclassifier

Train and validation dataset, move some validation examples to train

Do this without introducing data leakage, but make sure that we still have enough validation data.

Resizing and chunking

Chunking should happen only in inference in the following way:

  • preprocess the 20 sec audio normally, send the spectrogram to the model and chunk the spectrogram inside the predict_step.

We don't do chunking in the train step because we can't chunk the label.

The time window of the spectrogram is defined by the maximum audio length of a train sample. If we chunk that sample, we don't know whether the label will appear in every one of those chunks.
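
A rough sketch of that inference-time chunking, assuming a Lightning-style predict_step and a fixed chunk_width attribute; the mean aggregation over chunk probabilities is an assumption.

```python
import torch

# Sketch: chunk a long spectrogram along the time axis inside predict_step and
# aggregate per-chunk predictions. chunk_width and the mean aggregation are assumptions.
def predict_step(self, batch, batch_idx):
    spectrogram, _ = batch                       # (B, C, H, W_total), labels unused here
    chunk_width = self.chunk_width               # e.g. the width the backbone was trained on
    chunks = spectrogram.split(chunk_width, dim=-1)
    # Drop the last chunk if it is too narrow (padding it would also work).
    chunks = [c for c in chunks if c.shape[-1] == chunk_width]
    logits = torch.stack([self.forward(c) for c in chunks])  # (num_chunks, B, num_instruments)
    probs = torch.sigmoid(logits).mean(dim=0)                # aggregate over chunks
    return probs
```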

Visualizations

Add a low-dimensional (t-SNE) plot of features to check clusters. How to do that:

  • forward pass every example
  • now you have embeddings
  • run t-SNE
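
A sketch of that visualization, assuming the embeddings and instrument ids have already been collected into numpy arrays by forward passes; sklearn's TSNE is used here, and the tensorboard/projector route mentioned in the tasks would work just as well.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

# Sketch: project collected embeddings (n_examples, embedding_dim) to 2D and plot them.
# `embeddings` and `instrument_ids` are assumed to come from a forward pass over the dataset.
def plot_tsne(embeddings: np.ndarray, instrument_ids: np.ndarray, out_path: str = "tsne.png"):
    points = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(embeddings)
    plt.figure(figsize=(8, 8))
    scatter = plt.scatter(points[:, 0], points[:, 1], c=instrument_ids, cmap="tab20", s=5)
    plt.legend(*scatter.legend_elements(), title="instrument", loc="best", fontsize="small")
    plt.savefig(out_path, dpi=150)
```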

Pretraining

Masked Autoencoders (MAE)

https://huggingface.co/docs/transformers/model_doc/vit_mae#transformers.ViTMAEForPreTraining

Has a pretraining script, but does it work? Written as an nn.Module.

Pretraining on CNNs:

Adapter transformer training

Instead of training the transformer backbone, add layers in between the backbone and train those layers. Those layers are called adapters.

https://docs.adapterhub.ml/
https://docs.adapterhub.ml/adapter_composition.html
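
A minimal bottleneck-adapter sketch (not AdapterHub's actual implementation): a small residual MLP inserted between frozen backbone blocks and trained on its own, with illustrative dimensions.

```python
import torch.nn as nn

# Sketch of a bottleneck adapter: a small residual MLP trained while the
# transformer backbone stays frozen. hidden_dim / bottleneck_dim are assumptions.
class Adapter(nn.Module):
    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual connection around the bottleneck
```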

Normalization

Normalization of the audio in the time domain (amplitude). Librosa already does this?

Spectrogram normalization, same as normalization in any image problem: pre-calculate the mean and std and use them in the preprocessing step.
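
A sketch of pre-calculating the dataset mean/std and applying them, assuming a dataloader that yields (spectrogram, label) batches; a per-channel version would follow the same pattern.

```python
import torch

# Sketch: pre-calculate the global mean/std over all train spectrograms, then
# normalize each spectrogram with those constants (same as image normalization).
def compute_mean_std(dataloader):
    total, total_sq, n = 0.0, 0.0, 0
    for spectrogram, _ in dataloader:           # batches of shape (B, C, H, W)
        total += spectrogram.sum()
        total_sq += (spectrogram ** 2).sum()
        n += spectrogram.numel()
    mean = total / n
    std = (total_sq / n - mean ** 2).sqrt()
    return mean, std

def normalize(spectrogram, mean, std):
    return (spectrogram - mean) / std
```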

🎵 Datasets

IRMAS dataset https://www.upf.edu/web/mtg/irmas:

  • The IRMAS test dataset only contains information about the presence of instruments. Drums and music genre information is not present.
  • examples: 6705
  • instruments: 11
  • duration: 3sec

NSynth: Neural Audio Synthesis https://magenta.tensorflow.org/datasets/nsynth

  • examples: 305 979
  • instruments: 1006
  • A novel WaveNet-style autoencoder model that learns codes that meaningfully represent the space of instrument sounds.

MusicNet:

  • examples: 330
  • instruments: 11
  • duration: song

MedleyDB:

  • examples: 122
  • instruments: 80

OpenMIC-2018 https://zenodo.org/record/1432913#.W6dPeJNKjOR

Distance between classes

https://kevinmusgrave.github.io/pytorch-metric-learning/losses/
How to construct triplets: https://omoindrot.github.io/triplet-loss
Softmax loss and center loss: https://hav4ik.github.io/articles/deep-metric-learning-survey

Some instruments are similar and their class should be (somehow) close together.

Standard classification loss + (alpha * distance between two classes)

  1. the distance is probably computed from embeddings of some pretrained audio model (audio transformer)

Triplet loss: how do we form triplets?

  1. real (anchor): guitar
  2. positive: guitar
  3. negative: not guitar?
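
A sketch of the "classification loss + alpha * distance" idea, using pytorch-metric-learning's TripletMarginLoss as the distance term; alpha, the single-label instrument ids used for triplet mining, and the tensor shapes are assumptions, not the project's actual setup.

```python
import torch.nn.functional as F
from pytorch_metric_learning import losses

# Sketch: multilabel classification loss plus a metric-learning term on the embeddings.
triplet_loss = losses.TripletMarginLoss(margin=0.2)

def combined_loss(logits, embeddings, targets, instrument_ids, alpha: float = 0.1):
    # logits/targets: (B, num_instruments); embeddings: (B, D); instrument_ids: (B,) int labels
    classification = F.binary_cross_entropy_with_logits(logits, targets.float())
    metric = triplet_loss(embeddings, instrument_ids)  # pulls same-instrument embeddings together
    return classification + alpha * metric
```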

Audio which are not instruments

Research audio files which are NOT instruments. Both background noises and sounds SIMILAR to instruments! Download the datasets and write a dataset loader for them (@matej). Label everything [0, ..., 0].

💡⚙️ Models and training

Problem: how to encode additional features (drums/no drums, music genre)? We can't create a spectrogram out of those arrays. Maybe simply append one-hot encoded values after the spectrogram becomes a 1D linear vector?
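
A sketch of that last option: one-hot encode the extra metadata and concatenate it to the flattened backbone features before the classifier head; all names and dimensions here are illustrative.

```python
import torch
import torch.nn as nn

# Sketch: concatenate one-hot metadata (e.g. drums/no-drums, genre) to the flattened
# backbone features before the final classifier.
class ClassifierWithMetadata(nn.Module):
    def __init__(self, backbone: nn.Module, feature_dim: int, metadata_dim: int, num_instruments: int = 11):
        super().__init__()
        self.backbone = backbone                              # assumed to return (B, feature_dim)
        self.head = nn.Linear(feature_dim + metadata_dim, num_instruments)

    def forward(self, spectrogram, metadata_one_hot):
        features = self.backbone(spectrogram)
        return self.head(torch.cat([features, metadata_one_hot], dim=-1))
```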

BEATs

Current state-of-the-art model for audio classification on multiple datasets and multiple metrics.

paper: https://arxiv.org/pdf/2212.09058.pdf
github: https://github.com/microsoft/unilm/tree/master/beats
https://paperswithcode.com/sota/audio-classification-on-audioset

AST

AST max duration is 10.23 sec for 16 kHz audio

Notes:

  • They used 16kHz audio for the pretrained model, so if you want to use the pretrained model, please prepare your data in 16kHz

Idea: introduce multiple MLP (fully connected layer) heads. Each head will detect a single instrument instead of trying to detect all instruments at once.

Idea: train on single wav files, then later introduce the irmas_combinatorics dataset which contains multiple wavs

LSTM and Melspectrograms (Mirko)

Trained an LSTM (with and without Bahdanau attention) on melspectrogram and MFCC features, for single- and multiple-instrument classification. Adding instruments according to genre and randomly was also explored. This approach retains high accuracy due to the class imbalance of the train and validation set; however, the F1 metric with macro averaging in the multi-instrument case remains low, in the 0.26 - 0.35 interval. All instruments with higher F1 metrics use Bahdanau attention.

LSTM and Wavelet (Mirko)

Aside from sliding wavelet filters, the output of the wavelet transform needs to be log-scaled or, preferably, transformed with amplitude_to_db. This does not seem to improve or degrade the performance of the LSTM model with attention, and the F1 score remains within similar margins. Still doing some research on wavelets (April 3rd)...

Adding instruments (Mirko :( )

Adding instrument waveforms to imitate examples with multiple instruments needs to be handled with greater care, otherwise it only improves the F1 metric slightly (LSTM) or even lowers it (Wav2Vec2 backbone). A bug was present that I did not catch before; I'm redoing the experiments.

Fluffy

The idea was to implement a pretrained feature extractor with multiple FCNN (but not necessarily FCNN) heads that serve as disconnected binary instrument classifiers. E.g. if we want to classify 5 instruments, we use a backbone with 5 FCNNs, and each FCNN searches for its "own" instrument among the 5.
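
A minimal sketch of the Fluffy idea, one shared backbone plus one independent binary head per instrument; the head architecture is an assumption.

```python
import torch
import torch.nn as nn

# Sketch of Fluffy: a shared backbone followed by one independent binary head per
# instrument. Each head outputs a single logit for "its" instrument.
class Fluffy(nn.Module):
    def __init__(self, backbone: nn.Module, feature_dim: int, num_instruments: int = 11):
        super().__init__()
        self.backbone = backbone
        self.heads = nn.ModuleList(
            [nn.Sequential(nn.Linear(feature_dim, 128), nn.ReLU(), nn.Linear(128, 1))
             for _ in range(num_instruments)]
        )

    def forward(self, x):
        features = self.backbone(x)                        # (B, feature_dim)
        logits = [head(features) for head in self.heads]   # num_instruments x (B, 1)
        return torch.cat(logits, dim=-1)                   # (B, num_instruments)
```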

Fluffy with Wav2Vec2 feature extractor backbone

As already mentioned, we used only the feature extractor of the pretrained Wav2Vec2 model and completely disposed of the transformer component for efficiency. Up until this point, training was performed for ~35 epochs, and while the average validation F1 metric remains in the 0.5-0.6 region, it varies significantly across instruments. For most instruments the F1 score remains in the 0.6-0.7 range with numerous outliers; on the high end we have the acoustic guitar and the human voice with F1 above 0.8. This is to be expected, considering the backbone was trained on many instances of human voices. On the low end we have the organ with an F1 of ~0.2 and, most likely due to bugs in the code, the electric guitar with an F1 of 0. This could also be attributed to its similarity to other instruments such as the violin or acoustic guitar. This leaves us with a "death rattle" of sorts for the whole "let's use only IRMAS" idea. The idea is to pretrain a feature extractor based on contrastive loss; also, margins within genres and instrument families should be applied. If this doesn't produce better results, the only solution I propose is getting more data, e.g. OpenMIC.

Fluffy with entire Wav2Vec2

This model has been trained for far fewer epochs (~7), and so far it exhibits the same issues as Fluffy with just the feature extractor. Perhaps more training would be needed; however, using such large models requires considerable memory, and their use at inference time might be limited.

Parallel Mobilenets

  • variant 1: create 4 MobileNets which together cover the 11 instruments, forward pass to get features, create 4 FC heads (each FC has 3 instruments), concat the predictions
  • variant 2: create 4 MobileNets which together cover the 11 instruments, forward pass to get features, concat all features

SVM

Introduce an SVM and train it additionally on high-level features of the spectrogram (MFCC). For example, one can calculate the entropy of an audio/spectrogram for a given timeframe (@vinko).

If you have a 3 sec audio clip, calculate ~30 entropies (one every 0.1 sec) and use those entropies as SVM features. Also try using a lot more librosa features.
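
A sketch of that pipeline: spectral entropy every ~0.1 sec as features, fed to per-instrument binary SVMs (one-vs-rest via MultiOutputClassifier, since a single SVC is not natively multilabel); the frame length, n_fft and the 30-feature cutoff are assumptions.

```python
import librosa
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from sklearn.svm import SVC

# Sketch: spectral entropy every ~0.1 s as SVM features.
def entropy_features(path: str, sr: int = 22050, frame_seconds: float = 0.1) -> np.ndarray:
    y, sr = librosa.load(path, sr=sr, mono=True)
    hop = int(frame_seconds * sr)
    spec = np.abs(librosa.stft(y, n_fft=2048, hop_length=hop)) ** 2   # (freq, frames)
    p = spec / (spec.sum(axis=0, keepdims=True) + 1e-10)              # per-frame distribution
    entropy = -(p * np.log2(p + 1e-10)).sum(axis=0)                   # one value per ~0.1 s frame
    return entropy[:30]                                               # ~30 values for a 3 s clip

# X: (n_examples, 30) entropy features, Y: (n_examples, 11) multi-hot labels
# clf = MultiOutputClassifier(SVC(kernel="rbf", probability=True)).fit(X, Y)
```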

➕ Ensemble

The ensemble should combine features of some backbone and Vinko's SVM.

Audio knowledge

Harmonic and Percussive Sounds

https://www.audiolabs-erlangen.de/resources/MIR/FMP/C8/C8S1_HPS.html

Loosely speaking, a harmonic sound is what we perceive as pitched sound, what makes us hear melodies and chords. The prototype of a harmonic sound is the acoustic realization of a sinusoid, which corresponds to a horizontal line in a spectrogram representation. The sound of a violin is another typical example of what we consider a harmonic sound. Again, most of the observed structures in the spectrogram are of horizontal nature (even though they are intermingled with noise-like components). On the other hand, a percussive sound is what we perceive as a clash, a knock, a clap, or a click. The sound of a drum stroke or a transient that occurs in the attack phase of a musical tone are further typical examples. The prototype of a percussive sound is the acoustic realization of an impulse, which corresponds to a vertical line in a spectrogram representation.
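
As a quick illustration, librosa can split a recording into these two components with harmonic-percussive source separation; the file path below is a placeholder.

```python
import librosa

# Sketch: split audio into its harmonic (pitched, horizontal spectrogram structure)
# and percussive (transient, vertical structure) components with HPSS.
y, sr = librosa.load("example.wav", sr=None)        # placeholder path
y_harmonic, y_percussive = librosa.effects.hpss(y)
```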

🔊 Feature extraction

https://pytorch.org/audio/stable/transforms.html
https://pytorch.org/audio/stable/functional.html#feature-extractions

Spectrogram

note: in practice, Mel spectrograms are used instead of the classical spectrogram. We have to normalize spectrogram images just like any other image dataset (mean/std).

https://www.physik.uzh.ch/local/teaching/SPI301/LV-2015-Help/lvanls.chm/STFT_Spectrogram_Core.html#:~:text=frequency%20bins%20specifies%20the%20FFT,The%20default%20is%20512.

Take an audio sequence and perform the STFT (short-time Fourier transform) to get spectra over multiple time intervals. The result is a 2D time-frequency representation (frequency bins x time frames) whose values are magnitudes. The STFT is defined by a time window size (in samples, so it depends on the sampling frequency) and a window type.
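
A short example of the waveform -> Mel spectrogram -> dB pipeline with torchaudio; the parameter values and file path are illustrative, not the project's configuration.

```python
import torchaudio

# Sketch: waveform -> Mel spectrogram -> dB scale. Parameter values are illustrative.
waveform, sample_rate = torchaudio.load("example.wav")        # placeholder path, (channels, samples)
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=1024, hop_length=512, n_mels=128
)(waveform)                                                   # (channels, n_mels, frames)
mel_db = torchaudio.transforms.AmplitudeToDB()(mel)           # log scale, as models expect
```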

Mel-Frequency Cepstral Coefficients (MFCC)

Roughly a "spectrum of the Mel spectrogram": MFCCs are obtained by taking the DCT of the log Mel spectrogram (a cepstral representation):

https://youtu.be/4_SH2nfbQZ8

🥴 Augmentations

Audio augmentations

  • white noise
  • time shift
  • amplitude change / normalization

PyTorch Sox effects

allpass, band, bandpass, bandreject, bass, bend, biquad, chorus, channels, compand, contrast, dcshift, deemph, delay, dither, divide, downsample, earwax, echo, echos, equalizer, fade, fir, firfit, flanger, gain, highpass, hilbert, loudness, lowpass, mcompand, norm, oops, overdrive, pad, phaser, pitch, rate, remix, repeat, reverb, reverse, riaa, silence, sinc, speed, stat, stats, stretch, swap, synth, tempo, treble, tremolo, trim, upsample, vad, vol
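
A sketch of applying a few of the listed effects with torchaudio's sox bindings; the chosen effects and their values are illustrative augmentation choices, and the file path is a placeholder.

```python
import torchaudio
from torchaudio.sox_effects import apply_effects_tensor

# Sketch: apply a few sox effects (pitch shift, tempo change, reverb) to a waveform.
waveform, sample_rate = torchaudio.load("example.wav")   # placeholder path
effects = [
    ["pitch", "100"],            # shift pitch by 100 cents
    ["tempo", "1.1"],            # speed up by 10% without changing pitch
    ["reverb", "50"],            # light reverb
    ["rate", str(sample_rate)],  # keep the original sample rate
]
augmented, new_sample_rate = apply_effects_tensor(waveform, sample_rate, effects)
```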

Spectrum augmentations

SpecAugment: https://ai.googleblog.com/2019/04/specaugment-new-data-augmentation.html
SpecAugment PyTorch: https://github.com/zcaceres/spec_augment
SpecAugment torchaudio: https://pytorch.org/audio/main/tutorials/audio_feature_augmentation_tutorial.html#specaugment

🔀 Data generation

Naive: concat multiple audio sequences into one and merge their labels. Introduce some overlapping, but not too much!

Use the same genre for data generation: combine sounds which come from the same genre instead of different genres

How to sample?

  • sample [3, 5] audio files but don't use more than 4 instruments
  • sample different starting positions at which the audio will start playing
    • START-----x---x----------x--------x----------END
  • cut off the audio sequence at max length?
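
A sketch of that sampling scheme: mix 3-5 clips at random start offsets into one fixed-length example and take the union of their multi-hot labels; the output length, sample rate and peak normalization are assumptions.

```python
import random
import numpy as np

# Sketch: naively mix 3-5 clips (capped at 4 distinct instruments) into one example,
# placing each at a random start offset, and merge the multi-hot labels with a union.
def generate_example(clips, labels, out_len=132300, max_instruments=4):
    # clips: list of 1D float arrays; labels: list of multi-hot int arrays (11,)
    k = random.randint(3, min(5, len(clips)))
    idx = random.sample(range(len(clips)), k)
    mix = np.zeros(out_len, dtype=np.float32)
    label = np.zeros_like(labels[0])
    for i in idx:
        if int((label | labels[i]).sum()) > max_instruments:
            continue                                   # skip clips that add too many instruments
        start = random.randint(0, out_len - 1)
        chunk = clips[i][: out_len - start]            # cut off at max length
        mix[start : start + len(chunk)] += chunk
        label |= labels[i]
    return mix / max(np.abs(mix).max(), 1e-8), label   # peak-normalize after mixing
```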

Torch, Librosa and Kaldi

Librosa and Torch give the same array (sound) if both are normalized and converted to mono.

Librosa gives the same array if you load it with sr=None and resample afterwards, compared to resampling on load.

For best results with AST feature extraction, use torchaudio.load with resampling.

Kaldi

window_shift = int(sample_frequency * frame_shift * 0.001)
window_size = int(sample_frequency * frame_length * 0.001)

Librosa equivalents (hop_length and n_fft in samples) for a 25 ms Hamming window every 10 ms (hop) at 44100 Hz:

n_fft = int(44100 / 1000 * 25) = 1102
hop_length = int(44100 / 1000 * 10) = 441
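
For reference, a hedged example of computing Kaldi-compatible filterbank features via torchaudio, where frame_length and frame_shift are given in milliseconds as above; the file path and num_mel_bins are illustrative.

```python
import torchaudio
import torchaudio.compliance.kaldi as kaldi

# Sketch: Kaldi-style log-Mel filterbanks; frame_length/frame_shift are in milliseconds,
# so the sample counts (window_size/window_shift) above are derived internally.
waveform, sample_rate = torchaudio.load("example.wav")   # placeholder path
fbank = kaldi.fbank(
    waveform,
    sample_frequency=sample_rate,
    frame_length=25.0,   # 25 ms window
    frame_shift=10.0,    # 10 ms hop
    num_mel_bins=128,
)                        # (num_frames, num_mel_bins)
```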
