Consistent Ensemble Distillation for Audio Tagging (CED)

This repo is the source for the ICASSP 2024 paper Consistent Ensemble Distillation for Audio Tagging.

(Framework overview figure)

| Model | Parameters (M) | AS-20K (mAP) | AS-2M (mAP) |
|-----------|-----|------|------|
| CED-Tiny  | 5.5 | 36.5 | 48.1 |
| CED-Mini  | 9.6 | 38.5 | 49.0 |
| CED-Small | 22  | 41.6 | 49.6 |
| CED-Base  | 86  | 44.0 | 50.0 |

  • All models work with 16 kHz audio and use 64-dimensional Mel-spectrograms, making them very fast. CED-Tiny should be faster than MobileNets on a single x86 CPU (even though MACs/FLOPs would indicate otherwise).
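
For reference, here is a minimal sketch of a matching 16 kHz / 64-bin Mel front-end using torchaudio. The file path is a placeholder and the exact frame settings used by inference.py may differ; this only illustrates the expected input format.

import torchaudio
import torchaudio.compliance.kaldi as kaldi

# Load any audio file and resample to the 16 kHz the models expect
wav, sr = torchaudio.load("resources/example.wav")  # placeholder path
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)

# 64-dimensional Kaldi-style log-Mel filterbank features
feats = kaldi.fbank(wav, num_mel_bins=64, sample_frequency=16000)
print(feats.shape)  # (num_frames, 64)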

Pretrained models can be downloaded from Zenodo or Hugging Face.

| Model | Zenodo | Hugging Face |
|-----------|------|------|
| CED-Tiny  | Link | Link |
| CED-Mini  | Link | Link |
| CED-Small | Link | Link |
| CED-Base  | Link | Link |

Demo

We have an online demo of CED-Base available here.

Inference/Usage

To just use the CED models for inference, simply run:

git clone https://github.com/RicherMans/CED/
cd CED/
pip3 install -r requirements.txt
python3 inference.py resources/*

Note that I experienced some problems with newer versions of HDF5, so if possible please use version 1.12.1.
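
If you do hit HDF5-related issues, one way to pin the library is through conda-forge (this assumes a conda environment; it pulls in an h5py build linked against that HDF5 version):

conda install -c conda-forge hdf5=1.12.1 h5py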

By default we use CED-Mini here, which offers a good trade-off between performance and speed. You can switch models with the -m flag:

python3 inference.py -m ced_tiny resources/*
python3 inference.py -m ced_mini resources/*
python3 inference.py -m ced_small resources/*
python3 inference.py -m ced_base resources/*

You can also use the models directly from Hugging Face; see here for usage instructions.
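
As a rough sketch of the Hugging Face route: the model id below is an assumption, so check the model cards linked above for the published names and exact usage.

import torchaudio
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

model_id = "mispeech/ced-tiny"  # assumed id, see the Hugging Face links above
extractor = AutoFeatureExtractor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForAudioClassification.from_pretrained(model_id, trust_remote_code=True)

wav, sr = torchaudio.load("resources/example.wav")  # placeholder path
inputs = extractor(wav.squeeze().numpy(), sampling_rate=sr, return_tensors="pt")
logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])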

Training/Reproducing results

1. Preparing data

First, you need to download AudioSet. You can use one of our own download scripts.

For example, put the downloaded files into folders named data/balanced, data/unbalanced, and data/eval, such as:

data/balanced/
├── -0DdlOuIFUI_50.000.wav
├── -0DLPzsiXXE_30.000.wav
├── -0FHUc78Gqo_30.000.wav
├── -0mjrMposBM_80.000.wav
├── -0O3e95y4gE_100.000.wav
…

data/unbalanced/
├── --04kMEQOAs_0.000_10.000.wav
├── --0aJtOMp2M_30.000_40.000.wav
├── --0AzKXCHj8_22.000_32.000.wav
├── --0B3G_C3qc_10.000_20.000.wav
├── --0bntG9i7E_30.000_40.000.wav
…

data/eval/
├── 007P6bFgRCU_10.000_20.000.wav
├── 00AGIhlv-w0_300.000_310.000.wav
├── 00FBAdjlF4g_30.000_40.000.wav
├── 00G2vNrTnCc_10.000_20.000.wav
├── 00KM53yZi2A_30.000_40.000.wav
├── 00XaUxjGuX8_170.000_180.000.wav
├── 0-2Onbywljo_380.000_390.000.wav

Then just generate a .tsv file with:

find data/balanced/ -type f | awk 'BEGIN{print "filename"}{print}' > data/balanced.tsv

Then dump the data as hdf5 files using scripts/wavlist_to_hdf5.py:

python3 scripts/wavlist_to_hdf5.py data/balanced.tsv data/balanced_train/

This will generate a training data file data/balanced_train/labels/balanced.tsv.
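
Under the hood this roughly amounts to packing the raw waveforms into HDF5 files and writing a new .tsv that points at them. The following is only a simplified illustration, not the actual scripts/wavlist_to_hdf5.py, whose key scheme, dtype and sharding may differ:

import h5py
import pandas as pd
import soundfile as sf

wav_list = pd.read_csv("data/balanced.tsv", sep="\t")["filename"]
with h5py.File("balanced_0.h5", "w") as store:
    for path in wav_list:
        audio, sr = sf.read(path, dtype="int16")
        # one dataset per utterance, keyed by its file path
        store.create_dataset(path, data=audio)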

For the eval data, please use this script to download it.

The resulting eval.tsv should look like this:

filename	labels	hdf5path
data/eval/--4gqARaEJE.wav	73;361;74;72	data/eval_data/hdf5/eval_0.h5
data/eval/--BfvyPmVMo.wav	419	data/eval_data/hdf5/eval_0.h5
data/eval/--U7joUcTCo.wav	47	data/eval_data/hdf5/eval_0.h5
data/eval/-0BIyqJj9ZU.wav	21;20;17	data/eval_data/hdf5/eval_0.h5
data/eval/-0Gj8-vB1q4.wav	273;268;137	data/eval_data/hdf5/eval_0.h5
data/eval/-0RWZT-miFs.wav	379;307	data/eval_data/hdf5/eval_0.h5
data/eval/-0YUDn-1yII.wav	268;137	data/eval_data/hdf5/eval_0.h5
data/eval/-0jeONf82dE.wav	87;137;89;0;72	data/eval_data/hdf5/eval_0.h5
data/eval/-0nqfRcnAYE.wav	364	data/eval_data/hdf5/eval_0.h5
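
A quick sanity check that a generated .tsv has the expected layout (a small helper sketch; the eval.tsv path below is an assumption):

import pandas as pd

df = pd.read_csv("data/eval_data/labels/eval.tsv", sep="\t")
# training/eval lists need these three columns
assert {"filename", "labels", "hdf5path"} <= set(df.columns)
# labels are semicolon-separated AudioSet class indices
print(df["labels"].str.split(";").explode().astype(int).nunique(), "distinct classes")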

2. Download logits

Download the logits used in the paper from Zenodo:

wget https://zenodo.org/record/8275347/files/logits.zip?download=1 -O logits.zip
unzip logits.zip

This will create:

logits/
└── ensemble5014
    ├── balanced
    │   └── chunk_10
    └── full
        └── chunk_10

3. Train

python3 run.py train trainconfig/balanced_mixup_tiny_T_ensemble5014_chunk10.yaml

ONNX export and inference

python3 export_onnx.py -m ced_tiny
# or: ced_mini, ced_small, ced_base
python3 onnx_inference_with_kaldi.py test.wav -m ced_tiny.onnx
python3 onnx_inference_with_torchaudio.py test.wav -m ced_tiny.onnx

Why use Kaldi to compute the Mel features? Because a ready-made C++ implementation is available: https://github.com/csukuangfj/kaldi-native-fbank/tree/master
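
As a rough outline of what the ONNX path does, combining kaldi-native-fbank features with onnxruntime (input names/shapes are assumptions here; the actual handling lives in onnx_inference_with_kaldi.py):

import numpy as np
import onnxruntime as ort
import soundfile as sf
import kaldi_native_fbank as knf

samples, sr = sf.read("test.wav", dtype="float32")  # expected to be 16 kHz mono

opts = knf.FbankOptions()
opts.frame_opts.samp_freq = sr
opts.mel_opts.num_bins = 64  # CED uses 64 Mel bins
fbank = knf.OnlineFbank(opts)
fbank.accept_waveform(sr, samples.tolist())
fbank.input_finished()
feats = np.stack([fbank.get_frame(i) for i in range(fbank.num_frames_ready)])

sess = ort.InferenceSession("ced_tiny.onnx")
input_name = sess.get_inputs()[0].name  # exact name/shape depend on export_onnx.py
scores = sess.run(None, {input_name: feats[None].astype(np.float32)})[0]
print(scores.argmax())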

Training on your own data

This is a label-free framework, meaning that any audio data can be used for optimization. To use your own data, do the following:

Put your data somewhere and generate a .tsv file with a single column named filename, such as:

find some_directory -type f | awk 'BEGIN{print "filename"}{print}' > my_data.tsv

Then dump the corresponding hdf5 file using scripts/wavlist_to_hdf5.py:

python3 scripts/wavlist_to_hdf5.py my_data.tsv my_data_hdf5/

Then run the script save_logits.py as:

torchrun save_logits.py logitconfig/balanced_base_chunk10s_topk20.yaml --train_data my_data_hdf5/labels/my_data.tsv
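
Since save_logits.py is launched through torchrun, spreading the logit extraction over multiple GPUs should just be a matter of the standard launcher flags, e.g. (assuming 4 local GPUs):

torchrun --nproc_per_node=4 save_logits.py logitconfig/balanced_base_chunk10s_topk20.yaml --train_data my_data_hdf5/labels/my_data.tsv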

Finally, you can train your own model on the logit-augmented dataset with:

python3 run.py train trainconfig/balanced_mixup_base_T_ensemble5014_chunk10.yaml --logitspath YOUR_LOGITS_PATH --train_data YOUR_TRAIN_DATA.tsv

HEAR Evaluation

We also submitted the models to the HEAR benchmark evaluation. HEAR uses a simple linear downstream evaluation protocol across 19 tasks. We extracted features for all CED models from the penultimate layer. The repo can be found here.

| Model | Beehive States Avg | Beijing Opera Percussion | CREMA-D | DCASE16 | ESC-50 | FSD50K | GTZAN Genre | GTZAN Music Speech | Gunshot Triangulation | LibriCount | MAESTRO 5hr | Mridangam Stroke | Mridangam Tonic | NSynth Pitch 50hr | NSynth Pitch 5hr | Speech Commands 5hr | Speech Commands Full | Vocal Imitations | VoxLingua107 Top10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ced-tiny | 38.345 | 94.90 | 62.52 | 88.02 | 95.80 | 62.73 | 89.20 | 93.01 | 91.67 | 61.26 | 4.81 | 96.13 | 90.74 | 69.19 | 44.00 | 70.53 | 77.10 | 19.18 | 33.64 |
| ced-mini | 59.17 | 96.18 | 65.26 | 90.66 | 95.35 | 63.88 | 90.30 | 94.49 | 86.01 | 64.02 | 8.29 | 96.56 | 93.32 | 75.20 | 55.60 | 77.38 | 81.96 | 20.37 | 34.67 |
| ced-small | 51.70 | 96.60 | 66.64 | 91.63 | 95.95 | 64.33 | 89.50 | 91.22 | 93.45 | 65.59 | 10.96 | 96.82 | 93.94 | 79.95 | 60.20 | 80.92 | 85.19 | 21.92 | 36.53 |
| ced-base | 48.35 | 96.60 | 69.10 | 92.19 | 96.65 | 65.48 | 88.60 | 94.36 | 89.29 | 67.85 | 14.76 | 97.43 | 96.55 | 82.81 | 68.20 | 86.93 | 89.67 | 22.69 | 38.57 |

Android APKs

Thanks to csukuangfj, there are also pre-compiled Android APKs using sherpa-onnx.

Binaries are available on the k2-fsa sherpa page, https://k2-fsa.github.io/sherpa/onnx/audio-tagging/apk.html.

Citation

Please cite our paper if you find this work useful:

@inproceedings{dinkel2023ced,
  title={CED: Consistent ensemble distillation for audio tagging},
  author={Dinkel, Heinrich and Wang, Yongqing and Yan, Zhiyong and Zhang, Junbo and Wang, Yujun},
  booktitle={ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2024}
}
