hear-preprocess

Dataset preprocessing code for the HEAR Benchmark and for all the tasks used during the 2021 HEAR NeurIPS challenge. To find out more about HEAR please visit https://hearbenchmark.com.

Unless you need to pre-process HEAR benchmark tasks yourself or want to contribute a task, you won't need this repo. Use hear-eval-kit to evaluate your embedding models on these tasks.

Pre-processed datasets (at 48000 Hz) for all HEAR Benchmark tasks are available on Zenodo. Other sampling rates (16000, 22050, 32000, 44100) are available for download (requester pays) from Google Storage at gs://hear2021-archive/tasks/
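
Because the bucket is requester-pays, any access has to name a Google Cloud project to bill. As a minimal sketch, the helper below builds a gsutil listing command using gsutil's top-level -u option. The function name and the project ID are placeholders for illustration and are not part of this repo.

```python
# Bucket path quoted from the text above.
ARCHIVE_URL = "gs://hear2021-archive/tasks/"


def build_gsutil_ls_command(billing_project):
    """Return a gsutil command that lists the archive, billed to billing_project."""
    if not billing_project:
        raise ValueError("a billing project is required for requester-pays buckets")
    # -u is gsutil's top-level option naming the project to bill.
    return ["gsutil", "-u", billing_project, "ls", ARCHIVE_URL]


if __name__ == "__main__":
    # "my-gcp-project" is a placeholder, not a real project ID.
    print(" ".join(build_gsutil_ls_command("my-gcp-project")))
```

Listing first is a cheap way to see which files exist before copying anything, since the layout under tasks/ is not described here.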

This preprocessing is slow and disk-intensive but safe and careful.

Cloud Usage

See hear-eval's README.spotty for information on how to use spotty.

Installation

pip3 install hearpreprocess

Tested with Python 3.7 and 3.8. Python 3.9 is not officially supported because pip3 installs are very finicky, but it might work.

Development

Clone repo:

git clone https://github.com/hearbenchmark/hear-preprocess
cd hear-preprocess

Install in development mode:

pip3 install -e ".[dev]"

Make sure you have pre-commit hooks installed:

pre-commit install

Running tests:

python3 -m pytest

Preprocessing

You probably don't need to do this unless you can't use the available pre-processed datasets and need to preprocess the data yourself.

If you want to run preprocessing yourself:

You will need ffmpeg>=4.2 installed (possibly from conda-forge).
You will need soxr support, which might require package libsox-fmt-ffmpeg or installing from source.

These Luigi pipelines are used to preprocess the evaluation tasks into a common format for downstream evaluation.

To run the preprocessing pipeline for all available tasks, with all available modes for each task:

python3 -m hearpreprocess.runner all --mode all

You can instead call a single specific task:

python3 -m hearpreprocess.runner task1 --mode all

or specific multiple tasks:

python3 -m hearpreprocess.runner task1 task2 --mode all
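
If you drive the runner from a script, the same invocations can be assembled programmatically. This is a small sketch that only builds the argument list shown above; build_runner_command is a hypothetical helper, not something this package provides.

```python
def build_runner_command(tasks, mode="all"):
    """Build the argument list for `python3 -m hearpreprocess.runner`."""
    if not tasks:
        raise ValueError("at least one task name (or 'all') is required")
    # Task names come first, followed by the --mode option, as in the examples above.
    return ["python3", "-m", "hearpreprocess.runner", *tasks, "--mode", mode]


if __name__ == "__main__":
    cmd = build_runner_command(["speech_commands", "nsynth_pitch"], mode="small")
    # Print only; pass cmd to subprocess.run(cmd, check=True) to actually execute it.
    print(" ".join(cmd))
```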

Tasks

List of available tasks used in HEAR 2021:

Task Name	Modes
dcase2016_task2	full
nsynth_pitch	5h, 50h
speech_commands	5h, full
beehive_states_fold0	5h, full
beehive_states_fold1	5h, full
beijing_opera	full
esc50	full
fsd50k	full
gunshot_triangulation	full
libricount	full
maestro	5h
mridangam_stroke	full
mridangam_tonic	full
tfds_crema_d	full
tfds_gtzan	full
tfds_gtzan_music_speech	full
vocal_imitation	full
vox_lingua_top10	full
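
For scripting, the table above can be transcribed into a plain dictionary. This is only a sketch for lookups: the mapping is copied from the table by hand, the helper name is made up, and the package's own task configuration remains the source of truth.

```python
# Task name -> modes, transcribed from the table above.
TASK_MODES = {
    "dcase2016_task2": ["full"],
    "nsynth_pitch": ["5h", "50h"],
    "speech_commands": ["5h", "full"],
    "beehive_states_fold0": ["5h", "full"],
    "beehive_states_fold1": ["5h", "full"],
    "beijing_opera": ["full"],
    "esc50": ["full"],
    "fsd50k": ["full"],
    "gunshot_triangulation": ["full"],
    "libricount": ["full"],
    "maestro": ["5h"],
    "mridangam_stroke": ["full"],
    "mridangam_tonic": ["full"],
    "tfds_crema_d": ["full"],
    "tfds_gtzan": ["full"],
    "tfds_gtzan_music_speech": ["full"],
    "vocal_imitation": ["full"],
    "vox_lingua_top10": ["full"],
}


def tasks_with_mode(mode):
    """Return the sorted names of all tasks that list the given mode."""
    return sorted(task for task, modes in TASK_MODES.items() if mode in modes)


if __name__ == "__main__":
    print(tasks_with_mode("5h"))
```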

Pipelines

Each pipeline will download and preprocess each dataset according to the following DAG:

DownloadCorpus
ExtractArchive
ExtractMetadata: Create splits over the entire corpus and find the label metadata for them.
SubcorpusSplit (subsample each split) => MonoWavSplit => TrimPadSplit => SubcorpusData (symlinks)
SubcorpusData => {SubcorpusMetadata, ResampleSubcorpus}
SubcorpusMetadata => MetadataVocabulary
FinalCombine => TarCorpus => FinalizeCorpus
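
To make the ordering concrete, the DAG above can be written as a dependency map and sorted topologically. This is an illustrative sketch, not the Luigi code: it reads each arrow as "runs before", treats the first three stages as a chain, and assumes FinalCombine consumes ResampleSubcorpus, SubcorpusMetadata and MetadataVocabulary, which the list does not state explicitly.

```python
from collections import deque

# Stage -> stages it depends on, read off the list above.
DEPENDS_ON = {
    "DownloadCorpus": [],
    "ExtractArchive": ["DownloadCorpus"],
    "ExtractMetadata": ["ExtractArchive"],
    "SubcorpusSplit": ["ExtractMetadata"],
    "MonoWavSplit": ["SubcorpusSplit"],
    "TrimPadSplit": ["MonoWavSplit"],
    "SubcorpusData": ["TrimPadSplit"],
    "SubcorpusMetadata": ["SubcorpusData"],
    "ResampleSubcorpus": ["SubcorpusData"],
    "MetadataVocabulary": ["SubcorpusMetadata"],
    # Assumption: the inputs of FinalCombine are not spelled out in the text.
    "FinalCombine": ["ResampleSubcorpus", "SubcorpusMetadata", "MetadataVocabulary"],
    "TarCorpus": ["FinalCombine"],
    "FinalizeCorpus": ["TarCorpus"],
}


def execution_order(depends_on):
    """Kahn's algorithm: return stages so that each follows all of its dependencies."""
    remaining = {stage: len(deps) for stage, deps in depends_on.items()}
    dependents = {stage: [] for stage in depends_on}
    for stage, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(stage)
    ready = deque(stage for stage, count in remaining.items() if count == 0)
    order = []
    while ready:
        stage = ready.popleft()
        order.append(stage)
        for nxt in dependents[stage]:
            remaining[nxt] -= 1
            if remaining[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(depends_on):
        raise ValueError("dependency graph contains a cycle")
    return order


if __name__ == "__main__":
    print(" -> ".join(execution_order(DEPENDS_ON)))
```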

In terms of sampling:

We create a 60/20/20 split if train/valid/test does not exist.
We cap each split at 3/1/1 hours of audio, defined as
If further small sampling happens, it chooses a particular number of audio samples per task.
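
The two main sampling rules can be sketched as follows. This is not the pipeline's implementation: the seeded shuffle, the function names, and the "stop before exceeding the budget" rule are assumptions chosen for illustration, and the sketch assumes the 3/1/1 caps apply to train/valid/test in that order.

```python
import random


def split_60_20_20(items, seed=0):
    """Shuffle deterministically, then split into 60/20/20 train/valid/test."""
    shuffled = sorted(items)  # sort first so the result does not depend on input order
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 60 // 100
    n_valid = len(shuffled) * 20 // 100
    return {
        "train": shuffled[:n_train],
        "valid": shuffled[n_train:n_train + n_valid],
        "test": shuffled[n_train + n_valid:],
    }


def cap_by_duration(files_with_seconds, max_hours):
    """Keep files in order until adding the next one would exceed max_hours."""
    budget = max_hours * 3600.0
    kept, total = [], 0.0
    for name, seconds in files_with_seconds:
        if total + seconds > budget:
            break
        kept.append(name)
        total += seconds
    return kept


if __name__ == "__main__":
    splits = split_60_20_20(["clip{}.wav".format(i) for i in range(10)])
    print({name: len(files) for name, files in splits.items()})
```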

These commands will download and preprocess the entire dataset. An intermediary directory defined by the option luigi-dir (default _workdir) will be created, and then a final directory defined by the option tasks-dir (default tasks) will contain the completed dataset.

Options:

Options:
  --num-workers INTEGER  Number of CPU workers to use when running. If not
                         provided all CPUs are used.
  --sample-rate INTEGER  Perform resampling only to this sample rate. By
                         default we resample to 16000, 22050, 44100, 48000.
  --tmp-dir TEXT         Temporary directory to save all the intermediate
                         tasks (will not be deleted afterwards). (default:
                         _workdir/)
  --tasks-dir TEXT       Directory to save the final task output (default:
                         tasks/)
  --tar-dir TEXT         Directory to save the tar'ed output (default: .)
  --mode TEXT            default, all, or small mode for each task.
  --help                 Show this message and exit.

To check the stats of an audio directory:

python3 -m hearpreprocess.audio_dir_stats {input folder} {output json file}

Stats include: audio_count, audio_samplerate_count, and the mean, median, and certain (10, 25, 75, 90) percentile durations. This gives a quick overview of the audio files in a folder and helps in deciding the preprocessing configuration.
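
As a rough illustration of the duration statistics, the sketch below computes them from a list of durations in seconds. It is not the audio_dir_stats code: apart from audio_count, the key names are invented, and the linear-interpolation percentile is one common convention that may differ from what the tool uses.

```python
import statistics


def percentile(sorted_values, pct):
    """Linear-interpolation percentile of an already sorted, non-empty list."""
    if not sorted_values:
        raise ValueError("need at least one value")
    position = (len(sorted_values) - 1) * pct / 100.0
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + (sorted_values[upper] - sorted_values[lower]) * fraction


def duration_stats(durations):
    """Summarize a list of clip durations (seconds)."""
    ordered = sorted(durations)
    stats = {
        "audio_count": len(ordered),
        "duration_mean": statistics.mean(ordered),      # hypothetical key name
        "duration_median": statistics.median(ordered),  # hypothetical key name
    }
    for pct in (10, 25, 75, 90):
        stats["duration_p{}".format(pct)] = percentile(ordered, pct)
    return stats


if __name__ == "__main__":
    print(duration_stats([1.0, 2.0, 3.0, 4.0, 5.0]))
```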

The pipeline will also generate some stats of the original and preprocessed data sets, e.g.:

speech_commands-v0.0.2/01-ExtractArchive/test_stats.json
speech_commands-v0.0.2/01-ExtractArchive/train_stats.json
speech_commands-v0.0.2/03-ExtractMetadata/labelcount_test.json
speech_commands-v0.0.2/03-ExtractMetadata/labelcount_train.json
speech_commands-v0.0.2/03-ExtractMetadata/labelcount_valid.json
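
To inspect these files in one go, something like the following can gather them. This is a sketch that assumes the files are plain JSON and live under the intermediate work directory; collect_stats is a hypothetical helper, and the file name patterns are inferred from the examples above.

```python
import json
from pathlib import Path


def collect_stats(workdir):
    """Map each *_stats.json or labelcount_*.json under workdir to its parsed JSON."""
    root = Path(workdir)
    found = {}
    for pattern in ("*_stats.json", "labelcount_*.json"):
        for path in sorted(root.rglob(pattern)):
            with path.open() as handle:
                # Keys are paths relative to workdir, with forward slashes.
                found[path.relative_to(root).as_posix()] = json.load(handle)
    return found


if __name__ == "__main__":
    # "_workdir" is the default intermediate directory named in the text above.
    for name in collect_stats("_workdir"):
        print(name)
```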

Faster preprocessing, for development

The small flag runs the preprocessing pipeline on a small version of each dataset stored at Downsampled HEAR Open Tasks. This is used for development and continuous integration tests for the pipeline.

These small versions of the data can be generated deterministically with the following command:

python3 -m hearpreprocess.sampler <taskname>

NOTE: --mode small is used to run the task on a small version of the dataset for development.

Breaking change for hear-eval

If the open tasks have changed enough to break downstream CI (for example, in the heareval repo), the Preprocessed Downsampled HEAR Open Tasks should be updated. An obvious example of a breaking change is a modification of the task configuration.

The version should be bumped in hearpreprocess/__init__.py, and the pipeline should be run for the open tasks with the --mode small flag.

Thereafter, the following commands can be used to copy the tarred files produced by running the pipeline for the open tasks to the repo (please clone the repo first):

git clone git@github.com:hearbenchmark/hear2021-open-tasks-downsampled.git
cp hear-LATEST-speech_commands-v0.0.2-small-44100.tar.gz ./hear2021-open-tasks-downsampled/preprocessed/
cp hear-LATEST-nsynth_pitch-v2.2.3-small-44100.tar.gz ./hear2021-open-tasks-downsampled/preprocessed/
cp hear-LATEST-dcase2016_task2-hear2021-small-44100.tar.gz ./hear2021-open-tasks-downsampled/preprocessed/
cp hear-2021.0.6-speech_commands-v0.0.2-small-44100.tar.gz ./hear2021-open-tasks-downsampled/preprocessed/
cp hear-2021.0.6-nsynth_pitch-v2.2.3-small-44100.tar.gz ./hear2021-open-tasks-downsampled/preprocessed/
cp hear-2021.0.6-dcase2016_task2-hear2021-small-44100.tar.gz ./hear2021-open-tasks-downsampled/preprocessed/
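
The file names above follow a visible pattern, so the list for a new version can be generated rather than typed by hand. This is a sketch based only on the six names shown: the helper names are made up, and the pattern is inferred from those examples rather than taken from the pipeline code.

```python
# Open tasks and their dataset versions, as they appear in the file names above.
OPEN_TASKS = [
    ("speech_commands", "v0.0.2"),
    ("nsynth_pitch", "v2.2.3"),
    ("dcase2016_task2", "hear2021"),
]


def tarball_name(version, task, task_version, mode="small", sample_rate=44100):
    """Build a name like hear-<version>-<task>-<task_version>-<mode>-<rate>.tar.gz."""
    return "hear-{}-{}-{}-{}-{}.tar.gz".format(
        version, task, task_version, mode, sample_rate
    )


def tarballs_to_copy(version):
    """List the LATEST and versioned tarballs for every open task."""
    names = []
    for prefix in ("LATEST", version):
        for task, task_version in OPEN_TASKS:
            names.append(tarball_name(prefix, task, task_version))
    return names


if __name__ == "__main__":
    for name in tarballs_to_copy("2021.0.6"):
        print(name)
```

Each name can then be copied into hear2021-open-tasks-downsampled/preprocessed/ with cp or shutil.copy.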
