
Principles

mir_eval should implement common metrics used to evaluate music information retrieval and audio signal processing algorithms. All metrics should be based on a pre-existing publication. Each implementation should be well-documented and as "transparent" as possible, so that it is easy to understand how a metric is computed and to modify it. mir_eval should also be modular, so that tasks common across metrics have their own functions, both to prevent code duplication and so that individual subtasks can be replaced easily.

  • Each metric for each task has its own function in mir_eval.<task>. A metric function should not do any data loading or preprocessing; it should work on raw annotations.
  • All shared/non-domain-specific functionality (e.g. F-measure, sampling intervals) should be in mir_eval.util.
  • Any shared functionality across metrics of a single task which is not meant to be used outside the context of computing a metric should be defined in underscore-prefixed functions in that task’s submodule.
  • Any task-specific preprocessing functions should go in the task’s submodule.
  • Each task should have an evaluator which performs all data loading, preprocessing, and evaluation. The evaluators should not define any new functionality, but instead should provide a usage example and a black-box system for going from annotation to score.
  • All metric functions should have example usage in their docstring which includes loading/pre-processing.
  • Each task should include a decorator, applied to each metric function, that validates input data (see the sketch below).
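
As a minimal sketch of these conventions (the names `_validate` and `example_metric` here are hypothetical, not part of mir_eval's API), a metric function operating on raw annotations, wrapped by a validation decorator, might look like:

```python
import functools

import numpy as np


def _validate(metric):
    """Hypothetical decorator: check raw annotations before scoring."""
    @functools.wraps(metric)
    def wrapper(reference_events, estimated_events, **kwargs):
        for events in (reference_events, estimated_events):
            if np.any(events < 0):
                raise ValueError('Event times must be non-negative.')
            if np.any(np.diff(events) < 0):
                raise ValueError('Event times must be sorted.')
        return metric(reference_events, estimated_events, **kwargs)
    return wrapper


@_validate
def example_metric(reference_events, estimated_events, window=0.5):
    """Hypothetical metric: fraction of reference events matched by an
    estimate within +/- window seconds."""
    hits = sum(np.any(np.abs(estimated_events - event) <= window)
               for event in reference_events)
    return float(hits) / len(reference_events)
```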

Submodules

mir_eval has submodules both for evaluation of specific tasks and for common functionality/utility.

Task-specific

These submodules contain metrics for a specific MIR/signal processing "task", to be used for quantitative analysis.

beat

The beat submodule replicates the functionality of the Beat Evaluation Toolbox.

melody

The melody submodule implements all metrics from "Melody Extraction from Polyphonic Music Signals: Approaches, Applications and Challenges".
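
For example, computing raw pitch accuracy might look like the following sketch (these are the current function names; treat exact signatures as assumptions):

```python
import mir_eval

# Load (time, frequency) series; non-positive frequencies mark unvoiced frames.
ref_time, ref_freq = mir_eval.io.load_time_series('reference.txt')
est_time, est_freq = mir_eval.io.load_time_series('estimated.txt')

# Resample both series to a common time base and convert Hz to cents.
(ref_voicing, ref_cent,
 est_voicing, est_cent) = mir_eval.melody.to_cent_voicing(
    ref_time, ref_freq, est_time, est_freq)

# Fraction of voiced reference frames whose estimated pitch is within
# 50 cents of the reference pitch.
rpa = mir_eval.melody.raw_pitch_accuracy(
    ref_voicing, ref_cent, est_voicing, est_cent)
```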

segment

The segment submodule implements all of the MIREX metrics:

  • boundary retrieval precision, recall, and F-measure, as used in "A Supervised Approach for Detecting Boundaries in Music Using Difference Features and Boosting" and "Structural Segmentation of Musical Audio by Constrained Clustering"
  • pairwise precision, recall, and F-measure, also from "Structural Segmentation of Musical Audio by Constrained Clustering"
  • normalized conditional entropy-based over- and under-segmentation scores, as described in "Towards Quantitative Measures of Evaluating Song Segmentation"
  • the clustering Rand index
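
As a sketch of two of these metrics with the current API (the loader name here, load_labeled_intervals, may differ from the load_annotation name used elsewhere on this page):

```python
import mir_eval

# Load segment annotations as (intervals, labels) pairs.
ref_intervals, ref_labels = mir_eval.io.load_labeled_intervals('reference.lab')
est_intervals, est_labels = mir_eval.io.load_labeled_intervals('estimated.lab')

# Boundary retrieval precision, recall, and F-measure with a 0.5 s window.
precision, recall, f_measure = mir_eval.segment.detection(
    ref_intervals, est_intervals, window=0.5)

# Pairwise frame clustering precision, recall, and F-measure.
pw_precision, pw_recall, pw_f = mir_eval.segment.pairwise(
    ref_intervals, ref_labels, est_intervals, est_labels)
```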

separation

The separation submodule replicates the functionality of the BSS-eval toolbox.

onset

The onset submodule computes the precision, recall, and F-measure of estimated onset times, as described in "Evaluating the Online Capabilities of Onset Detection Methods".
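
For example (a sketch; the 50 ms tolerance window is mir_eval's default):

```python
import mir_eval

# Onset annotations are plain lists of event times, one per line.
reference_onsets = mir_eval.io.load_events('reference_onsets.txt')
estimated_onsets = mir_eval.io.load_events('estimated_onsets.txt')

# An estimated onset is a hit if it lies within +/- 50 ms of a
# reference onset.
f_measure, precision, recall = mir_eval.onset.f_measure(
    reference_onsets, estimated_onsets, window=0.05)
```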

chord

The chord submodule contains functionality for mapping chords into different dialects (e.g. maj/min, triads, quads) and for computing frame-wise accuracy.
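
A sketch of a frame-wise maj/min evaluation (these helper names follow the current mir_eval API and should be treated as assumptions):

```python
import mir_eval

ref_intervals, ref_labels = mir_eval.io.load_labeled_intervals('reference.lab')
est_intervals, est_labels = mir_eval.io.load_labeled_intervals('estimated.lab')

# Merge the two annotations onto a single shared set of intervals.
intervals, ref_labels, est_labels = mir_eval.util.merge_labeled_intervals(
    ref_intervals, ref_labels, est_intervals, est_labels)

# Compare chords under the maj/min dialect, then weight each comparison
# by its interval's duration.
comparisons = mir_eval.chord.majmin(ref_labels, est_labels)
durations = mir_eval.util.intervals_to_durations(intervals)
score = mir_eval.chord.weighted_accuracy(comparisons, durations)
```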

pattern

The pattern submodule contains functions to compute the standard F-measure, precision, and recall (F, P, R); the establishment F-measure, precision, and recall (F_est, P_est, R_est); the occurrence F-measure, precision, and recall (F_occ, P_occ, R_occ); the three-layer F-measure, precision, and recall (F_3, P_3, R_3); and the first five target proportion metric (FFP).
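
For instance, the standard F, P, R might be computed as follows (a sketch; standard_FPR is the current function name and its exact signature is an assumption):

```python
import mir_eval

# Patterns are lists of occurrences; each occurrence is a list of
# (onset time, MIDI pitch) pairs.
reference_patterns = mir_eval.io.load_patterns('reference.txt')
estimated_patterns = mir_eval.io.load_patterns('estimated.txt')

# Standard precision/recall/F-measure over exactly matching patterns.
f, p, r = mir_eval.pattern.standard_FPR(reference_patterns,
                                        estimated_patterns)
```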

Generic/utility

These submodules contain functionality shared across tasks, as well as tools for qualitative analysis.

sonify

Functions for creating audio signals from algorithm output, including synthesizing clicks for temporal events, chords, and chromagrams.
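
For example, a click track for beat annotations might be synthesized like this (a sketch; mir_eval.sonify.clicks is the current function name, and writing the result with scipy is just one option):

```python
import numpy as np
import scipy.io.wavfile

import mir_eval

fs = 22050
beats = mir_eval.io.load_events('estimated_beats.txt')

# Synthesize a click at each beat time (30 seconds of audio here).
click_track = mir_eval.sonify.clicks(beats, fs, length=fs * 30)

# Write out as 16-bit PCM for listening.
scipy.io.wavfile.write('clicks.wav', fs,
                       (click_track * 32767).astype(np.int16))
```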

display

Functions for plotting the output of different algorithms.

util

Utility functions (pre-processing, basic metrics) shared across tasks.
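
For example, the shared F-measure helper lives here (a sketch assuming mir_eval.util.f_measure's current signature):

```python
import mir_eval

precision, recall = 0.8, 0.6

# Weighted harmonic mean of precision and recall; beta=1 weights them equally.
f = mir_eval.util.f_measure(precision, recall, beta=1.0)
```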

io

Functions for reading in data from files.

Evaluation chains

beat:

  • mir_eval.io.load_events
  • mir_eval.beat._clean_beats
  • compute all metrics (see the sketch below)
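
In code, this chain might look like the following sketch (the _clean_beats step is an internal underscore function, so a user would typically just call the metric functions):

```python
import mir_eval

reference_beats = mir_eval.io.load_events('reference_beats.txt')
estimated_beats = mir_eval.io.load_events('estimated_beats.txt')

# One of the beat metrics: F-measure with the standard +/- 70 ms window.
f = mir_eval.beat.f_measure(reference_beats, estimated_beats)
```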

segment:

  • mir_eval.io.load_annotation
  • mir_eval.util.adjust_intervals
  • compute all metrics

melody:

  • mir_eval.io.load_time_series
  • mir_eval.melody.hz2cents …

onset:

  • mir_eval.io.load_events
  • compute all metrics

separation:

  • load audio either with scipy.io or librosa
  • resample either with scipy.signal or librosa
  • make signals the same length
  • compute bss_eval metrics (see the sketch below)
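
A sketch of this chain using scipy for loading (librosa would work equally well; bss_eval_sources follows the current mir_eval API):

```python
import numpy as np
import scipy.io.wavfile

import mir_eval

# Load reference and estimated sources (assumed mono and at a common
# sample rate, so no resampling step is shown here).
fs, ref1 = scipy.io.wavfile.read('reference_source1.wav')
_, ref2 = scipy.io.wavfile.read('reference_source2.wav')
_, est1 = scipy.io.wavfile.read('estimated_source1.wav')
_, est2 = scipy.io.wavfile.read('estimated_source2.wav')

# Make all signals the same length by truncating to the shortest.
n = min(len(ref1), len(ref2), len(est1), len(est2))
reference_sources = np.vstack([ref1[:n], ref2[:n]]).astype(np.float64)
estimated_sources = np.vstack([est1[:n], est2[:n]]).astype(np.float64)

# BSS-eval metrics; perm gives the best-matching source permutation.
sdr, sir, sar, perm = mir_eval.separation.bss_eval_sources(
    reference_sources, estimated_sources)
```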

chord:

  • mir_eval.io.load_annotation
  • Reduce chord alphabet
  • Sample sequences
  • Score frame-by-frame

pattern:

  • mir_eval.io.load_patterns
  • compute all metrics

Coding style

All code should achieve a perfect pylint score, with docstrings everywhere in Sphinx format.