perfect 5th detection in mir_eval.key is asymmetric #337

iansimon · 2021-03-02T00:04:09Z

Line 145 in 576aad4

# If keys are the same mode and a perfect fifth (differ by 7 semitones)

Is this intended to be asymmetric? E.g. if estimated = G major and reference = C major the score is 0.5, but if estimated = C major and reference = G major the score is zero.

craffel · 2021-03-02T01:53:17Z

I do not know music theory :( @bmcfee ?

bmcfee · 2021-03-02T11:19:40Z

From reading the comments, I think this is behaving to spec / consistent with the mirex implementation.

I also think it's weird, but 🤷

bmcfee · 2021-03-02T13:58:33Z

To follow up on this a bit, yes, it's intended to be asymmetric, and the docstring is unambiguous about the behavior:

mir_eval/mir_eval/key.py

Lines 99 to 115 in 576aad4

    
           def weighted_score(reference_key, estimated_key): 
        
               """Computes a heuristic score which is weighted according to the 
        
               relationship of the reference and estimated key, as follows: 
        
               +------------------------------------------------------+-------+ 
        
               | Relationship                                         | Score | 
        
               +------------------------------------------------------+-------+ 
        
               | Same key and mode                                    | 1.0   | 
        
               +------------------------------------------------------+-------+ 
        
               | Estimated key is a perfect fifth above reference key | 0.5   | 
        
               +------------------------------------------------------+-------+ 
        
               | Relative major/minor (same key signature)            | 0.3   | 
        
               +------------------------------------------------------+-------+ 
        
               | Parallel major/minor (same key)                      | 0.2   | 
        
               +------------------------------------------------------+-------+ 
        
               | Other                                                | 0.0   | 
        
               +------------------------------------------------------+-------+

I suppose this was originally intended to break symmetry between 5th and 4th estimation errors by only considering movement in one direction from the root. I did try to dig into this more in the mirex wiki, but there's not much provenance around how this metric came about. Maybe @jpauwels has some insight, as I think he was the most recent task caption for key detection?

jpauwels · 2021-03-08T22:55:19Z

Hi all, as far as I know, it was never intended to make a distinction between perfect fifth and perfect fourth errors (as long as the mode is the same). It certainly isn't the case in the current implementation used for MIREX. That doesn't mean it's impossible that there is some ambiguously worded description somewhere on the wiki. Are you referring to the following table: https://www.music-ir.org/mirex/wiki/2005:Audio_and_Symbolic_Key (copied over every year since)? In that case, it should be read as perfect fifth either way, not just perfect fifth above.

The canonical reference I use for this measure is Emilia Gomez's PhD thesis (section 4.3.1.2), but I suspect the exact weighting was determined collaboratively with other participants of MIREX 2005. Between us, I've never been a fan of this particular measure, too many magic numbers in there. I always suggest listing the error categories separately and did so in my own papers, but I still include this combination measure in MIREX for historical comparisons.

stefan-baumann · 2021-07-23T14:22:54Z

Seconding what @jpauwels said, he actually stated that in the 2017 MIREX results, too:

The NEMA system was retired this year, since a bug has been found in the calculation of the results. Keys with tonics related by a fifth and the same mode (a.k.a. adjacent keys) are supposed to get a score of 0.5, but only ascending fifths (going from ground-truth to estimation) were counted, not descending ones. It has been brought to my attention that the description of the measure on the wiki has been ambiguous for years, and probably the NEMA implementer got confused by this. However, the intention has always been to count ascending and descending fifth (or fourth) relationships between the tonics (in my humble opinion).

This is taken directly from the 2017 MIREX Key Detection results (https://www.music-ir.org/mirex/wiki/2017:Audio_Key_Detection_Results)

So apparently the "official" behavior changed from only counting ascending ones to also counting descending ones in 2017. It would be really useful if both approaches were implemented, maybe even adding a warning to the "old" approach that notifies users that they are using a metric that might not correspond with the one they're expecting.

I personally just spent multiple days trying to find a bug on my end of the code while reproducing prior MIREX results, only to find out that the one thing where I didn't expect the cause of the issue (who would expect the MIR algorithm evaluation library to output different results than the MIR algorithm evaluation campaign) is actually the cause. Adding a warning to the "old" approach could be really useful for cases like mine without breaking prior code.

I'll happily submit a PR adding a second method or additional argument and a warning incase the "old" method is used if this is the direction that the maintainers want to go with this.

bmcfee · 2021-07-23T15:22:36Z

So apparently the "official" behavior changed from only counting ascending ones to also counting descending ones in 2017. It would be really useful if both approaches were implemented, maybe even adding a warning to the "old" approach that notifies users that they are using a metric that might not correspond with the one they're expecting.

I would support this change. AFAIK the mir_eval implementation was based on mirex in ~2014, so if mirex changed later, it's not surprising that we got out of sync.

'll happily submit a PR adding a second method or additional argument and a warning incase the "old" method is used if this is the direction that the maintainers want to go with this.

🚀 please do!

stefan-baumann · 2021-07-23T16:47:55Z

@bmcfee Here you go - I implemented it in the way I would've intuitively done it and created a draft PR (if the feedback on the API choice is positive, I'll implement the tests and mark it as ready) for it at #339

stefan-baumann mentioned this issue Jul 23, 2021

[WIP] Implement current MIREX key scoring method and warn if using old one #339

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perfect 5th detection in mir_eval.key is asymmetric #337

perfect 5th detection in mir_eval.key is asymmetric #337

iansimon commented Mar 2, 2021

craffel commented Mar 2, 2021

bmcfee commented Mar 2, 2021

bmcfee commented Mar 2, 2021

jpauwels commented Mar 8, 2021

stefan-baumann commented Jul 23, 2021 •

edited

bmcfee commented Jul 23, 2021

stefan-baumann commented Jul 23, 2021 •

edited

perfect 5th detection in mir_eval.key is asymmetric #337

perfect 5th detection in mir_eval.key is asymmetric #337

Comments

iansimon commented Mar 2, 2021

craffel commented Mar 2, 2021

bmcfee commented Mar 2, 2021

bmcfee commented Mar 2, 2021

jpauwels commented Mar 8, 2021

stefan-baumann commented Jul 23, 2021 • edited

bmcfee commented Jul 23, 2021

stefan-baumann commented Jul 23, 2021 • edited

stefan-baumann commented Jul 23, 2021 •

edited

stefan-baumann commented Jul 23, 2021 •

edited