PICARD-2655: Improve matching for VA releases #1918

zas · 2021-10-13T13:00:21Z

if an album was incorrectly tagged with a label name or whatever, instead of Various Artists
Picard tries to match Various Artists (from database) to the name, usually leading to a very low similarity, reducing a lot the chance to find the correct release
a contrario, if an album was tagged with Various Artists, it is very likely it is a VA compilation, so increase the weight

Summary

This is a…
- Bug fix
- Feature addition
- Refactoring
- Minor / simple change (like a typo)
- Other
Describe this change in 1-2 sentences:

Problem

JIRA ticket (optional): PICARD-2655

Solution

Action

picard/metadata.py

- if an album was incorrectly tagged with a label name or whatever, instead of Various Artists Picard tries to match Various Artists (from database) to the name, usually leading to a very low similarity, reducing a lot the chance to find the correct release - a contrario, if an album was tagged with Various Artists, it is very likely it is a VA compilation, so increase the weight

Sophist-UK · 2023-06-11T17:00:24Z

picard/metadata.py

+                # it is fairly common VA release are tagged with label's name
+                # so if a release is credited to VA, assume it is similar
+                # to artist in tag
+                sim = 1.0


I am entirely unsure that this logic is at all correct for various reasons (though it may equally be that I don't understand how this is supposed to work) but:

I think that we should be considering a situation where we give the artist similarity a value of 1 ONLY if the album artist is "various" or "various artists" AND the release we are matching has a VA_ID. As the code stands if the album artist is "The Beatles" and the release we are matching is VA, artist similarity is 1. Or are we going to assume that any release which is VA is effectively going to match any album artist regardless of whether it is "various"?

What happens if the album is not in english e.g. the album is a portuguese language album and the album artist is "vários artistas"? I think we need a much longer list of VA textual names which covers all languages that we support. (Edit: Not sure how text comparisons work where there are accents - but we might want not only to make a lower case comparison as in the current code, but also to remove accents.)

I am unclear why we are doubling the weighting for artist if the album is "various" rather than halving it. In the specific use case described in the PR description, if the album artist is "various" and it is failing to match a release because the release artist is the label rather than "various" then surely we should give less weight to the artists not matching rather than more weight?

Do we need some unit tests with various albums and various potential releases to test that we get the best release selected?

Edited: to add note about accented characters.

I'd love have tests for that... thanks for volunteering ;)

I am happy to volunteer to write tests for this providing that you are prepared to wait until I have the time to spend on open source and for this to get to the top of my queue of open source stuff I have planned. 😉

I agree with Sophist-UK that we likely should combine the tests of tags indicating VA and the release being VA.

The goal of the patch is to improve the matching of files from a VA release to VA. And one of the problems here is that pure matching of album artist tag with the MB artist name ("Various Artists") often does not yield good results.

The code will indeed improve the situation. But at the same time it will make matching a VA release from MB to any file more likely, as all VA releases will always have full albumartist similarity.That might increase the problem that files get tagged too much against compilation, something many users already struggle with.

Combining the check of existing album artist tags indicating VA and the release being VA would avoid this. For the localization I would suggest to check for a.lower() in {'various', 'various artists', config.setting['va_name'].lower()}.

This will not solve the case where the album artist is set to the label name or something completely different, though.

@phw

A. I am really not sure that we should use config.setting['va_name'] as this is stated in the documentation to be the output name when you use a VA_ID release - not the least for file naming so that the user can group together albums) - so this is (usually) the users local idiom for VA. But (clearly) a user can be trying to match a VA album from a different country, so Picard needs to recognise these. This is why I suggested a list (constant -> one off effort) of V/VA for every locale supported. And this same list could then be used by MB for helping improve the data, so the same effort could then also be reused. (P.S. It won't be perfect as a first iteration, but Google / Deepl translate can give you a first cut for translations of VA without too much effort. If it will help I can find the time do this and post the results here for reuse.)

B. There are probably four use cases to consider, two where the album artist is set to the label name:

The album artist MBID is the special VA_ID but the album artist text is set to something other than (one of the translations of) "VA". Here Picard can know for certain that the release is a VA. The PR description has album artist as label as the use case, but the current code seems to use this more general use case.

The album artist MBID is not the special VA_ID even though the album is in fact a VA. I am not sure that we can do anything in Picard about this because in most cases (see exception cases that follow) it is indistinguishable from a non-VA album.

Special case of 2., where the MBID is not VA_ID but the text album artist is "Various Artists" (or a translation). Here Picard can also reasonably assume that this is a VA.

Special case of 2., where the MBID is not VA_ID but the album artist text is the same as (or very similar to) the label. This is another variation of the PR description use case, and perhaps Picard should try to fix this too, especially if Picard is not going to tweak the data but rather just tweak the fuzzy search.

Of course the principle cause of all three use cases is bad data - so the primary fix should be to have an MB report to identify use cases 1, 3 and 4 and have editors fix the data. (Or indeed, have some code for case 3 to auto-create (voteable) edits to fix the album artist MBID.) However, as per this PR Picard should still try to handle bad data because it will never be perfect - and the code should probably deal with cases 1 and 3 and perhaps also 4 (since it is only tweaking the fuzzy match and not making hard changes to the data).

C. That said, we should also consider whether we want Picard to change the data if this release is chosen. If we genuinely believe that use cases 1, 2 and 4 are detectable and we are tweaking the fuzzy search, should we also tweak the MB data to "improve" it? (This is much more contentious, and I put this up for discussion because I have not yet come to an ethical/practical personal conclusion.) If we do this, we would have to do it elsewhere in the code because once the Album has an Album MBID, fuzzy search code is not executed.

D. I think that fuzzy match fudges like this to attempt to correct bad MBID data should have an options switch to enable / disable them. And if we decide to fix the data too, that definitely should have a switch.

zas force-pushed the va_match branch from a76101e to 4ef9846 Compare October 13, 2021 13:05

zas commented Oct 13, 2021

View reviewed changes

picard/metadata.py Show resolved Hide resolved

zas added the Decision required label Jun 11, 2023

zas force-pushed the va_match branch from 4ef9846 to 5650862 Compare June 11, 2023 16:26

zas force-pushed the va_match branch from 5650862 to 86fee33 Compare June 11, 2023 16:38

zas changed the title ~~Improve matching for VA releases~~ PICARD-2655: Improve matching for VA releases Jun 11, 2023

zas requested a review from phw June 11, 2023 16:39

zas marked this pull request as ready for review June 11, 2023 16:39

Sophist-UK reviewed Jun 11, 2023

View reviewed changes

zas added the Do not merge yet label Jun 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PICARD-2655: Improve matching for VA releases #1918

PICARD-2655: Improve matching for VA releases #1918

zas commented Oct 13, 2021 •

edited

Sophist-UK Jun 11, 2023 •

edited

zas Jun 11, 2023

Sophist-UK Jun 11, 2023

phw Jun 12, 2023

Sophist-UK Jun 12, 2023 •

edited

PICARD-2655: Improve matching for VA releases #1918

Are you sure you want to change the base?

PICARD-2655: Improve matching for VA releases #1918

Conversation

zas commented Oct 13, 2021 • edited

Summary

Problem

Solution

Action

Sophist-UK Jun 11, 2023 • edited

Choose a reason for hiding this comment

zas Jun 11, 2023

Choose a reason for hiding this comment

Sophist-UK Jun 11, 2023

Choose a reason for hiding this comment

phw Jun 12, 2023

Choose a reason for hiding this comment

Sophist-UK Jun 12, 2023 • edited

Choose a reason for hiding this comment

zas commented Oct 13, 2021 •

edited

Sophist-UK Jun 11, 2023 •

edited

Sophist-UK Jun 12, 2023 •

edited