Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PICARD-2655: Improve matching for VA releases #1918

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

zas
Copy link
Collaborator

@zas zas commented Oct 13, 2021

  • if an album was incorrectly tagged with a label name or whatever, instead of Various Artists
    Picard tries to match Various Artists (from database) to the name, usually leading to a very low similarity, reducing a lot the chance to find the correct release

  • a contrario, if an album was tagged with Various Artists, it is very likely it is a VA compilation, so increase the weight

Summary

  • This is a…
    • Bug fix
    • Feature addition
    • Refactoring
    • Minor / simple change (like a typo)
    • Other
  • Describe this change in 1-2 sentences:

Problem

Solution

Action

picard/metadata.py Show resolved Hide resolved
- if an album was incorrectly tagged with a label name or whatever, instead of Various Artists
Picard tries to match Various Artists (from database) to the name, usually leading to a very low similarity, reducing a lot the chance to find the correct release

- a contrario, if an album was tagged with Various Artists, it is very likely it is a VA compilation, so increase the weight
@zas zas changed the title Improve matching for VA releases PICARD-2655: Improve matching for VA releases Jun 11, 2023
@zas zas requested a review from phw June 11, 2023 16:39
@zas zas marked this pull request as ready for review June 11, 2023 16:39
# it is fairly common VA release are tagged with label's name
# so if a release is credited to VA, assume it is similar
# to artist in tag
sim = 1.0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am entirely unsure that this logic is at all correct for various reasons (though it may equally be that I don't understand how this is supposed to work) but:

  1. I think that we should be considering a situation where we give the artist similarity a value of 1 ONLY if the album artist is "various" or "various artists" AND the release we are matching has a VA_ID. As the code stands if the album artist is "The Beatles" and the release we are matching is VA, artist similarity is 1. Or are we going to assume that any release which is VA is effectively going to match any album artist regardless of whether it is "various"?

  2. What happens if the album is not in english e.g. the album is a portuguese language album and the album artist is "vários artistas"? I think we need a much longer list of VA textual names which covers all languages that we support. (Edit: Not sure how text comparisons work where there are accents - but we might want not only to make a lower case comparison as in the current code, but also to remove accents.)

  3. I am unclear why we are doubling the weighting for artist if the album is "various" rather than halving it. In the specific use case described in the PR description, if the album artist is "various" and it is failing to match a release because the release artist is the label rather than "various" then surely we should give less weight to the artists not matching rather than more weight?

Do we need some unit tests with various albums and various potential releases to test that we get the best release selected?

Edited: to add note about accented characters.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd love have tests for that... thanks for volunteering ;)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am happy to volunteer to write tests for this providing that you are prepared to wait until I have the time to spend on open source and for this to get to the top of my queue of open source stuff I have planned. 😉

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Sophist-UK that we likely should combine the tests of tags indicating VA and the release being VA.

The goal of the patch is to improve the matching of files from a VA release to VA. And one of the problems here is that pure matching of album artist tag with the MB artist name ("Various Artists") often does not yield good results.

The code will indeed improve the situation. But at the same time it will make matching a VA release from MB to any file more likely, as all VA releases will always have full albumartist similarity.That might increase the problem that files get tagged too much against compilation, something many users already struggle with.

Combining the check of existing album artist tags indicating VA and the release being VA would avoid this. For the localization I would suggest to check for a.lower() in {'various', 'various artists', config.setting['va_name'].lower()}.

This will not solve the case where the album artist is set to the label name or something completely different, though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@phw

A. I am really not sure that we should use config.setting['va_name'] as this is stated in the documentation to be the output name when you use a VA_ID release - not the least for file naming so that the user can group together albums) - so this is (usually) the users local idiom for VA. But (clearly) a user can be trying to match a VA album from a different country, so Picard needs to recognise these. This is why I suggested a list (constant -> one off effort) of V/VA for every locale supported. And this same list could then be used by MB for helping improve the data, so the same effort could then also be reused. (P.S. It won't be perfect as a first iteration, but Google / Deepl translate can give you a first cut for translations of VA without too much effort. If it will help I can find the time do this and post the results here for reuse.)

B. There are probably four use cases to consider, two where the album artist is set to the label name:

  1. The album artist MBID is the special VA_ID but the album artist text is set to something other than (one of the translations of) "VA". Here Picard can know for certain that the release is a VA. The PR description has album artist as label as the use case, but the current code seems to use this more general use case.
  2. The album artist MBID is not the special VA_ID even though the album is in fact a VA. I am not sure that we can do anything in Picard about this because in most cases (see exception cases that follow) it is indistinguishable from a non-VA album.
  3. Special case of 2., where the MBID is not VA_ID but the text album artist is "Various Artists" (or a translation). Here Picard can also reasonably assume that this is a VA.
  4. Special case of 2., where the MBID is not VA_ID but the album artist text is the same as (or very similar to) the label. This is another variation of the PR description use case, and perhaps Picard should try to fix this too, especially if Picard is not going to tweak the data but rather just tweak the fuzzy search.

Of course the principle cause of all three use cases is bad data - so the primary fix should be to have an MB report to identify use cases 1, 3 and 4 and have editors fix the data. (Or indeed, have some code for case 3 to auto-create (voteable) edits to fix the album artist MBID.) However, as per this PR Picard should still try to handle bad data because it will never be perfect - and the code should probably deal with cases 1 and 3 and perhaps also 4 (since it is only tweaking the fuzzy match and not making hard changes to the data).

C. That said, we should also consider whether we want Picard to change the data if this release is chosen. If we genuinely believe that use cases 1, 2 and 4 are detectable and we are tweaking the fuzzy search, should we also tweak the MB data to "improve" it? (This is much more contentious, and I put this up for discussion because I have not yet come to an ethical/practical personal conclusion.) If we do this, we would have to do it elsewhere in the code because once the Album has an Album MBID, fuzzy search code is not executed.

D. I think that fuzzy match fudges like this to attempt to correct bad MBID data should have an options switch to enable / disable them. And if we decide to fix the data too, that definitely should have a switch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants