Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fuzzy: Match substrings fuzzily #2043

Open
Phyks opened this issue Jun 8, 2016 · 13 comments · May be fixed by #5140
Open

fuzzy: Match substrings fuzzily #2043

Phyks opened this issue Jun 8, 2016 · 13 comments · May be fixed by #5140
Labels
feature features we would like to implement

Comments

@Phyks
Copy link

Phyks commented Jun 8, 2016

Hi,

For a project I have, I need to match Youtube video titles against my own music collection managed by beets. Problem is Youtube videos have very different titles, and often have extra noise like "(official video !!!)" which prevents from using beet ls directly.

I came up with some heuristics to sanitize them, but still, this is not really reliable.

I am not sure if anyone already did it (but I could not find it) or if it might be interesting either for beet or anyone here to have a full text search in beet?

In case it might be interesting, either to be merged or as a plugin, I am open to any feedback or advice. I was planning on using whoosh for my particular prototyping case.

Thanks

@jackwilsdon
Copy link
Member

jackwilsdon commented Jun 8, 2016

What do you mean by "full text search"? It sounds like you might be looking for the fuzzy plugin.

@Phyks
Copy link
Author

Phyks commented Jun 8, 2016

I mean something as what ElasticSearch does, or SQL MATCH or even Google. Just typing some search and getting back results with an associated pertinence score.

For my particular use case, it would allow me to search for "Michael Jackson - Black or white (original version) 1999" directly in beet and still get a result, whereas the search now does not return any result because it cannot match everything.

It would also be resilient to typos I think.

@jackwilsdon
Copy link
Member

That certainly sounds like what the fuzzy plugin:

The fuzzy plugin provides a prefixed query that searches your library using fuzzy pattern matching. This can be useful if you want to find a track with complicated characters in the title.

You can adjust the threshold (i.e. sensitivity) in your configuration as well.

@Phyks
Copy link
Author

Phyks commented Jun 8, 2016

Indeed, I missed it, my bad.

Still, it seems it is only looking at track title and not performing a query on every available field as regular ls do. Or (more likely) I missed something…

@sampsyo sampsyo added the needinfo We need more details or follow-up from the filer before this can be tagged "bug" or "feature." label Jun 8, 2016
@sampsyo
Copy link
Member

sampsyo commented Jun 8, 2016

I believe the plugin should query the standard set of fields, unless you tell it not to (as with ordinary queries).

@Phyks
Copy link
Author

Phyks commented Jun 8, 2016

Hmm, I have in my config:

plugins: … fuzzy
fuzzy:
    prefix: "~"
    threshold: 0.8

(I set a higher threshold to get less false positives)

Then, beet -l good.blb -c ~/.beets/base.conf.yaml ls "~michael" returns anything but musics from "Michael Jackson". When looking at the results, some are definitely coming from a song title match (exact match with michael, or songs named "camel" or stuff like this), some are coming from an album name match ("Miracles" for instance) but I do not see anyone obviously coming from a match in artist name.

Note that the standard search beet -l good.blb -c ~/.beets/base.conf.yaml ls "michael" does return my Michael Jackson discography.

Lowering the threshold increases a lot the number of false positives, but does not seem to give any Michael Jackson song either.

@sampsyo
Copy link
Member

sampsyo commented Jun 8, 2016

I think the similarity is proportional to the entire field, not just a substring. Have you tried something like ~'michael jackso'?

@Phyks
Copy link
Author

Phyks commented Jun 8, 2016

Indeed, it is the case. And using ~"michael jackso" works. Though, it no longer works if I use more sophisticated string like ~"mickael jackson - black or white".

What I am looking for is the same thing as SQL MATCH syntax:

> SELECT artist.name AS artist, album.name AS album, title, MATCH (artist.name, album.name, song.title) AGAINST ("michael jackson - black or white (original version) 1999" IN BOOLEAN MODE) AS score FROM song LEFT JOIN artist ON song.artist = artist.id LEFT JOIN album ON song.album = album.id WHERE MATCH (artist.name, album.name, song.title) AGAINST ("michael jackson - black or white (original version) 1999" IN BOOLEAN MODE) ORDER BY score DESC LIMIT 1;
+-----------------+---------------------------+----------------+-------+
| artist          | album                     | title          | score |
+-----------------+---------------------------+----------------+-------+
| Michael Jackson | Essential Michael Jackson | Black or White |     4 |
+-----------------+---------------------------+----------------+-------+

I am not sure whether it exists or not already in beets, or if it could be of any use to anyone else than me?

@sampsyo
Copy link
Member

sampsyo commented Jun 8, 2016

Sure, it might make sense to extend the fuzzy plugin for this purpose. Substring fuzzy matching is probably more intuitive anyway. Would that make sense for you?

@Phyks
Copy link
Author

Phyks commented Jun 9, 2016

Yes, I think that would do it.

@sampsyo sampsyo changed the title [Idea] Full text search in beet? fuzzy: Match substrings fuzzily Jun 9, 2016
@sampsyo sampsyo added feature features we would like to implement and removed needinfo We need more details or follow-up from the filer before this can be tagged "bug" or "feature." labels Jun 9, 2016
@sampsyo
Copy link
Member

sampsyo commented Jun 9, 2016

Cool! I've updated the title to reflect that idea.

@carreter
Copy link

Sorry for the necro, but I just wanted to say this would be immensely helpful. I don't think it would be too difficult to implement, either.

Off the top of my head, I think the way to do this would be to change how the ratio threshold is calculated. The current implementation uses difflib to get an upper bound on the similarity between a query and a database entry:

beets/beetsplug/fuzzy.py

Lines 26 to 34 in dae5257

class FuzzyQuery(StringFieldQuery):
@classmethod
def string_match(cls, pattern, val):
# smartcase
if pattern.islower():
val = val.lower()
query_matcher = difflib.SequenceMatcher(None, pattern, val)
threshold = config["fuzzy"]["threshold"].as_number()
return query_matcher.quick_ratio() >= threshold

Looking at difflib this ratio is defined as 2.0 * matched_characters / (len(pattern) + len(val)). Notably, matched_characters <= min(len(pattern) + len(val)), so if pattern is 1/10th the size of val the highest match you're going to get is 1/ (10 + 1) = 0.09.

What I propose is that the threshold calculation be changed to:

threshold = config["fuzzy"]["threshold"].as_number()
if len(pattern) < len(val):
    max_possible_ratio = len(pattern) / (len(pattern) + len(val))
    threshold *= max_possible_ratio

This should not impact performance at all and should solve this issue. Happy to put up a PR!

@carreter carreter linked a pull request Mar 11, 2024 that will close this issue
3 tasks
@carreter
Copy link

carreter commented Mar 11, 2024

Code snippet I provided was slightly wrong, should be 2 * len(pattern) / (len(pattern) + len(val)).

I also noticed that using quick_ratio alone can result in too much matching (it's an upper bound, so it can result in the algo being too lenient), so I changed the method to calculate the exact ratio if quick_ratio meets the threshold. This should improve accuracy with minimal performance cost.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature features we would like to implement
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants