fuzzy: Match substrings fuzzily #2043

Phyks · 2016-06-08T18:10:29Z

Hi,

For a project I have, I need to match Youtube video titles against my own music collection managed by beets. Problem is Youtube videos have very different titles, and often have extra noise like "(official video !!!)" which prevents from using beet ls directly.

I came up with some heuristics to sanitize them, but still, this is not really reliable.

I am not sure if anyone already did it (but I could not find it) or if it might be interesting either for beet or anyone here to have a full text search in beet?

In case it might be interesting, either to be merged or as a plugin, I am open to any feedback or advice. I was planning on using whoosh for my particular prototyping case.

Thanks

The text was updated successfully, but these errors were encountered:

jackwilsdon · 2016-06-08T18:11:42Z

What do you mean by "full text search"? It sounds like you might be looking for the fuzzy plugin.

Phyks · 2016-06-08T18:15:15Z

I mean something as what ElasticSearch does, or SQL MATCH or even Google. Just typing some search and getting back results with an associated pertinence score.

For my particular use case, it would allow me to search for "Michael Jackson - Black or white (original version) 1999" directly in beet and still get a result, whereas the search now does not return any result because it cannot match everything.

It would also be resilient to typos I think.

jackwilsdon · 2016-06-08T18:17:33Z

That certainly sounds like what the fuzzy plugin:

The fuzzy plugin provides a prefixed query that searches your library using fuzzy pattern matching. This can be useful if you want to find a track with complicated characters in the title.

You can adjust the threshold (i.e. sensitivity) in your configuration as well.

Phyks · 2016-06-08T18:26:08Z

Indeed, I missed it, my bad.

Still, it seems it is only looking at track title and not performing a query on every available field as regular ls do. Or (more likely) I missed something…

sampsyo · 2016-06-08T18:50:12Z

I believe the plugin should query the standard set of fields, unless you tell it not to (as with ordinary queries).

Phyks · 2016-06-08T19:15:33Z

Hmm, I have in my config:

plugins: … fuzzy
fuzzy:
    prefix: "~"
    threshold: 0.8

(I set a higher threshold to get less false positives)

Then, beet -l good.blb -c ~/.beets/base.conf.yaml ls "~michael" returns anything but musics from "Michael Jackson". When looking at the results, some are definitely coming from a song title match (exact match with michael, or songs named "camel" or stuff like this), some are coming from an album name match ("Miracles" for instance) but I do not see anyone obviously coming from a match in artist name.

Note that the standard search beet -l good.blb -c ~/.beets/base.conf.yaml ls "michael" does return my Michael Jackson discography.

Lowering the threshold increases a lot the number of false positives, but does not seem to give any Michael Jackson song either.

sampsyo · 2016-06-08T19:47:26Z

I think the similarity is proportional to the entire field, not just a substring. Have you tried something like ~'michael jackso'?

Phyks · 2016-06-08T20:32:51Z

Indeed, it is the case. And using ~"michael jackso" works. Though, it no longer works if I use more sophisticated string like ~"mickael jackson - black or white".

What I am looking for is the same thing as SQL MATCH syntax:

> SELECT artist.name AS artist, album.name AS album, title, MATCH (artist.name, album.name, song.title) AGAINST ("michael jackson - black or white (original version) 1999" IN BOOLEAN MODE) AS score FROM song LEFT JOIN artist ON song.artist = artist.id LEFT JOIN album ON song.album = album.id WHERE MATCH (artist.name, album.name, song.title) AGAINST ("michael jackson - black or white (original version) 1999" IN BOOLEAN MODE) ORDER BY score DESC LIMIT 1;
+-----------------+---------------------------+----------------+-------+
| artist          | album                     | title          | score |
+-----------------+---------------------------+----------------+-------+
| Michael Jackson | Essential Michael Jackson | Black or White |     4 |
+-----------------+---------------------------+----------------+-------+

I am not sure whether it exists or not already in beets, or if it could be of any use to anyone else than me?

sampsyo · 2016-06-08T21:10:31Z

Sure, it might make sense to extend the fuzzy plugin for this purpose. Substring fuzzy matching is probably more intuitive anyway. Would that make sense for you?

Phyks · 2016-06-09T09:38:47Z

Yes, I think that would do it.

sampsyo · 2016-06-09T23:44:25Z

Cool! I've updated the title to reflect that idea.

carreter · 2024-02-23T15:55:43Z

Sorry for the necro, but I just wanted to say this would be immensely helpful. I don't think it would be too difficult to implement, either.

Off the top of my head, I think the way to do this would be to change how the ratio threshold is calculated. The current implementation uses difflib to get an upper bound on the similarity between a query and a database entry:

beets/beetsplug/fuzzy.py

Lines 26 to 34 in dae5257

    
           class FuzzyQuery(StringFieldQuery): 
        
               @classmethod 
        
               def string_match(cls, pattern, val): 
        
                   # smartcase 
        
                   if pattern.islower(): 
        
                       val = val.lower() 
        
                   query_matcher = difflib.SequenceMatcher(None, pattern, val) 
        
                   threshold = config["fuzzy"]["threshold"].as_number() 
        
                   return query_matcher.quick_ratio() >= threshold

Looking at difflib this ratio is defined as 2.0 * matched_characters / (len(pattern) + len(val)). Notably, matched_characters <= min(len(pattern) + len(val)), so if pattern is 1/10th the size of val the highest match you're going to get is 1/ (10 + 1) = 0.09.

What I propose is that the threshold calculation be changed to:

threshold = config["fuzzy"]["threshold"].as_number()
if len(pattern) < len(val):
    max_possible_ratio = len(pattern) / (len(pattern) + len(val))
    threshold *= max_possible_ratio

This should not impact performance at all and should solve this issue. Happy to put up a PR!

carreter · 2024-03-11T23:01:36Z

Code snippet I provided was slightly wrong, should be 2 * len(pattern) / (len(pattern) + len(val)).

I also noticed that using quick_ratio alone can result in too much matching (it's an upper bound, so it can result in the algo being too lenient), so I changed the method to calculate the exact ratio if quick_ratio meets the threshold. This should improve accuracy with minimal performance cost.

sampsyo added the needinfo We need more details or follow-up from the filer before this can be tagged "bug" or "feature." label Jun 8, 2016

sampsyo changed the title ~~[Idea] Full text search in beet?~~ fuzzy: Match substrings fuzzily Jun 9, 2016

sampsyo added feature features we would like to implement and removed needinfo We need more details or follow-up from the filer before this can be tagged "bug" or "feature." labels Jun 9, 2016

carreter linked a pull request Mar 11, 2024 that will close this issue

Match substrings fuzzily #5140

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fuzzy: Match substrings fuzzily #2043

fuzzy: Match substrings fuzzily #2043

Phyks commented Jun 8, 2016

jackwilsdon commented Jun 8, 2016 •

edited

Phyks commented Jun 8, 2016

jackwilsdon commented Jun 8, 2016

Phyks commented Jun 8, 2016

sampsyo commented Jun 8, 2016

Phyks commented Jun 8, 2016

sampsyo commented Jun 8, 2016

Phyks commented Jun 8, 2016

sampsyo commented Jun 8, 2016

Phyks commented Jun 9, 2016

sampsyo commented Jun 9, 2016

carreter commented Feb 23, 2024

carreter commented Mar 11, 2024 •

edited

fuzzy: Match substrings fuzzily #2043

fuzzy: Match substrings fuzzily #2043

Comments

Phyks commented Jun 8, 2016

jackwilsdon commented Jun 8, 2016 • edited

Phyks commented Jun 8, 2016

jackwilsdon commented Jun 8, 2016

Phyks commented Jun 8, 2016

sampsyo commented Jun 8, 2016

Phyks commented Jun 8, 2016

sampsyo commented Jun 8, 2016

Phyks commented Jun 8, 2016

sampsyo commented Jun 8, 2016

Phyks commented Jun 9, 2016

sampsyo commented Jun 9, 2016

carreter commented Feb 23, 2024

carreter commented Mar 11, 2024 • edited

jackwilsdon commented Jun 8, 2016 •

edited

carreter commented Mar 11, 2024 •

edited