Adding an underscore character to valid characters ignores underscore #37

Open
lazzarello opened this issue May 3, 2022 · 1 comment

Comments

@lazzarello

Describe the bug
After adding a special character (an underscore) to valid_chars_for_string, results that do not contain that character are still returned. They are only excluded once the input has two characters that miss.

To Reproduce

Initialized with

import string
from fast_autocomplete import AutoComplete

_valid_chars = '_' + string.ascii_lowercase
words = {'i_love_code': {'count': 5}, 'island': {'count': 2}, 'ironman': {'count': 2}, 'i_love_coding': {'count': 2}, 'i_love_machine_learning': {'count': 3}}
autocomplete = AutoComplete(words=words, synonyms={}, valid_chars_for_string=_valid_chars)
search_string = 'iron_'  # one of the simulated inputs listed below
autocomplete.search(word=search_string, max_cost=1, size=10)

Formatted output with simulated input:

Valid Characters: _abcdefghijklmnopqrstuvwxyz
Search Input 'i' : i_love_code, i_love_machine_learning, island, ironman, i_love_coding
Search Input 'ir' : ironman
Search Input 'iro' : ironman
Search Input 'iron' : ironman
Search Input 'iron_' : ironman
Search Input 'iron_m' : ironman
Search Input 'iron_ma' : 
Search Input 'iron_mai' : 
Search Input 'iron_maid' : 
Search Input 'iron_maide' : 
Search Input 'iron_maiden' : 

Search Input 'i' : i_love_code, i_love_machine_learning, island, ironman, i_love_coding, iron_maiden
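
A listing in this format can be produced by feeding successively longer prefixes of one input string to a search function. The sketch below is only an illustration of that loop: `prefixes`, `format_run`, and `stub_search` are hypothetical helper names, and `stub_search` is a plain prefix filter standing in for the real search object so the example runs without the library installed.

```python
from typing import Callable, Iterable, List


def prefixes(text: str) -> List[str]:
    """Return every non-empty prefix of text, shortest first."""
    return [text[:n] for n in range(1, len(text) + 1)]


def format_run(full_input: str, search: Callable[[str], Iterable[str]]) -> List[str]:
    """Run search on each prefix and format one report line per prefix."""
    lines = []
    for prefix in prefixes(full_input):
        results = ", ".join(search(prefix))
        lines.append(f"Search Input '{prefix}' : {results}")
    return lines


def stub_search(prefix: str) -> List[str]:
    """Hypothetical stand-in: exact prefix filtering, no fuzzy matching."""
    words = ["ironman", "island"]
    return [w for w in words if w.startswith(prefix)]


if __name__ == "__main__":
    for line in format_run("iron_maiden", stub_search):
        print(line)
```

To reproduce the report itself, `stub_search` would be replaced by a small wrapper around `autocomplete.search(word=..., max_cost=1, size=10)` that turns its results into plain strings; the exact shape of that wrapper is an assumption here.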

Expected behavior
Just as the input 'ir' excludes 'i_love_code', I would expect 'iron_' to exclude 'ironman', and so forth. From this output, it looks like the search only begins to exclude 'ironman' when the input reaches 'iron_ma'.

OS, DeepDiff version and Python version (please complete the following information):

  • OS: Ubuntu 21.04 + pyenv
  • Python 3.9.7

Additional context

This seems to have something to do with the max_cost parameter. If I raise it above 2, it matches even more unexpected results.
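
To illustrate why an edit budget can let an extra underscore slip through, here is a minimal sketch that assumes max_cost bounds a Levenshtein-style edit distance. That is an assumption about the library, and the code compares whole strings only; it does not reproduce the library's prefix-aware search or the exact cut-off seen above. `levenshtein` and `within_budget` are hypothetical helpers.

```python
from typing import Iterable, List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def within_budget(query: str, words: Iterable[str], max_cost: int) -> List[str]:
    """Keep the words whose whole-string distance to query is at most max_cost."""
    return [w for w in words if levenshtein(query, w) <= max_cost]


if __name__ == "__main__":
    # One deleted underscore is a single edit, so a budget of 1 admits 'ironman'.
    print(levenshtein("iron_man", "ironman"))
    print(within_budget("iron_man", ["ironman", "island"], max_cost=1))
```

Under this model a stray underscore costs exactly one edit, so any budget of 1 or more cannot tell 'iron_man' from 'ironman', and a larger budget admits more distant words.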

@seperman
Owner

seperman commented Dec 9, 2022

Hi @lazzarello
The fuzzy matching logic still sees enough similarity between the input and 'ironman' to include it in the results. You are right that the underscore character is treated differently: internally, we convert all spaces into underscores. Maybe we should switch from the underscore to a rarely used Unicode character for that purpose.
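
The collision described here can be sketched as follows. This is not the library's code: `normalize` and `collides` are hypothetical helpers, and U+241F is an arbitrary example of a rarely used character, not a choice the maintainer named.

```python
# Placeholder the comment says is used today for spaces.
CURRENT_PLACEHOLDER = "_"
# Hypothetical alternative: U+241F (SYMBOL FOR UNIT SEPARATOR), picked arbitrarily.
PROPOSED_PLACEHOLDER = "\u241f"


def normalize(text: str, placeholder: str) -> str:
    """Replace every space with the placeholder character."""
    return text.replace(" ", placeholder)


def collides(a: str, b: str, placeholder: str) -> bool:
    """True when two different inputs become identical after normalization."""
    return a != b and normalize(a, placeholder) == normalize(b, placeholder)


if __name__ == "__main__":
    # With '_' as the placeholder, a typed space and a literal underscore merge.
    print(collides("iron man", "iron_man", CURRENT_PLACEHOLDER))
    # With a rarely used placeholder, the literal underscore stays distinct.
    print(collides("iron man", "iron_man", PROPOSED_PLACEHOLDER))
```

With an underscore as the placeholder, a user-supplied underscore is indistinguishable from a converted space, which is the ambiguity the comment points at; a placeholder that users never type avoids it.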
