
How to search not only from first word? #10

Open

eiva opened this issue Nov 2, 2019 · 10 comments

@eiva

eiva commented Nov 2, 2019

Hello, is there any way to provide autosuggestions based not only on the first word?

words = {
    'acura zdx': {},
    'acura abc': {},
    'bmw test': {},
    'bmw coupe': {},
}
synonyms = {}
autocomplete = AutoComplete(words=words, synonyms=synonyms)
print(autocomplete.search(word='test', size=2))

This prints an empty list, so it is not searching by the second word...

@seperman (Owner)

seperman commented Nov 5, 2019

Hi @eiva
What I recommend is to add these variations of the words to the words dictionary: acura rlx and test acura rlx. We already have code that does all this automatically, but it is not part of fast-autocomplete yet.

So for example your words dictionary is going to be:

{
  "acura rlx": [
    {
      "model": "rlx",
      "make": "acura"
    },
    "Acura RLX",
    3132
  ],
  "test acura rlx": [
    {
      "model": "rlx",
      "make": "acura"
    },
    "Acura RLX",
    3132
  ],
  "rlx": [
    {
      "model": "rlx",
      "make": "acura"
    },
    "Acura RLX",
    3132
  ],
  "acura": [
    {
      "make": "acura"
    },
    "Acura",
    130123
  ],
  ...
}

And then you can use the factory function as described in: https://github.com/wearefair/fast-autocomplete#sorting

You will see a similar example on that page.
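
A rough sketch of this workaround (expand_words is a hypothetical helper; the plain {} values follow eiva's original example rather than the factory format above):

from itertools import permutations
from fast_autocomplete import AutoComplete

# Hypothetical helper: index every word ordering of each phrase so a
# search can start at any word, not just the first.
def expand_words(phrases):
    expanded = {}
    for phrase in phrases:
        for combo in permutations(phrase.split()):
            expanded[' '.join(combo)] = {}
    return expanded

words = expand_words(['acura zdx', 'acura abc', 'bmw test', 'bmw coupe'])
autocomplete = AutoComplete(words=words)
print(autocomplete.search(word='test', size=2))  # now matches 'test bmw'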

@jayaddison (Contributor)

A small +1 on this issue, to agree that built-in support for this would be a useful extra (even if optional / disabled by default).

@jayaddison (Contributor)

(Afterthought: if it's challenging to implement within the library for algorithmic and/or performance reasons, perhaps including a usage example like the one above in the repo itself would be a reasonable alternative.)

@seperman (Owner)

seperman commented Jul 1, 2020

Hi @jayaddison
I have not had a chance to implement it, but basically the solution is called GADDAG: https://en.wikipedia.org/wiki/GADDAG
I agree with you that it is easier to leave some examples in the readme for now! I will try to add some soon. If you have a chance to open a PR and add some examples, that would be great too until GADDAG is implemented.
Thanks!
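
For reference, the core GADDAG trick is to index, for each letter position, the reversed prefix plus the remaining suffix, so matching can start anywhere in the word. A minimal sketch of the key generation (illustrative only, not part of this library):

def gaddag_keys(word, sep='+'):
    # One key per position: the reversed prefix ending there, a
    # separator, then the rest of the word. A trie over these keys
    # lets a lookup start from any letter and extend both ways.
    for i in range(1, len(word) + 1):
        yield word[:i][::-1] + sep + word[i:]

print(list(gaddag_keys('test')))
# ['t+est', 'et+st', 'set+t', 'tset+']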

@tomerav0

tomerav0 commented Sep 28, 2020

I "sovled" it by kind of brute force.
I took each phrase I want to use and created all the combination for that phrase.
I used another array to map each combination to the original value.
For each combination I attached UUID that later I parse at the results.

original_word = row["name"]
          #words[original_word] = [{}, original_word, row["count"]]
          parts = original_word.split()
      
          if len(parts) > 1 and len(parts) <= self._combo_words_limit:   
            for words_combo in itertools.permutations(parts, len(parts)):
                pharse, id = self.buildPharse(words_combo)
                pharses_map.append({id : original_word})
                words.append({pharse : [{}, original_word, row["count"]]})
          else:
                pharse, id = self.buildPharse(original_word)
                pharses_map.append({id : original_word})
                words.append({pharse : [{}, original_word, row["count"]]})
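
A self-contained sketch of the same approach (build_index, COMBO_WORDS_LIMIT, and the row format are stand-ins for tomerav0's surrounding class; the word values follow the factory format from the earlier comment, and the UUID suffix mirrors the "parse it out of the results" step):

import itertools
import uuid

COMBO_WORDS_LIMIT = 4  # permutations grow factorially, so cap phrase length

def build_index(rows):
    # Expand each name into every word ordering, and key each variant
    # back to the original value through a UUID suffix that can be
    # parsed out of search results later.
    words = {}
    phrases_map = {}
    for row in rows:
        original_word = row['name']
        parts = original_word.split()
        if 1 < len(parts) <= COMBO_WORDS_LIMIT:
            combos = itertools.permutations(parts)
        else:
            combos = [parts]
        for combo in combos:
            phrase_id = uuid.uuid4().hex[:8]
            phrase = '{} {}'.format(' '.join(combo), phrase_id)
            phrases_map[phrase_id] = original_word
            words[phrase] = [{}, original_word, row['count']]
    return words, phrases_map

words, phrases_map = build_index([{'name': 'bmw test coupe', 'count': 5}])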

@seperman (Owner)

seperman commented Sep 29, 2020

@tomerav0 Yeah that works! In fact I have used something similar to your solution before too.

@Ronserruya

I "sovled" it by kind of brute force.
I took each phrase I want to use and created all the combination for that phrase.
I used another array to map each combination to the original value.
For each combination I attached UUID that later I parse at the results.

original_word = row["name"]
          #words[original_word] = [{}, original_word, row["count"]]
          parts = original_word.split()
      
          if len(parts) > 1 and len(parts) <= self._combo_words_limit:   
            for words_combo in itertools.permutations(parts, len(parts)):
                pharse, id = self.buildPharse(words_combo)
                pharses_map.append({id : original_word})
                words.append({pharse : [{}, original_word, row["count"]]})
          else:
                pharse, id = self.buildPharse(original_word)
                pharses_map.append({id : original_word})
                words.append({pharse : [{}, original_word, row["count"]]})

Would you mind posting a more complete working example? Maybe a gist?

@tomerav0

I made it work but eventually gave up on this because it's extremely RAM-consuming and not feasible in any sense of cost. I'm talking about 60 GB of RAM needed just to load the model.
This lib is good for small datasets, but if you go big it's a mess.

@seperman (Owner)

seperman commented Oct 28, 2020 via email

@angrygoats

angrygoats commented Nov 12, 2020

Related to @tomerav0's memory issue:

@seperman I think there is a solution here you could pursue with Cython. Cython supports extension types. These extension types ("cdef classes") operate using C structs, which should have far less overhead.

I glanced through the code, and the AutoComplete class has some very memory-intensive operations in it. These could be cleaned up with Cython as well.
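
As a minimal illustration (this one is Cython, not plain Python, and TrieNode is a made-up name rather than fast-autocomplete's actual node class), declaring the attributes on a cdef class removes the per-instance __dict__:

# node.pyx -- compile with cythonize
cdef class TrieNode:
    cdef public str word
    cdef public dict children
    cdef public bint is_terminal

    def __cinit__(self, str word=''):
        self.word = word
        self.children = {}
        self.is_terminal = False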

@tomerav0 if the data isn't proprietary (or if you could scrub it) it would be useful to have in order to profile the code before making any changes.

@seperman once you have some test data you can profile using:

python -m cProfile -o memory_profile.profile <test_script>.py <any_args_here>

and then you can inspect it via:

python -m pstats memory_profile.profile

and then check calls using:

sort time
stats 10

to show the top 10 biggest call sites (note that cProfile profiles time rather than memory). For memory specifically, you could also look into memory_profiler. I haven't had a need for it, but it seems popular.
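
A minimal memory_profiler sketch (assumes pip install memory-profiler; the tiny words dict is only illustrative):

from memory_profiler import profile
from fast_autocomplete import AutoComplete

@profile  # prints line-by-line memory usage when the function runs
def load_model():
    words = {'acura rlx': {}, 'bmw coupe': {}}
    return AutoComplete(words=words)

if __name__ == '__main__':
    load_model()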
