Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

50% speed increase for suggest call #41

Merged
merged 1 commit into from
Sep 20, 2016
Merged

50% speed increase for suggest call #41

merged 1 commit into from
Sep 20, 2016

Conversation

Thomas101
Copy link
Contributor

I've done some performance tweaking for the suggest call. No changes to api, no changes to results. Algorithm remains the same. This removes some redundancy in processing.

  • Now uses a stripped down version of es6 sets (where available) or a polyfill when required. This reduces the amount of suggestions generate by edits1 by about a half just by removing duplicates
  • When checking the suggested results it now uses self.checkExact rather than self.check as we will only be looking for exact matches here and self.check just introduces extra processing.

Here's some stats from some test words...

takr
es6: 52ms
polyfil: 54ms
original: 115ms
55% faster

hoouse
es6: 134ms
polyfil: 133ms
original: 200ms
33% faster

suggestionsr
es6: 590ms
polyfil: 598ms
original: 1201ms
51% faster

dsdsfdsfdsfdfsdfsdfsdf
es6: 3105ms
polyfil: 3116ms
original: 5676ms
46% faster

…ed suggest to use self.checkExact rather than self.check as we would only be looking for exact matches here
@kofifus
Copy link

kofifus commented Sep 4, 2016

good work :)

one fix you can do is that in https://github.com/Thomas101/Typo.js/blob/bb9862bd6d0c81fa0ebdc848bdd0ebb7b222f644/typo/typo.js#L932

it's enought to do index[v]=true; which will save some memory.

But in general the problem with your solution is memory usage. doing ed1.forEach(function(word) { ed2.add(word) }); creates a huge array (easily over 100,000 words), your WordList then adds another array (index) of the same size and another hash (words) of the same size, and Set probably does the same internally. All this on top of the dictionaries that typo loads to memory.

In #45 I optimized the memory use of suggest to ~ 2* edit1 that is ~1000 words. This can be further optimized to 1 * edit1 - when you think about it we can run 'known' inside the iteration that builds ed2 without storing the ed2 values as we never need them again. I found I got huge speed gains as well - I think not holding these huge structures results in less paging and everything going much faster.

But that kind of solution and async implementation still leaves the issue of duplicate entries. I wonder if a more clever edit1 algorithm can not produce duplicate in the first place. Or something like https://github.com/itslenny/SymSpell.js/blob/master/dist/SymSpell.js

Cheers

@Thomas101
Copy link
Contributor Author

I agree on the fix, I don't know what came over me there for adding the value in twice! This patch was the lowest hanging fruit I could find with a pretty big bang for buck in terms of speed improvements. There's definately further optimisation that can be done on the suggestion algorithm as a whole. Both in terms of speed and memory.

@cfinke cfinke merged commit 985c0db into cfinke:master Sep 20, 2016
@cfinke
Copy link
Owner

cfinke commented Sep 20, 2016

Thanks for the contribution @Thomas101 and @kofifus, and sorry for the delay in merging.

@cfinke
Copy link
Owner

cfinke commented Sep 20, 2016

I had to revert this -- the quality of the suggest() results dropped noticeably. @Thomas101 Was there an intended change to functionality, or was it supposed to be performance changes only?

@Thomas101
Copy link
Contributor Author

The results should have been the same. Do you have any sample suggestion sets? Were you using node or the browser?

@cfinke
Copy link
Owner

cfinke commented Sep 25, 2016

I was running the tests in the browser via the Chrome extension.

An example: suggest( 'speling' ) returned ["seeking", "spieling", "pelting"], which does not seem as helpful as the old [ "spelling", "spieling", "spewing" ]. All of the old suggestions were one modification away from the argument to suggest(), but two of the new suggestions are two modifications away.

@cfinke
Copy link
Owner

cfinke commented Feb 13, 2017

I realized today why the suggestions changed -- Typo was using the fact that a word added to the edits1() return value multiple times in order to weight that word as a more useful spelling suggestion. The WordList class can't take advantage of this because it uses a hash to store those values.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants