Zero-copy loading of precomputed dictionaries #65

Open

dennisss wants to merge 13 commits into master
Conversation

@dennisss commented Apr 6, 2018

Why?

Wanted a way to avoid the browser freezing for half a second or more while loading the dictionary.

How

  • Modularized the instance loading functions to make it easier to swap out the massive dictionaryTable variable for any generic key-value store interface (see the sketch after this list)
  • Added a JS implementation of an SSTable which operates directly on the raw binary buffer that was downloaded
    • For en_US, the sst file is ~500 KB gzipped vs. ~250 KB gzipped for the raw .dic
    • Aside from some metadata, this means that memory usage is only what is required to store the file itself, not a less-compact JavaScript object map
    • Loads almost instantly, as very little preprocessing is required
  • Added a Node.js script for loading a dictionary with the standard Hunspell file loader and outputting it as an optimized sst.
    • So basically any language that can be loaded at least once on a high-memory computer can be cached for the browser to load instantly
  • The new format is currently only supported in require() environments due to some dependencies, but otherwise all other functionality should still work as before
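As an illustration of the key-value abstraction described above, the checking code only needs a small lookup interface that either a plain object or a buffer-backed store can satisfy. This is a sketch under assumed names (ObjectStore, BufferStore, isKnownWord), not the PR's actual API:

// Sketch only: two interchangeable word stores behind the same lookup shape.
// A plain object store, equivalent to the classic dictionaryTable approach.
function ObjectStore(table) {
  this.get = function (word) {
    return Object.prototype.hasOwnProperty.call(table, word) ? table[word] : null;
  };
}

// A store over a raw buffer of sorted records; a real SSTable would
// binary-search the buffer directly instead of building a JS object map.
function BufferStore(buffer, lookupFn) {
  this.get = function (word) {
    return lookupFn(buffer, word); // returns the flags string or null
  };
}

// The spell-checking code only depends on get(word).
function isKnownWord(store, word) {
  return store.get(word) !== null;
}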

Limitations

  • When using the precomputed format:
    • Only words up to 255 bytes long are supported (see the record-layout sketch after this list)
    • Non-single-character FLAG settings in AFF files are not supported (but this could be solved with some additional work)
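The 255-byte limit is what you would expect from a one-byte length prefix on each key. A hypothetical record layout along those lines (an assumption for illustration, not necessarily the PR's exact on-disk format):

// Sketch of a length-prefixed record (assumed layout, not the PR's format):
// [keyLength: 1 byte][key bytes][flagsLength: 1 byte][flag bytes]
function encodeRecord(word, flags) {
  var key = Buffer.from(word, 'utf8');
  if (key.length > 255) {
    throw new Error('word longer than 255 bytes: ' + word);
  }
  var flagBytes = Buffer.from(flags, 'utf8');
  var out = Buffer.alloc(2 + key.length + flagBytes.length);
  var offset = 0;
  out[offset++] = key.length;        // a single byte caps keys at 255 bytes
  key.copy(out, offset);
  offset += key.length;
  out[offset++] = flagBytes.length;  // single-character flags fit easily
  flagBytes.copy(out, offset);
  return out;
}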

Other things fixed:

  • Fixed character set issues when using Node.js to load from a file
  • Proper serialization and deserialization of RegExps when loading properties from JSON (a sketch of the technique follows this list)
  • Also did some reorganization into a more Node-like package while maintaining the pure JavaScript support
  • Added mocha for running tests that validate the new precomputed dictionary format
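For the RegExp round-tripping, the usual technique is to keep the source and flags and rebuild the RegExp when reading the JSON back. A minimal sketch of that general approach (not necessarily the PR's exact code):

// Sketch: round-tripping a RegExp through JSON by keeping source + flags.
function serializeRegExp(re) {
  return { source: re.source, flags: re.flags };
}

function deserializeRegExp(obj) {
  return new RegExp(obj.source, obj.flags);
}

var json = JSON.stringify(serializeRegExp(/^pre[a-z]+$/i));
var revived = deserializeRegExp(JSON.parse(json));
console.log(revived.test('Prefix')); // true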

Benchmarks

TL;DR: Loading from the precomputed dictionary table is orders of magnitude better in terms of memory usage and load time, but the check and suggest operations are much slower. From my tests, more performance could probably be gained by optimizing the new data structure, but so far it has been sufficient for interactive applications.

Memory usage was measured manually by idling the following script:

var t = require('.');
var d = new t();
d.loadPrecomputed('en_US'); // d.load('en_US')

setTimeout(function(){ console.log('done!') }, 1000000)

Overall Node process memory usage:

Using loadPrecomputed: 14.8 MB
Using load: 59.9 MB
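These figures were read manually from the OS process list; as a side note, they could also be reported from inside the script with Node's built-in process.memoryUsage(). The Typo usage below just mirrors the snippet above and is otherwise an assumption:

// Sketch: report resident set size from inside the script rather than
// reading it manually from the OS process list.
var Typo = require('.');
var d = new Typo();
d.loadPrecomputed('en_US'); // or d.load('en_US') for the classic path

setTimeout(function () {
  var rss = process.memoryUsage().rss;
  console.log('RSS: ' + (rss / 1024 / 1024).toFixed(1) + ' MB');
}, 5000);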

Speed

Generated by the bin/benchmark.js script (a rough sketch of such a timing loop follows the results below).

Dictionary load time

  • regular 0.302s
  • precomputed 0.022s

dict.check() speed

  • hypersensitiveness (reg) 0.178s
  • hypersensitiveness (pre) 2.051s
  • Abbott's (reg) 0.114s
  • Abbott's (pre) 0.916s
  • 9th (reg) 0.147s
  • 9th (pre) 0.994s
  • aaraara (reg) 0.299s
  • aaraara (pre) 0.957s
  • didn't (reg) 0.11s
  • didn't (pre) 0.804s
  • he're (reg) 0.277s
  • he're (pre) 0.914s
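For reference, a rough sketch of the kind of timing loop bin/benchmark.js might run (the labels and iteration count here are assumptions; the real script may differ):

// Sketch of a timing loop for repeated check() calls.
var Typo = require('.');

function bench(label, fn, iterations) {
  var start = Date.now();
  for (var i = 0; i < iterations; i++) {
    fn();
  }
  console.log(label + ' ' + ((Date.now() - start) / 1000) + 's');
}

var pre = new Typo();
pre.loadPrecomputed('en_US');

bench('hypersensitiveness (pre)', function () {
  pre.check('hypersensitiveness');
}, 1000);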

@cfinke (Owner) commented Apr 30, 2018

I'm fully in support of this idea. A few questions first, though:

  1. Is it fully backwards compatible? If this change were merged and every single person using Typo started using it, would they have to change anything?

  2. Are you able to split any changes from "Other things fixed" into their own pull requests?

Thanks,

Chris
