Zero-copy loading of precomputed dictionaries #65

Open

dennisss wants to merge 13 commits into master
Conversation

@dennisss commented Apr 6, 2018

Why?

Wanted a way to avoid the browser freezing for half a second or more while loading the dictionary.

How

  • Modularized the instance loading functions to make it easier to swap out the massive dictionaryTable variable for any generic key-value store interface (see the sketch after this list)
  • Added a JS implementation of an SSTable which operates directly on the raw binary buffer that was downloaded
    • For en_US, the sst file is ~500 KB gzipped vs. ~250 KB gzipped for the raw .dic
    • Aside from some metadata, this means that memory usage is only what is required to store the file itself, not a less-compact JavaScript object map
    • Loads almost instantly, as very little preprocessing is required
  • Added a Node.js script for loading a dictionary with the standard Hunspell file loader and outputting it as an optimized sst.
    • So basically any language that can be loaded at least once on a high-memory computer can be cached for the browser to load instantly
  • The new format is currently only supported in require() environments due to some dependencies, but otherwise all other functionality should still work as before
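As an illustration of the key-value abstraction described above, the checking code only needs a small lookup interface that either a plain object or a buffer-backed store can satisfy. This is a sketch under assumed names (ObjectStore, BufferStore, isKnownWord), not the PR's actual API:

// Sketch only: two interchangeable word stores behind the same lookup shape.
// A plain object store, equivalent to the classic dictionaryTable approach.
function ObjectStore(table) {
  this.get = function (word) {
    return Object.prototype.hasOwnProperty.call(table, word) ? table[word] : null;
  };
}

// A store over a raw buffer of sorted records; a real SSTable would
// binary-search the buffer directly instead of building a JS object map.
function BufferStore(buffer, lookupFn) {
  this.get = function (word) {
    return lookupFn(buffer, word); // returns the flags string or null
  };
}

// The spell-checking code only depends on get(word).
function isKnownWord(store, word) {
  return store.get(word) !== null;
}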

Limitations

  • When using the precomputed format:
    • Only words up to 255 bytes long are supported (see the record-layout sketch after this list)
    • Non-single-character FLAG settings in AFF files are not supported (but this could be solved with some additional work)
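The 255-byte limit is what you would expect from a one-byte length prefix on each key. A hypothetical record layout along those lines (an assumption for illustration, not necessarily the PR's exact on-disk format):

// Sketch of a length-prefixed record (assumed layout, not the PR's format):
// [keyLength: 1 byte][key bytes][flagsLength: 1 byte][flag bytes]
function encodeRecord(word, flags) {
  var key = Buffer.from(word, 'utf8');
  if (key.length > 255) {
    throw new Error('word longer than 255 bytes: ' + word);
  }
  var flagBytes = Buffer.from(flags, 'utf8');
  var out = Buffer.alloc(2 + key.length + flagBytes.length);
  var offset = 0;
  out[offset++] = key.length;        // a single byte caps keys at 255 bytes
  key.copy(out, offset);
  offset += key.length;
  out[offset++] = flagBytes.length;  // single-character flags fit easily
  flagBytes.copy(out, offset);
  return out;
}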

Other things fixed:

  • Fixed character set issues when using Node.js to load from a file
  • Proper serialization and deserialization of RegExps when loading properties from JSON (a sketch of the technique follows this list)
  • Also did some reorganization into a more Node-like package while maintaining the pure JavaScript support
  • Added mocha for running tests that validate the new precomputed dictionary format
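For the RegExp round-tripping, the usual technique is to keep the source and flags and rebuild the RegExp when reading the JSON back. A minimal sketch of that general approach (not necessarily the PR's exact code):

// Sketch: round-tripping a RegExp through JSON by keeping source + flags.
function serializeRegExp(re) {
  return { source: re.source, flags: re.flags };
}

function deserializeRegExp(obj) {
  return new RegExp(obj.source, obj.flags);
}

var json = JSON.stringify(serializeRegExp(/^pre[a-z]+$/i));
var revived = deserializeRegExp(JSON.parse(json));
console.log(revived.test('Prefix')); // true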

Benchmarks

TL;DR: Loading from the precomputed dictionary table is orders of magnitude better in terms of memory usage and load time, but the check and suggest operations are much slower. From my tests, more performance could probably be gained by optimizing the new data structure, but so far it has been sufficient for interactive applications.

Memory usage was measured manually by idling the following script:

var t = require('.');
var d = new t();
d.loadPrecomputed('en_US'); // d.load('en_US')

setTimeout(function(){ console.log('done!') }, 1000000)

Overall Node process memory usage:

Using loadPrecomputed: 14.8 MB
Using load: 59.9 MB
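These figures were read manually from the OS process list; as a side note, they could also be reported from inside the script with Node's built-in process.memoryUsage(). The Typo usage below just mirrors the snippet above and is otherwise an assumption:

// Sketch: report resident set size from inside the script rather than
// reading it manually from the OS process list.
var Typo = require('.');
var d = new Typo();
d.loadPrecomputed('en_US'); // or d.load('en_US') for the classic path

setTimeout(function () {
  var rss = process.memoryUsage().rss;
  console.log('RSS: ' + (rss / 1024 / 1024).toFixed(1) + ' MB');
}, 5000);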

Speed

Generated by the bin/benchmark.js script (a rough sketch of such a timing loop follows the results below).

Dictionary load time

  • regular 0.302s
  • precomputed 0.022s

dict.check() speed

  • hypersensitiveness (reg) 0.178s
  • hypersensitiveness (pre) 2.051s
  • Abbott's (reg) 0.114s
  • Abbott's (pre) 0.916s
  • 9th (reg) 0.147s
  • 9th (pre) 0.994s
  • aaraara (reg) 0.299s
  • aaraara (pre) 0.957s
  • didn't (reg) 0.11s
  • didn't (pre) 0.804s
  • he're (reg) 0.277s
  • he're (pre) 0.914s
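For reference, a rough sketch of the kind of timing loop bin/benchmark.js might run (the labels and iteration count here are assumptions; the real script may differ):

// Sketch of a timing loop for repeated check() calls.
var Typo = require('.');

function bench(label, fn, iterations) {
  var start = Date.now();
  for (var i = 0; i < iterations; i++) {
    fn();
  }
  console.log(label + ' ' + ((Date.now() - start) / 1000) + 's');
}

var pre = new Typo();
pre.loadPrecomputed('en_US');

bench('hypersensitiveness (pre)', function () {
  pre.check('hypersensitiveness');
}, 1000);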

@cfinke (Owner) commented Apr 30, 2018

I'm fully in support of this idea. A few questions first, though:

  1. Is it fully backwards compatible? If this change were merged and every single person using Typo started using it, would they have to change anything?

  2. Are you able to split any changes from "Other things fixed" into their own pull requests?

Thanks,

Chris
