
Binary format for the static prediction tables #7

Open
Yoric opened this issue Oct 4, 2018 · 5 comments
@Yoric
Contributor

Yoric commented Oct 4, 2018

Since we want to ship dictionaries, we need to specify their format, too.

I'm pretty sure that we can also use efficient entropy coding for the dictionaries themselves, fwiw.

@Yoric
Contributor Author

Yoric commented Oct 9, 2018

I got some data with the latest dictionary builder. We'll need an efficient format to actually ship dictionaries.

@Yoric
Contributor Author

Yoric commented Oct 9, 2018

A few things to recall:

  • whatever dictionary we ship today, we still need to ship in 10 years (the format may change, but not the contents);
  • we will need to ship new dictionaries whenever the JavaScript AST changes, which suggests we may wish to ship them somehow as diffs;
  • after a few versions of JavaScript, we'll end up with several dictionaries in memory at the same time;
  • developers may wish to ship a new dictionary because, for some reason, their code is not compressed efficiently by our entropy coder, which also suggests we may wish to ship dictionaries as diffs.

In turn, shipping as diffs suggests that we want to ship numbers of occurrences (easy to add to without precision loss) rather than actual probabilities (floating-point values, which are harder to manipulate).
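
As a purely illustrative sketch (none of these names exist in binjs-ref), a table of raw occurrence counts makes that straightforward: a base table and a diff merge by simple addition, with no precision loss, and probabilities can be re-derived whenever they are needed:

```rust
use std::collections::HashMap;

/// Hypothetical prediction table: for each context (e.g. an AST path suffix),
/// the number of times each symbol was observed in that context.
/// Counts, unlike floating-point probabilities, merge exactly.
#[derive(Default)]
struct CountTable {
    counts: HashMap<String, HashMap<String, u64>>,
}

impl CountTable {
    /// Merge a diff into this table by adding occurrence counts.
    /// No precision is lost, regardless of how many diffs are applied.
    fn merge(&mut self, diff: &CountTable) {
        for (context, symbols) in &diff.counts {
            let entry = self.counts.entry(context.clone()).or_default();
            for (symbol, n) in symbols {
                *entry.entry(symbol.clone()).or_insert(0) += n;
            }
        }
    }

    /// Derive a probability on demand; the shipped artifact never stores floats.
    fn probability(&self, context: &str, symbol: &str) -> f64 {
        match self.counts.get(context) {
            None => 0.0,
            Some(symbols) => {
                let total: u64 = symbols.values().sum();
                if total == 0 {
                    0.0
                } else {
                    *symbols.get(symbol).unwrap_or(&0) as f64 / total as f64
                }
            }
        }
    }
}
```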

@kannanvijayan-zz

Recapping some conversations from IRC here.

Firstly, I feel we should settle on some consistent terminology. "Dictionary" is a very overloaded term and suggests tables of words or character sequences of some sort. I would suggest using "prediction tables" to talk about this. They are mappings from various contexts (e.g. a tree-path suffix, or a move-to-front cache of strings) to a table that acts as a predictor, providing a probability distribution for the elements showing up at some location in the data stream.
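
To make the terminology concrete, here is one possible shape such a prediction table could take, sketched in Rust; the two context kinds come straight from the examples above, while every type and field name is invented for illustration:

```rust
use std::collections::HashMap;

/// Stand-in for an AST node kind (e.g. "IfStatement"); purely illustrative.
type NodeKind = String;
/// A probability distribution over the symbols that may appear next.
type Distribution = HashMap<String, f64>;

/// The contexts a predictor can be keyed on, per the examples above.
#[derive(PartialEq, Eq, Hash)]
enum Context {
    /// A suffix of the path from the AST root down to the current node.
    TreePathSuffix(Vec<NodeKind>),
    /// A slot in a move-to-front cache of recently seen strings.
    StringMtf(usize),
}

/// A prediction table: each context acts as a predictor, yielding a
/// probability distribution for the element appearing at that point
/// in the data stream.
struct PredictionTable {
    distributions: HashMap<Context, Distribution>,
}
```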

That said, the key thing that's relevant here is that we have decided that prediction tables will not be shipped in-line with the content, and that we will allow compressed content to reference a prediction table (e.g. with a URL + verification hash). This effectively offloads the problem entirely.

Initially, we can produce a "standard" prediction table derived from a "general corpus" of javascript. This is the default table which tools can use. This table can be made available at a standard "well-known" location (or shipped with the browser, as an implementation detail).

In the general case, an individual binast file may reference a custom prediction table. However, we can expect that these tables will not change very frequently, and that they will effectively be cached locally for every request made to content that references them (except for the first).

From the browser's side, the logic for decoding an incoming binast file A referencing a table T would be:

  1. Receive binast file A.
  2. Header of binast file A identifies probability table T by URL+content-hash
  3. Check local cache for T (using content-hash as key) - highly likely to hit. Retrieve T and cache if necessary.
  4. Proceed to decoding binast file A using probability table T.

In this scenario, the "default table" is simply another table that may be referenced. Whether it's shipped with the browser or not is an implementation detail that doesn't matter from a performance perspective.
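
A sketch of that client-side flow, with every name and helper invented for illustration; the header layout and the fetching/verification machinery are exactly the parts still to be specified:

```rust
use std::collections::HashMap;

/// Reference to a prediction table, as a binast header might carry it:
/// a URL to fetch the table from, plus a hash used both to verify it and
/// to key the local cache. Names and layout are invented for this sketch.
struct TableRef {
    url: String,
    content_hash: [u8; 32],
}

/// Placeholders for a decoded prediction table and a decoded file.
struct PredictionTable;
struct DecodedAst;

/// Local cache of prediction tables, keyed by content hash (step 3).
type TableCache = HashMap<[u8; 32], PredictionTable>;

/// Hypothetical decode flow for an incoming binast file, following the four
/// steps above. The helper functions are stubs, not real browser APIs.
fn decode_binast(file: &[u8], cache: &mut TableCache) -> Result<DecodedAst, String> {
    // 1-2. Read the header and find which prediction table it references.
    let table_ref = read_table_ref(file)?;

    // 3. Look the table up by content hash; on a miss, fetch, verify and cache it.
    if !cache.contains_key(&table_ref.content_hash) {
        let table = fetch_and_verify(&table_ref)?;
        cache.insert(table_ref.content_hash, table);
    }
    let table = &cache[&table_ref.content_hash];

    // 4. Decode the file proper using the table's probability model.
    decode_with_table(file, table)
}

// Stubs standing in for the parts that are still unspecified.
fn read_table_ref(_file: &[u8]) -> Result<TableRef, String> { unimplemented!() }
fn fetch_and_verify(_r: &TableRef) -> Result<PredictionTable, String> { unimplemented!() }
fn decode_with_table(_f: &[u8], _t: &PredictionTable) -> Result<DecodedAst, String> { unimplemented!() }
```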

@kannanvijayan-zz

Pinging @dominiccooney for discussion alert.

@Yoric Yoric changed the title from "Binary format for the dictionary itself" to "Binary format for the static prediction tables" on Oct 10, 2018
Yoric added a commit to Yoric/binjs-ref that referenced this issue Oct 10, 2018
This changeset extends Dictionary<> and KindedStringsPerFile<> to support the generic Serde
(de)serialization mechanism. While we have not decided on a format for storing dictionaries
(that's binast/container-format-spec#7), this will let us
experiment using a simple, temporary format.
Yoric added further commits with the same changeset description to Yoric/binjs-ref that referenced this issue Oct 11, 2018
@Yoric
Contributor Author

Yoric commented Nov 8, 2018

From the browser's side, the logic for decoding an incoming binast file A referencing a table T would be:

  1. Receive binast file A.
  2. Header of binast file A identifies probability table T by URL+content-hash
  3. Check local cache for T (using content-hash as key) - highly likely to hit. Retrieve T and cache if necessary.
  4. Proceed to decoding binast file A using probability table T.

In this scenario, the "default table" is simply another table that may be referenced. Whether it's shipped with the browser or not is an implementation detail that doesn't matter from a performance perspective.

Note that it's a bit more complicated than that, as:

  • A and T need to be received by the browser (in Firefox, that's Necko);
  • A and T need to be decompressed by the JS VM (in Firefox, that's SpiderMonkey).

We should move this conversation to the relevant bug, though: #4.
Consequently, we can already imagine that SpiderMonkey will need to expose an API that lets Necko inspect A to determine its association with T.
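
Purely to illustrate the shape such an API could take (this is not an actual SpiderMonkey or Necko interface, and no header layout has been decided), the VM might expose a header-inspection entry point along these lines:

```rust
/// Hypothetical result of inspecting the header of a binast file A:
/// the reference to the prediction table T it depends on.
pub struct TableAssociation {
    /// Where T can be fetched from.
    pub url: String,
    /// Hash used both to verify T and to look it up in the local cache.
    pub content_hash: [u8; 32],
}

/// Hypothetical entry point the JS VM could expose to the networking layer:
/// given the header bytes of A, return the reference to T, or an error for
/// a malformed header. Not a real SpiderMonkey API; the header layout is
/// exactly what remains to be specified.
pub fn inspect_table_association(header: &[u8]) -> Result<TableAssociation, String> {
    let _ = header;
    unimplemented!("header format not yet specified")
}
```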
