typedarrays / buffers? #22

Open
d4tocchini opened this issue Nov 8, 2018 · 4 comments

Comments

@d4tocchini

wondering if it's within reasonable effort to read/write typedarrays/buffers? I can see everything goes through JSON stringify & parse, is this a hard constraint or just convention?

@mogill, are you implying in #20 that there's something inherently wrong with buffers in node, or just with SharedArrayBuffers? ems far outshines anything possible with master-worker shared mem, would love to hear a few more of your thoughts on all this...

@mogill
Owner

mogill commented Nov 8, 2018

A pure JS implementation is possible, but would require rewriting the EMS C library in JS. A partial list of that functionality includes:

  • Synchronization mechanisms that directly manipulate CPU cache coherency because OS mutexes cannot be shared between processes
  • Parallel-safe load and store operations built from said synchronization mechanisms
  • Logic to emulate atomic read-modify-write (Fetch-and-Add and Compare-and-Swap) operations on non-scalar data
  • A parallel-safe memory manager to allocate the memory to store JSON & strings
  • Off-heap metadata to indicate the type of the stored data as JSON, string, or scalar (null, int, float, or bool)
  • Metadata indicating the state of the memory (full, empty, # readers)
  • Mechanisms for scheduling and dynamically load balancing parallel loops
  • A thread/process pool
  • Mapping the memory to persistent storage
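For readers unfamiliar with the two read-modify-write operations named in the list, here is a minimal sketch of their semantics using the standard `Atomics` API on a scalar integer. It is an illustration only: it is not how EMS implements them, and it does not cover the cross-process or non-scalar cases the list describes. The helper names `fetchAndAdd` and `compareAndSwap` are made up for this example.

```javascript
// Illustration only: scalar integers, threads of a single process.
// EMS does this in C, across processes, and on non-scalar data.
const cells = new Int32Array(new SharedArrayBuffer(4 * Int32Array.BYTES_PER_ELEMENT))

// Fetch-and-add: atomically add to a cell and return the value it held before.
function fetchAndAdd(index, delta) {
  return Atomics.add(cells, index, delta)
}

// Compare-and-swap: store `next` only if the cell still holds `expected`.
// Returns the value that was in the cell before the attempt.
function compareAndSwap(index, expected, next) {
  return Atomics.compareExchange(cells, index, expected, next)
}

const before = fetchAndAdd(0, 5)          // cell 0 goes from 0 to 5
const swapped = compareAndSwap(0, 5, 9)   // matches, cell 0 goes from 5 to 9
const rejected = compareAndSwap(0, 5, 1)  // no match, cell 0 stays 9
```

A caller can tell whether a compare-and-swap succeeded by comparing the returned value with the `expected` argument it passed in.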

The last item, mapping the memory to persistent storage, isn't strictly necessary, but without it the application would need to re-create the in-memory dataset by reading the original data from disk every time the program runs. For datasets hundreds of gigabytes in size this is impractical. EMS' implementation directly leverages the OS' physical memory management so the application can be restarted with no time penalty. Specifically, if the EMS data is already anywhere in memory for any reason (e.g., in an OS buffer cache, or because the application is still running), the existing copy is accessed without going to storage when the application is executed again. In fact, the fastest way to load an entire EMS dataset from persistent storage (e.g., after a reboot) is to copy the EMS file to /dev/null.
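The "copy to /dev/null" trick amounts to reading the file sequentially and discarding the bytes, so that the OS pulls it into its buffer cache. A rough Node equivalent is sketched below; `warmPageCache` is a hypothetical helper name, not part of EMS, and whether the pages actually stay cached is up to the OS and the available memory.

```javascript
const fs = require('fs')

// Read a file front to back and throw the bytes away, similar in effect to
// copying it to /dev/null. Returns the number of bytes read.
function warmPageCache(path, chunkSize = 1 << 20) {
  const fd = fs.openSync(path, 'r')
  const chunk = Buffer.allocUnsafe(chunkSize)
  let total = 0
  try {
    let n
    // readSync returns 0 once the end of the file is reached
    while ((n = fs.readSync(fd, chunk, 0, chunkSize, null)) > 0) {
      total += n
    }
  } finally {
    fs.closeSync(fd)
  }
  return total
}
```

The chunk buffer is reused on every read, so the helper's own memory use stays at `chunkSize` regardless of the file size.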

Node's typedarray is meant to provide low-level mechanisms that exist principally to make Emscripten possible. C/C++ operate on virtual linear addresses, which is what typedarray buffers provide. It's fine for a compiler target, but not idiomatically useful for interoperating with JS variables.

@d4tocchini
Author

d4tocchini commented Nov 8, 2018

to clarify, i wasn't looking for a pure JS implementation, that sounds like a lot of effort without the payoff. rather, something like an EMS_TYPE_BUFFER. the most common situation being reading and writing img/media buffers. what would you suggest for this use-case: stringify as base64, or not use ems for media?

beyond that, buffer-like types would eliminate JSON parse/stringify serialization, a non-trivial cost, especially when trying to take advantage of the roomier 64-bit address space. i'm assuming buffers would be a more performant way to transfer data across the native/node boundary than JSON strings, yes?
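To put a number on the base64 route, here is a small sketch of what a JSON-only store forces on binary data. It uses only Node's `Buffer` and `JSON`; no EMS calls are involved, and the 3000-byte payload is an arbitrary stand-in for image data.

```javascript
// 3000 bytes standing in for image data (contents are arbitrary).
const raw = Buffer.alloc(3000, 0xab)

// On write: bytes -> base64 text -> JSON string.
const encoded = JSON.stringify(raw.toString('base64'))

// On read: JSON string -> base64 text -> bytes.
const decoded = Buffer.from(JSON.parse(encoded), 'base64')

console.log(raw.length, encoded.length) // 3000 4002
console.log(decoded.equals(raw))        // true
```

Base64 emits four characters for every three bytes, so the stored form is about a third larger than the payload, on top of an encode and a decode pass for every write and read.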

on this note, the webassembly memory interface provides a growable, paged memory buffer:

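const mem = new WebAssembly.Memory({ initial: page_count }) // 64 KiB pages
let u8view = new Uint8Array(mem.buffer)
let view = new DataView(mem.buffer)
const f32 = view.getFloat32(ptr)
...
if (needed) {
  mem.grow(page_increment)
  // grow() detaches the old ArrayBuffer, so every view must be re-created
  u8view = new Uint8Array(mem.buffer)
  view = new DataView(mem.buffer)
}
...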

although built for zero-copy wasm transport, it's just a buffer with a grow API, and it brings the wasm win-win if & when that matters. buffers offer a lot of potential perf wins in vanilla JS land. this single growable buffer pool makes typedArrays more flexible and gives a dynamic escape from garbage collection issues. data kept in buffers (if ergonomically possible) will always conserve cpu & ram compared to parsing & allocating nested JSON. with a little effort you have a more natural foundation for tapping the GPU and in-mem columnar data fun (https://github.com/jpmorganchase/perspective/tree/master/packages/perspective). and the wasm world is growing an impressive OSS toolset of its own as well. basically, it opens up ems to a larger, perf-oriented ecosystem.
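One detail worth spelling out is what growing a `WebAssembly.Memory` does to existing views. The sketch below is self-contained and uses only the standard `WebAssembly.Memory` API; the page counts are arbitrary.

```javascript
const PAGE_SIZE = 64 * 1024 // one WebAssembly page is 64 KiB
const mem = new WebAssembly.Memory({ initial: 1, maximum: 4 })

let u8 = new Uint8Array(mem.buffer)
u8[0] = 42
const oldBuffer = mem.buffer

// grow() returns the previous size in pages and detaches the old ArrayBuffer
const previousPages = mem.grow(1)

const staleLength = oldBuffer.byteLength // 0, the old buffer is detached
u8 = new Uint8Array(mem.buffer)          // re-create the view on the new buffer
const newLength = mem.buffer.byteLength  // 2 * PAGE_SIZE
const preserved = u8[0]                  // 42, contents survive the grow
```

Growth happens only when `grow()` is called, and any typed array or `DataView` created before the call is left pointing at a zero-length buffer, so code holding views has to refresh them after each grow.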

as a general principle of efficiency, i prefer keeping as much data as possible as raw as possible. replacing traditional JS allocations with pointer-like buffer indices minimizes v8 deopts by making functions intrinsically more monomorphic.
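A minimal sketch of what "pointer-like buffer indices" could look like in practice, assuming a made-up layout where each point is two consecutive floats in one `Float32Array`. The names `STRIDE`, `allocPoint`, and `lengthSquared` are hypothetical, and the pool has a fixed capacity with no bounds checking.

```javascript
// Each point occupies STRIDE consecutive floats (x, y); a "pointer" to a
// point is simply its integer index into the pool.
const STRIDE = 2
const points = new Float32Array(1024 * STRIDE)
let count = 0

// Store a point in the next free slot and return its index.
function allocPoint(x, y) {
  const p = count++
  points[p * STRIDE] = x
  points[p * STRIDE + 1] = y
  return p
}

// Every call sees the same argument type (a small integer) and the same
// array type, which is the monomorphism argument made above.
function lengthSquared(p) {
  const x = points[p * STRIDE]
  const y = points[p * STRIDE + 1]
  return x * x + y * y
}

const a = allocPoint(3, 4)
const b = allocPoint(1, 2)
```

No per-point object is allocated here, so adding points creates no garbage for the collector to trace.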

don't get me wrong, i'm 100% picking up what you're putting down, you clearly demonstrated a better future than workers + shared array buffers. Is there something intrinsically incompatible between ems and buffers? If not, then we should support them; any advice to help me patch the existing codebase?

BTW, I have ems built & working for:

  electron: "4.0.0-nightly.20181010"
  chrome: "69.0.3497.106"
  modules: "64"
  napi: "3"
  node: "10.11.0"
  v8: "6.9.427.24"

it's super impressive, you're the man. i'm shocked this is possible and not the de facto... should i submit a PR?

@mogill
Owner

mogill commented Nov 9, 2018

From the description of your use case it doesn't sound like EMS brings anything to the table, and of course encoding binary data as a base64 string would be unwanted overhead. EMS presently exposes JSON data types for which copy-in/copy-out from the JS runtime is, by definition, unavoidable. The EMS implementation already does store arbitrary byte vectors so adding the EMS_TYPE_BINARY you describe would be straightforward and would be nearly identical to the EMS_TYPE_STRING implementation.

Resizing a typedarray buffer is another matter, as it would need to be done by EMS, not JS, in order to be parallel-safe. Moreover, JS is free to relocate typedarray buffers just like any other heap data, meaning they are not safe for parallel access. SharedArrayBuffers are documented as unstable, and progress on defining a parallel memory model has been stalled for years.

If you're "shocked" EMS is not part of Node Workers, imagine how I feel about Node Workers' authors deleting my recommendations from their request for comments.

If you submit a PR that converts EMS to NAPI, and/or a PR that adds EMS_TYPE_BINARY I would be happy to merge it.

@mogill mogill mentioned this issue Nov 9, 2018