

Parallelizing Anny

Goal

Neuron output values are calculated in series, one at a time. Browsers now support parallelizing tasks across multiple CPU cores via Web Workers. The goal of this experiment was to increase training performance by calculating all neurons in a given layer in parallel, up to 8 at a time on an 8-core CPU.

Results

The computation time lost managing Web Workers and passing data between them outweighed the gains of calculating neurons in parallel. It is dramatically slower, up to 100 times slower depending on the method used.

Approaches

There are limitations to serialization in the browser when using Web Workers. The salient points for this case were:

  • Web Workers cannot share memory.
  • Communication to and from Web Workers is limited to postMessage (similar to iframes).
  • Structured cloning is used to post data, so functions cannot be passed without stringify / eval.
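
To make these constraints concrete, here is a minimal sketch of shipping a stringified function to a vanilla Web Worker and rebuilding it on the other side with eval. The inline worker source and names are illustrative, not Anny's code:

```js
// Functions are not structured-cloneable, so the activation function
// travels as source text and is rebuilt inside the worker with eval.
var workerSource =
  "onmessage = function (e) {" +
  "  var activation = eval('(' + e.data.activationSrc + ')');" +
  "  postMessage(activation(e.data.input));" +
  "};";

var blob = new Blob([workerSource], { type: 'application/javascript' });
var worker = new Worker(URL.createObjectURL(blob));

var sigmoid = function (x) { return 1 / (1 + Math.exp(-x)); };

worker.onmessage = function (e) { console.log('activated:', e.data); };
worker.postMessage({ activationSrc: sigmoid.toString(), input: 0.5 });
```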

Due to these limitations, the following approaches were attempted.

Serializing and message posting had to happen every time any neuron was activated. A simple OR gate has a minimum of 4 neurons (2 inputs, 1 bias, 1 output). Anny reliably trains an OR gate to an error of 0.001 in ~65 epochs; that is, it takes ~250 presentations of the 4 possible input combinations of an OR gate to learn to approximate its function. So, 4 neurons * 4 input combinations * 65 epochs ≈ 1,000 Web Worker tasks and serialized neurons, and ≈ 2,000 messages posted (1 in, 1 out).

Serializing

The first approach was to serialize only the data necessary to calculate a neuron's output. Essentially, this involved making a shallow copy of a nearly complete neuron, including its connections, stringifying its activation and derivative functions, posting that to a worker for calculation, and receiving the result back.

This allowed neurons to be calculated in parallel. As you can imagine, though, the time required to serialize the neuron and all of its immediate connections was cumbersome. The result was a much slower training session than a single-threaded approach. On average, training an OR gate with this method took ~16 seconds, compared to ~45ms when done in series. Note, this was done using Parallel.js.

The branch that implemented this is available here: https://github.com/dev-coop/anny/tree/feature/parallelize
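
In essence, the approach looked something like the sketch below, assuming Parallel.js is loaded globally as Parallel. The neuron shape here is an illustrative stand-in for Anny's internals, not its actual API:

```js
// An illustrative stand-in neuron (not Anny's real structure).
var someNeuron = {
  connections: [
    { source: { output: 1 }, weight: 0.5 },
    { source: { output: 0 }, weight: 0.5 }
  ],
  activation: function (x) { return 1 / (1 + Math.exp(-x)); }
};

function activateInWorker(neuron) {
  // Shallow-copy only what the worker needs; the activation function
  // must travel as source text since structured cloning cannot copy it.
  var payload = {
    inputs: neuron.connections.map(function (c) {
      return c.source.output * c.weight;
    }),
    activationSrc: neuron.activation.toString()
  };

  // Parallel.js stringifies the spawned function, posts the payload to a
  // Web Worker, runs it there, and posts the result back.
  return new Parallel(payload).spawn(function (data) {
    var activation = eval('(' + data.activationSrc + ')');
    var sum = data.inputs.reduce(function (acc, x) { return acc + x; }, 0);
    return activation(sum);
  });
}

// Every activation of every neuron pays this serialize/post/receive cost.
activateInWorker(someNeuron).then(function (output) {
  console.log('output:', output);
});
```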

Instantiation

The next logical question was, "Hey, what if the entire neuron class lived in the worker already?" This avoids having to serialize an entire neuron and all of its connections for every activation. The entire neuron class would live in a Web Worker to start with; to activate it, you could simply post its input as a message. Vanilla Web Workers were used this time, to remove as much overhead as possible.

An example can be seen here: https://github.com/levithomason/worker-test
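
A rough sketch of the idea with a vanilla Web Worker follows. The Neuron here is a toy stand-in for Anny's class, and the message shapes are assumptions:

```js
// neuron-worker.js: the neuron lives here permanently; only raw input
// and output values ever cross the worker boundary.
function Neuron(weights, bias) {
  this.weights = weights;
  this.bias = bias;
}

Neuron.prototype.activate = function (inputs) {
  var sum = this.bias;
  for (var i = 0; i < inputs.length; i++) {
    sum += inputs[i] * this.weights[i];
  }
  return 1 / (1 + Math.exp(-sum)); // sigmoid
};

var neuron = null;
onmessage = function (e) {
  if (e.data.type === 'init') {
    neuron = new Neuron(e.data.weights, e.data.bias);
  } else if (e.data.type === 'activate') {
    postMessage(neuron.activate(e.data.inputs));
  }
};
```

```js
// main.js: instantiate once, then activate by posting inputs only.
var worker = new Worker('neuron-worker.js');
worker.onmessage = function (e) { console.log('output:', e.data); };
worker.postMessage({ type: 'init', weights: [0.5, 0.5], bias: -0.2 });
worker.postMessage({ type: 'activate', inputs: [1, 0] });
```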

This produced much better results but, as the linked example shows, they were still terrible. This is a contrived example, as a full net was not trained; instead, the activation time alone was measured. A successful training time is not important here, since the time required to activate an equal number of neurons an equal number of times would be the same. The only difference would be the activation values.

Potential Benefits

There are still potential performance gains to be had from parallelizing in the browser.

Freeing up the UI

Tasks running in parallel do not block the main thread, so the browser does not freeze during a large training loop. That is always good. If the entire neural net were constructed in a single parallel thread, you would gain this benefit, and the time lost would be minimal compared to creating a Web Worker for every neuron or every activation of every neuron.
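
For example, something like the sketch below. The worker script, message shapes, and the stand-in training loop are all assumptions, not Anny's API; the point is that the loop blocks only the worker, never the UI thread:

```js
// trainer-worker.js: the whole training loop runs off the main thread.
onmessage = function (e) {
  var error = 1;
  var epoch = 0;
  while (error > e.data.errorThreshold) {
    epoch++;
    error *= 0.97; // stand-in for a real epoch of forward/backward passes
    if (epoch % 10 === 0) {
      postMessage({ type: 'progress', epoch: epoch, error: error });
    }
  }
  postMessage({ type: 'done', epoch: epoch, error: error });
};
```

```js
// main.js: the page stays interactive while training runs.
var trainer = new Worker('trainer-worker.js');
trainer.onmessage = function (e) {
  console.log(e.data.type, 'epoch:', e.data.epoch, 'error:', e.data.error);
};
trainer.postMessage({ errorThreshold: 0.001 });
```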

Transferable Objects

Anny's architecture and code style were intended to serve those just starting with machine learning. Readability and simplicity were chosen over optimization, flexibility, and scalability. Because of this, it does not lend itself as well as it could to parallelization.

I can imagine a different architecture, designed from the start to take advantage of blazing fast Transferable Objects, proving to be useful. This could avoid the copying cost of structured cloning entirely. This might be a fun next build, if there is some valid use case to justify the time.
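
A minimal sketch of the mechanism, with a hypothetical worker script: if weights lived in typed arrays, ownership of the underlying ArrayBuffer could be transferred to a worker rather than cloned. Note that postMessage is still used; it is the copy that is avoided:

```js
var weights = new Float32Array([0.5, -0.3, 0.8, 0.1]);

var worker = new Worker('layer-worker.js'); // hypothetical worker script

// The second argument lists buffers whose ownership is transferred
// (zero-copy) instead of being structured-cloned.
worker.postMessage({ weights: weights.buffer }, [weights.buffer]);

// After the transfer, the buffer is detached on this side.
console.log(weights.length); // 0: the main thread no longer owns it
```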