Data parallelism for GPU (and CPU) #168

Open
rcoreilly opened this issue Feb 12, 2023 · 1 comment
@rcoreilly
Member

We should get major efficiency gains and speedups by running multiple "data" pathways through the same synaptic weights and network architecture. In effect, it is like "shared weights" for multiple copies of the same network: the state (Neurons, Pools, LayerVals, Exts, Context) is replicated D times, and each copy is fed a different set of inputs. Because there is only one set of weights, the weight changes naturally accumulate in parallel (DWt must have appropriate aggregation logic).
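For concreteness, a minimal sketch of what is replicated versus shared -- the types here are simplified illustrations, not the actual axon data structures:

```go
package datapar

// Neuron is a simplified stand-in for the per-neuron dynamic state.
type Neuron struct{ Act, Ge, Gi float32 }

// Network shares a single set of weights across D replicated copies of the state.
type Network struct {
	D       int        // number of data-parallel copies
	Neurons [][]Neuron // [D][NNeurons]: dynamic state, replicated per data copy
	SynWt   []float32  // [NSyns]: one shared set of synaptic weights
	DWt     []float32  // [NSyns]: weight changes, aggregated across all D copies
}
```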

This data-parallel approach is already used in the MPI versions of models like LVis and even protobrain, so we know it works well. It amounts to a "minibatch" kind of backprop learning, and D = 16 typically works well, but going much higher tends not to work well because the DWt changes cancel each other out too much.

This can effectively turn even a small network into a big one, where the GPU overhead becomes worth it. The biggest difficulty is managing the data input side when not using simple tabular datasets.

A GPU-only version of this can be done very easily by just adding an outer D dimension to the state arrays as stored in the Network. The CPU only sees the 0 index, and, per below, the GPU can very easily process the outer D dimension in parallel. We will probably start with this just to get a quick first pass up and running.

However, from a memory cache read/write perspective, it should be faster to organize the state data the other way around, with D as the inner dimension, so that a warp of processors reads sequential bytes (we would also need to turn the array-of-structs into a struct-of-arrays to really make that relevant). To do this, we'd have to bake the D dimension into the Shape of each layer in one way or another. Given all the ways in which the compute and connectivity logic depends on the 4D or 2D shape, it probably makes sense to just have a separate accessor method for actually grabbing the network state based on the standard shape geometry, plus the d index, which can be stored in the Context. We could then try it both ways and see how much difference it makes.
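A rough sketch of such an accessor -- the names (Layer fields, NeurIndex) are hypothetical and the real version would cover multiple state variables, but it shows how the d index can be folded in as either the outer or the inner dimension behind one method:

```go
package datapar

// Layer keeps its standard 2D/4D shape for all compute and connectivity logic;
// only the flat state storage knows about the extra D dimension.
type Layer struct {
	NNeurons int  // number of neurons, from the standard 2D/4D shape
	D        int  // number of data-parallel copies
	DInner   bool // if true, D is the inner (fastest-varying) dimension

	Act []float32 // example state variable, struct-of-arrays: len = NNeurons * D
}

// NeurIndex returns the flat index for neuron ni (from the standard shape
// geometry) and data copy di (e.g., taken from the Context).
func (ly *Layer) NeurIndex(ni, di int) int {
	if ly.DInner {
		return ni*ly.D + di // [NNeurons][D]: adjacent threads read adjacent di values
	}
	return di*ly.NNeurons + ni // [D][NNeurons]: di = 0 reproduces the current layout
}
```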

The bottom line is that we really need to do a systematic benchmark test to see how much difference this makes before committing to something more invasive. One possibility, while I'm thinking about it, is that the NVIDIA A100 is really sensitive to the memory layout dynamics and somehow the M1 is not? In the current implementation with array-of-structs, the M1 is much faster than the A100, which should not be the case "on paper".

On the GPU, this will be nearly trivial to implement:

  • Use the 'Y' index on compute for the 'd' data index. Can typically go to 16x data, so 4x16=64 warp geom. This naturally allows each warp of threads to share the same synapses and index accesses, and optimizes D-parallel access if we organize D as the inner dimension.
  • All state access just needs a d index and appropriate wrapper methods -- easy.
  • Interestingly, we will need to break Synapse into separate SynCa and SynWts structs, with SynCa replicated D times -- it is dynamic state, whereas SynWts are shared.
  • DWt just loops over the D networks per synapse and aggregates the changes -- easy (see the sketch after this list).
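Here is a rough sketch of the Synapse split and the DWt aggregation -- the field names and the learning-rule term are illustrative placeholders, not the actual axon variables:

```go
package datapar

// SynWts holds the weight variables, shared across all D data copies.
type SynWts struct {
	Wt  float32 // effective weight
	LWt float32 // linear learning weight
	DWt float32 // weight change, aggregated across all D data copies
}

// SynCa holds the dynamic synaptic calcium state, replicated D times.
type SynCa struct {
	CaM, CaP, CaD float32 // synaptic calcium traces
}

// DWtSyn aggregates the weight change for one synapse across all D data copies;
// ca has length D, one SynCa per copy.  The (CaP - CaD) term is just a stand-in
// for whatever the actual learning rule computes per copy.
func DWtSyn(wts *SynWts, ca []SynCa, lrate float32) {
	var dwt float32
	for di := range ca {
		dwt += ca[di].CaP - ca[di].CaD
	}
	wts.DWt += lrate * dwt / float32(len(ca)) // average over copies, minibatch-style
}
```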

For the CPU, assuming we put D in the inner loop, it would make sense to follow the same logic and iterate over D within each thread (goroutine), so each one processes the same local memory. For the first pass, when only doing D-parallel on the GPU, the CPU just does a few things to help the GPU -- it perhaps makes sense to just go ahead and put the rest of the computation on the GPU so it is cleaner.
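A sketch of that CPU threading pattern, assuming a generic per-neuron update function (the chunking logic and function signature are illustrative, not the existing threading code):

```go
package datapar

import "sync"

// CycleNeurons splits the neurons across nWorkers goroutines, and each goroutine
// iterates over all D data copies for its neurons, keeping memory access local.
func CycleNeurons(nNeurons, d, nWorkers int, neuronFun func(ni, di int)) {
	var wg sync.WaitGroup
	chunk := (nNeurons + nWorkers - 1) / nWorkers
	for w := 0; w < nWorkers; w++ {
		start, end := w*chunk, (w+1)*chunk
		if end > nNeurons {
			end = nNeurons
		}
		wg.Add(1)
		go func(start, end int) {
			defer wg.Done()
			for ni := start; ni < end; ni++ {
				for di := 0; di < d; di++ { // D as the inner loop, per neuron
					neuronFun(ni, di)
				}
			}
		}(start, end)
	}
	wg.Wait()
}
```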

To feed the beast, we need parallel data. For the simple case of static tabular data, this is trivial. For dynamic environments, we need to replicate D copies of the env, and ensure that each one is doing something different (we can use the nice slrand index-based random number generator to keep each one different in a controlled manner). In the existing MPI implementation, the replication happens by having D separate instances of the entire sim running, and then using the MPI proc id to make each one do something different. Here, we just need to manage this all within one process. The key is just to ensure that the Env code is fully encapsulated, and also has the proper knobs for ensuring that different samples come from each copy.
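A sketch of replicating envs within one process -- the Env interface and ReplicateEnvs function are hypothetical stand-ins, and math/rand is used here only as a placeholder for the slrand index-based generator:

```go
package datapar

import "math/rand"

// Env is a minimal stand-in for an environment; real sims have their own types.
type Env interface {
	Init(rnd *rand.Rand) // give each copy its own random stream
	Step()               // advance to the next input pattern
}

// ReplicateEnvs makes D copies of an environment via a factory function,
// seeding each one differently so that every data copy sees different samples
// in a controlled, reproducible way.
func ReplicateEnvs(d int, baseSeed int64, newEnv func() Env) []Env {
	envs := make([]Env, d)
	for di := 0; di < d; di++ {
		e := newEnv()
		e.Init(rand.New(rand.NewSource(baseSeed + int64(di))))
		envs[di] = e
	}
	return envs
}
```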

@siboehm
Member

siboehm commented Feb 13, 2023

  1. Definitely makes sense to have D be the inner loop (this is just how minibatching works in deep learning).
  2. When we first talked about this 2 weeks ago, I didn't realise that we'd have to duplicate a lot of the Synapse state, which makes the whole scheme much less attractive, since most of the memory footprint is due to the synapses and we'd have to duplicate ~50% of it (all the SynCa variables). Example: LVis has 9MB of Neuron state and 1.5GB of Synapse state, so ~1.5GB total. At D=16, we'd get 16x9MB (negligible) + 750MB (SynWt) + 16x750MB (SynCa) = 13GB (worked out below this list).
  3. The CPU should profit from this too, as our NeuronFun multithreading is held back by threading overhead, and going from D=1 to D=16 is like having 16x more Neurons.
  4. Implementing this seems very low priority, since we have so many machines that we're not using and could just MPI-AllReduce across them (which you have already implemented). In fact, I'd argue this would be the first step: Implement distributed data parallelism (DDP) via MPI for boa on egan, to see if DDP even works (=models still converge).
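
The memory estimate in point 2, using the roughly 50/50 split of the 1.5GB of Synapse state into SynWt and SynCa, works out as:

$$
16 \times 9\,\mathrm{MB} \;+\; 750\,\mathrm{MB}\;(\mathrm{SynWt}) \;+\; 16 \times 750\,\mathrm{MB}\;(\mathrm{SynCa}) \;\approx\; 0.14 + 0.75 + 12.0\;\mathrm{GB} \;\approx\; 12.9\,\mathrm{GB}
$$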
