Improve performance of presynaptic parallelism and therefore procedural connectivity #410

Open · neworderofjamie opened this issue Apr 12, 2021 · 10 comments

@neworderofjamie (Contributor) commented Apr 12, 2021

Even with efficient (no heavy-duty RNG-wrangling) procedural connectivity kernels like our Conv2D model, these kernels are latency bound:
[screenshot: vgg_not_memory_bound]
At least we're not memory bound but, with a sense of déjà vu, the profiler shows that the cause of the latency is a new selection of 'stalls':
[screenshot: vgg_warp_states]
Short scoreboard stalls are basically due to instructions waiting on reads from the constant cache and, therefore, these can be isolated as coming from the ISETP (integer compare and set) instruction following the LDC (load constant) instruction (the shaded section in the disassembly indicates the instructions corresponding to the CUDA code highlighted in green):
[screenshot: vgg_source__sass_short_scoreboard_stall]
Wait stalls are inserted around instructions by the compiler to instruct the GPU to wait until the instruction operands are available and, as such, are more or less unavoidable (https://stackoverflow.com/a/66710732/1476754 has a nice description). Essentially, their prevalence here indicates that huge numbers of instructions are being churned through relatively efficiently.

However, what the profiler's not showing here is really the point - all of the time in these kernels (sampled time is the circled column) is being spent on these very low-level issues within the binary search and ID comparison, while the actual compute we care about (even the atomic add operations) is so fast the profiler can't even 'see' it:
[screenshot: vgg_source_2]
This is essentially because, when spiking activity is sparse, presynaptic parallelism results in a lot of unnecessary threads (a thread is launched per presynaptic neuron, whereas the number of spikes is typically much smaller than this). Before they can be killed, these threads need to perform a binary search to figure out which merged group they're in and then compare their ID against the presynaptic spike count. These operations are relatively cheap but, here, they dominate.
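
For reference, the preamble every thread currently executes looks roughly like the following - a hand-written approximation with illustrative names, not GeNN's actual generated code:

__global__ void presynapticUpdate(const unsigned int *mergedGroupStartID,
                                  const unsigned int *groupSpikeCount,
                                  unsigned int numMergedGroups)
{
    const unsigned int id = (blockIdx.x * blockDim.x) + threadIdx.x;

    // Binary search over the merged group start IDs (assumed to begin at 0)
    // to find which merged group this thread belongs to
    unsigned int lo = 0;
    unsigned int hi = numMergedGroups;
    while((hi - lo) > 1) {
        const unsigned int mid = (lo + hi) / 2;
        if(id < mergedGroupStartID[mid]) {
            hi = mid;
        }
        else {
            lo = mid;
        }
    }

    // Compare local ID against the presynaptic spike count; when activity
    // is sparse, the vast majority of threads are killed right here
    const unsigned int lid = id - mergedGroupStartID[lo];
    if(lid >= groupSpikeCount[lo]) {
        return;
    }

    // ...the actual compute (including the atomic adds) happens here...
}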

Sticking plaster solutions would be to optimise the binary search by distributing it amongst the threads of a warp or, for moderately-sized models, to replace the binary search with a simple lookup table (i.e. block ID to population index). However, neither will address the underlying issue, which is that we're launching too many unnecessary threads. Essentially, what will be required is a 'pre-presynaptic update' kernel which performs a prefix sum operation on the numbers of spikes to be processed, thus calculating the number of threads required for the different parts of the kernel. Finally, this kernel could launch the actual presynaptic update kernel with the correct total number of threads using dynamic parallelism, along the lines of the sketch below.
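
A hypothetical sketch of that pre-kernel (all names are illustrative and it assumes compilation with relocatable device code, which dynamic parallelism requires):

// Forward declaration of the (hypothetical) presynaptic update kernel
__global__ void presynapticUpdateKernel(const unsigned int *groupStartThread,
                                        unsigned int numPop);

// Pre-presynaptic update kernel: scans the per-population spike counts to
// calculate per-population thread offsets, then uses dynamic parallelism to
// launch the update kernel with exactly the right number of threads
__global__ void prePresynapticUpdate(const unsigned int *spikeCount,
                                     unsigned int *groupStartThread,
                                     unsigned int numPop)
{
    if(blockIdx.x == 0 && threadIdx.x == 0) {
        // Naive sequential exclusive scan for clarity; a real implementation
        // would use a work-efficient parallel scan
        unsigned int sum = 0;
        for(unsigned int i = 0; i < numPop; i++) {
            groupStartThread[i] = sum;
            sum += spikeCount[i];
        }

        // Launch the presynaptic update with one thread per incoming spike
        const unsigned int blockSize = 128;
        const unsigned int numBlocks = (sum + blockSize - 1) / blockSize;
        if(numBlocks > 0) {
            presynapticUpdateKernel<<<numBlocks, blockSize>>>(groupStartThread, numPop);
        }
    }
}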

Note that all this analysis was performed on a VGG16 model (i.e. 16 layers), so all the merged kernel data structures fit easily in constant cache. In a larger model, I would expect the "short scoreboard" stalls to be replaced by "long scoreboard" stalls, as the merged group start IDs array will have to be moved to global memory.

@tnowotny (Member) commented:

Very interesting, and it looks like there is a lot of potential for improvement here. I didn't get the bit "which performs a prefix sum operation on the number of spikes to be processed" - isn't this information already available in terms of the number of spikes emitted in a timestep? I must be missing something ...?

@neworderofjamie (Contributor, Author) commented:

You do indeed know how many spikes each population has emitted, but you need the cumulative sum of these to map that to threads.

@tnowotny (Member) commented:

Ah, yes of course - across the merged population group ... wouldn't the neuron kernels be able to compute the sum via atomic add upon exit?

@neworderofjamie (Contributor, Author) commented Apr 12, 2021

I think you could theoretically calculate the cumulative sum using (potentially large numbers of) atomic adds (one for each subsequent population) but there are numerous better parallel algorithms which should be easy enough to code-generate (https://www.eecs.umich.edu/courses/eecs570/hw/parprefix.pdf).
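
For illustration, a minimal single-block Hillis-Steele inclusive scan along these lines (unoptimised and with illustrative names; the linked notes cover more work-efficient variants):

// Launch with a single block of >= numPop threads and
// blockDim.x * sizeof(unsigned int) bytes of dynamic shared memory -
// fine for the tens-to-hundreds of populations in a typical model
__global__ void inclusiveScan(const unsigned int *spikeCount,
                              unsigned int *cumulativeSum,
                              unsigned int numPop)
{
    extern __shared__ unsigned int temp[];
    const unsigned int tid = threadIdx.x;
    temp[tid] = (tid < numPop) ? spikeCount[tid] : 0;
    __syncthreads();

    // Each step, every thread adds the value 'offset' places behind it;
    // after log2(blockDim.x) steps, temp holds the inclusive prefix sum
    for(unsigned int offset = 1; offset < blockDim.x; offset *= 2) {
        unsigned int val = 0;
        if(tid >= offset) {
            val = temp[tid - offset];
        }
        __syncthreads();
        temp[tid] += val;
        __syncthreads();
    }

    if(tid < numPop) {
        cumulativeSum[tid] = temp[tid];
    }
}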

@neworderofjamie (Contributor, Author) commented:

Actually, https://www.youtube.com/watch?v=mmYv3Haj6uc is a much better explanation

@tnowotny (Member) commented:

Sure - of course, but are there actually so many populations that it makes any significant difference, I wonder? An atomic add per neuron population seems like a very small addition to what the neuron kernels do ... just a thought ...

@neworderofjamie (Contributor, Author) commented:

The problem is that it wouldn't be a single atomic add per neuron population; it would be more like the following (where pop is the index of the population this thread is processing and numSpikes is how many spikes it emitted):

if(lid == 0) {
    for(unsigned int i = pop; i < npop; i++) {
        atomicAdd(&cumulativeSum[i], numSpikes);
    }
}

@tnowotny (Member) commented:

I was thinking more about the individual populations each having a thread do the job, kind of like this (not sure about the variable names, but I hope it is clear what I mean):

if (spike == 0 && thread == 0) {
    atomicAdd(&cumulativeSum, numSpikes[<pop id>]);
}

... my gut feeling is that the atomic adds aren't that likely to clash, are they?

@neworderofjamie (Contributor, Author) commented:

But that would only give the total number of spikes whereas you need the cumulative sum like:

num_pop0, num_pop0 + num_pop1, num_pop0 + num_pop1 + num_pop2
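
To illustrate with a hypothetical device-side helper (names are mine, not GeNN's): these running totals give each population a disjoint, contiguous range of thread IDs, which a single atomically-accumulated total couldn't:

// cumulativeSum is the running total shown above, i.e. cumulativeSum[i]
// is the number of spikes emitted by populations 0..i (an inclusive scan)
__device__ void getThreadRange(unsigned int pop,
                               const unsigned int *cumulativeSum,
                               unsigned int *firstThread,
                               unsigned int *lastThread)
{
    // Population pop owns the contiguous thread range [firstThread, lastThread)
    *firstThread = (pop == 0) ? 0 : cumulativeSum[pop - 1];
    *lastThread = cumulativeSum[pop];
}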

@neworderofjamie (Contributor, Author) commented Apr 14, 2021

Initial exploration, using a toy GeNN kernel to reproduce an equivalent situation with 50 populations of 5000 neurons and varying levels of activity, looks promising:
[chart: Static thread and Oracle const benchmark results]
'Static thread' refers to the current approach, 'Oracle const' to start indices calculated to match the spikes but still stored in constant cache, and 'Oracle global' is the same but with the indices in global memory (as they will have to be if another kernel is to generate them via a scan operation).
