Improve performance of presynaptic parallelism and therefore procedural connectivity #410

Open · neworderofjamie opened this issue Apr 12, 2021 · 10 comments

@neworderofjamie (Contributor) commented Apr 12, 2021

Even with efficient (no heavy-duty RNG-wrangling) procedural connectivity kernels like our Conv2D model, these kernels are latency bound:
[screenshot: vgg_not_memory_bound]
At least we're not memory bound but, with a sense of déjà vu, the profiler shows that the cause of the latency is a new selection of 'stalls':
[screenshot: vgg_warp_states]
Short scoreboard stalls are basically due to instructions waiting on reads from the constant cache and, therefore, these can be isolated as coming from the ISETP (integer compare and set) instruction following the LDC (load constant) instruction (the shaded section in the disassembly indicates the instructions corresponding to the CUDA code highlighted in green):
[screenshot: vgg_source__sass_short_scoreboard_stall]
Wait stalls are inserted around instructions by the compiler to instruct the GPU to wait until the instruction operands are available and, as such, are more or less unavoidable (https://stackoverflow.com/a/66710732/1476754 has a nice description). Essentially, their prevalence here indicates that huge numbers of instructions are being churned through relatively efficiently.

However, what the profiler's not showing here is really the point - all of the time in these kernels (sampled time is the circled column) is being spent on these very low-level issues within the binary search and ID comparison, while the actual compute we care about (even the atomic add operations) is so fast the profiler can't even 'see' it:
[screenshot: vgg_source_2]
This is essentially because, when spiking activity is sparse, presynaptic parallelism results in a lot of unnecessary threads (a thread is launched per presynaptic neuron, whereas the number of spikes is typically much smaller than this). Before they can be killed, these threads need to perform a binary search to figure out which merged group they're in and then compare their ID against the presynaptic spike count. These operations are relatively cheap but, here, they dominate.
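
For reference, the preamble every thread currently executes looks roughly like the following - a hand-written approximation with illustrative names, not GeNN's actual generated code:

__global__ void presynapticUpdate(const unsigned int *mergedGroupStartID,
                                  const unsigned int *groupSpikeCount,
                                  unsigned int numMergedGroups)
{
    const unsigned int id = (blockIdx.x * blockDim.x) + threadIdx.x;

    // Binary search over the merged group start IDs (assumed to begin at 0)
    // to find which merged group this thread belongs to
    unsigned int lo = 0;
    unsigned int hi = numMergedGroups;
    while((hi - lo) > 1) {
        const unsigned int mid = (lo + hi) / 2;
        if(id < mergedGroupStartID[mid]) {
            hi = mid;
        }
        else {
            lo = mid;
        }
    }

    // Compare local ID against the presynaptic spike count; when activity
    // is sparse, the vast majority of threads are killed right here
    const unsigned int lid = id - mergedGroupStartID[lo];
    if(lid >= groupSpikeCount[lo]) {
        return;
    }

    // ...the actual compute (including the atomic adds) happens here...
}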

Sticking plaster solutions would be to optimise the binary search by distributing it amongst the threads of a warp or, for moderately-sized models, to replace the binary search with a simple lookup table (i.e. block ID to population index). However, neither will address the underlying issue, which is that we're launching too many unnecessary threads. Essentially, what will be required is a 'pre-presynaptic update' kernel which performs a prefix sum operation on the numbers of spikes to be processed, thus calculating the number of threads required for the different parts of the kernel. Finally, this kernel could launch the actual presynaptic update kernel with the correct total number of threads using dynamic parallelism, along the lines of the sketch below.
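
A hypothetical sketch of that pre-kernel (all names are illustrative and it assumes compilation with relocatable device code, which dynamic parallelism requires):

// Forward declaration of the (hypothetical) presynaptic update kernel
__global__ void presynapticUpdateKernel(const unsigned int *groupStartThread,
                                        unsigned int numPop);

// Pre-presynaptic update kernel: scans the per-population spike counts to
// calculate per-population thread offsets, then uses dynamic parallelism to
// launch the update kernel with exactly the right number of threads
__global__ void prePresynapticUpdate(const unsigned int *spikeCount,
                                     unsigned int *groupStartThread,
                                     unsigned int numPop)
{
    if(blockIdx.x == 0 && threadIdx.x == 0) {
        // Naive sequential exclusive scan for clarity; a real implementation
        // would use a work-efficient parallel scan
        unsigned int sum = 0;
        for(unsigned int i = 0; i < numPop; i++) {
            groupStartThread[i] = sum;
            sum += spikeCount[i];
        }

        // Launch the presynaptic update with one thread per incoming spike
        const unsigned int blockSize = 128;
        const unsigned int numBlocks = (sum + blockSize - 1) / blockSize;
        if(numBlocks > 0) {
            presynapticUpdateKernel<<<numBlocks, blockSize>>>(groupStartThread, numPop);
        }
    }
}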

Note that all this analysis was performed on a VGG16 model (i.e. 16 layers), so all the merged kernel data structures fit easily in constant cache. In a larger model, I would expect the "short scoreboard" stalls to be replaced by "long scoreboard" stalls, as the merged group start IDs array will have to be moved to global memory.

@tnowotny (Member) commented:

Very interesting, and it looks like there is a lot of potential for improvement here. I didn't get the bit "which performs a prefix sum operation on the number of spikes to be processed" - isn't this information already available in terms of the number of spikes emitted in a timestep? I must be missing something ...?

@neworderofjamie (Contributor, Author) commented:

You do indeed know how many spikes each population has emitted, but you need the cumulative sum of these to map that to threads.

@tnowotny (Member) commented:

Ah, yes of course - across the merged population group ... wouldn't the neuron kernels be able to compute the sum via atomic add upon exit?

@neworderofjamie (Contributor, Author) commented Apr 12, 2021

I think you could theoretically calculate the cumulative sum using (potentially large numbers of) atomic adds (one for each subsequent population) but there are numerous better parallel algorithms which should be easy enough to code-generate (https://www.eecs.umich.edu/courses/eecs570/hw/parprefix.pdf).
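
For illustration, a minimal single-block Hillis-Steele inclusive scan along these lines (unoptimised and with illustrative names; the linked notes cover more work-efficient variants):

// Launch with a single block of >= numPop threads and
// blockDim.x * sizeof(unsigned int) bytes of dynamic shared memory -
// fine for the tens-to-hundreds of populations in a typical model
__global__ void inclusiveScan(const unsigned int *spikeCount,
                              unsigned int *cumulativeSum,
                              unsigned int numPop)
{
    extern __shared__ unsigned int temp[];
    const unsigned int tid = threadIdx.x;
    temp[tid] = (tid < numPop) ? spikeCount[tid] : 0;
    __syncthreads();

    // Each step, every thread adds the value 'offset' places behind it;
    // after log2(blockDim.x) steps, temp holds the inclusive prefix sum
    for(unsigned int offset = 1; offset < blockDim.x; offset *= 2) {
        unsigned int val = 0;
        if(tid >= offset) {
            val = temp[tid - offset];
        }
        __syncthreads();
        temp[tid] += val;
        __syncthreads();
    }

    if(tid < numPop) {
        cumulativeSum[tid] = temp[tid];
    }
}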

@neworderofjamie (Contributor, Author) commented:

Actually, https://www.youtube.com/watch?v=mmYv3Haj6uc is a much better explanation

@tnowotny (Member) commented:

Sure - of course, but are there actually so many populations that it makes any significant difference, I wonder? An atomic add per neuron population seems like a very small addition to what the neuron kernels do ... just a thought ...

@neworderofjamie (Contributor, Author) commented:

The problem is that it wouldn't be a single atomic add per neuron population; it would be more like the following (where pop is the index of the population this thread is processing and numSpikes is how many spikes it emitted):

if(lid == 0) {
    for(unsigned int i = pop; i < npop; i++) {
        atomicAdd(&cumulativeSum[i], numSpikes);
    }
}

@tnowotny (Member) commented:

I was thinking more about the individual populations each having a thread do the job, kind of like this (not sure about the variable names, but I hope it is clear what I mean):

if (spike == 0 && thread == 0) {
    atomicAdd(&cumulativeSum, numSpikes[<pop id>]);
}

... my gut feeling is that the atomic adds aren't that likely to clash, are they?

@neworderofjamie (Contributor, Author) commented:

But that would only give the total number of spikes whereas you need the cumulative sum like:

num_pop0, num_pop0 + num_pop1, num_pop0 + num_pop1 + num_pop2
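
To illustrate with a hypothetical device-side helper (names are mine, not GeNN's): these running totals give each population a disjoint, contiguous range of thread IDs, which a single atomically-accumulated total couldn't:

// cumulativeSum is the running total shown above, i.e. cumulativeSum[i]
// is the number of spikes emitted by populations 0..i (an inclusive scan)
__device__ void getThreadRange(unsigned int pop,
                               const unsigned int *cumulativeSum,
                               unsigned int *firstThread,
                               unsigned int *lastThread)
{
    // Population pop owns the contiguous thread range [firstThread, lastThread)
    *firstThread = (pop == 0) ? 0 : cumulativeSum[pop - 1];
    *lastThread = cumulativeSum[pop];
}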

@neworderofjamie (Contributor, Author) commented Apr 14, 2021

Initial exploration, using a toy GeNN kernel to reproduce an equivalent situation with 50 populations of 5000 neurons and varying levels of activity, looks promising:
[chart: Static thread and Oracle const benchmark results]
'Static thread' refers to the current approach, 'Oracle const' to start indices calculated to match the spikes but still stored in constant cache, and 'Oracle global' is the same but with the indices in global memory (as they will have to be if another kernel is to generate them via a scan operation).
