
I plan on porting this SDK to CUDA 7.0 from the ancient OpenCL implementation. Few quick questions. #29

Open
LouisCastricato opened this issue Apr 11, 2015 · 17 comments


@LouisCastricato

Are there any gotchas in the original implementation? I plan, at least initially, to make my port as similar as possible to the original code just to get it functioning first (e.g. no dynamic parallelism and minimal shared memory). Anything I should look out for?

I'll probably do this sometime next week and (hopefully) create a pull request before the end of the month.

I'd port it to a newer version of OpenCL, but there's no reason to bother until Vulkan comes out :P

@LouisCastricato
Author

Actually, a few more questions. What are OP_REC2 and OP_REC1? Are you doing a tree traversal with those, and do they show which way it branches next? (Sorry if I create a plethora of questions; porting uncommented code often leads to confusion on all ends haha.)

@stephanemagnenat
Collaborator

I made the current OpenCL version, but it is actually slower than the CPU even for GPUs with many ALUs. The reason is that I used a suboptimal strategy: I reimplemented the stack-based recursion mechanism in OpenCL by creating a software stack, which means that all ALUs in a group wait until the longest execution path has terminated. This is clearly not a good solution.

What is your parallelization strategy for the CUDA implementation?

@stephanemagnenat
Collaborator

Actually, a few more questions. What are OP_REC2 and OP_REC1? Are you doing a tree traversal with those, and do they show which way it branches next?

This code is almost 5 years old, so I have to re-interpret it almost as much as you ;-) The op variable in the simulated recursion stack is actually a form of program counter: it tells, for a given level in the stack, where the function at that level was when it made the recursive call. As the recursion is called from only two places, there are three possible values for the op variable (a rough sketch follows the list):

  • OP_BEGIN_FUNCTION: beginning of function
  • OP_REC1: first recursive call
  • OP_REC2: second recursive call
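
To make that concrete, here is a rough sketch of a traversal driven by such a simulated recursion stack (CUDA-style C++ with made-up helpers and an implicit heap node layout; it illustrates the idea, not the actual OpenCL kernel):

enum Op { OP_BEGIN_FUNCTION, OP_REC1, OP_REC2 };

struct StackEntry {
    int node;  // kd-tree node handled at this stack level
    Op  op;    // "program counter": where this level was when it recursed
};

// Placeholder helpers assuming an implicit, heap-like node layout; the real
// library uses its own node storage.
__device__ bool isLeaf(int node, int nodeCount) { return 2 * node + 1 >= nodeCount; }
__device__ int  leftChild(int node)  { return 2 * node + 1; }
__device__ int  rightChild(int node) { return 2 * node + 2; }

__device__ void traverse(int root, int nodeCount)
{
    StackEntry stack[64];
    int top = 0;
    stack[0] = { root, OP_BEGIN_FUNCTION };

    while (top >= 0) {
        StackEntry &e = stack[top];
        switch (e.op) {
        case OP_BEGIN_FUNCTION:
            if (isLeaf(e.node, nodeCount)) {
                // ... scan the leaf bucket here ...
                --top;                       // "return" from this level
            } else {
                e.op = OP_REC1;              // resume here after the first call
                stack[++top] = { leftChild(e.node), OP_BEGIN_FUNCTION };
            }
            break;
        case OP_REC1:
            e.op = OP_REC2;                  // resume here after the second call
            stack[++top] = { rightChild(e.node), OP_BEGIN_FUNCTION };
            break;
        case OP_REC2:
            --top;                           // both children done: return
            break;
        }
    }
}

The divergence problem described above comes from the outer while loop: every work-item in a group keeps iterating until the deepest traversal in the group has finished.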

Feel free to ask further questions if this answer is not clear. Also, if you make a better strategy in CUDA, it would be great to back-port it to OpenCL.

@stephanemagnenat changed the title from "I plan on porting this SDK to CUDA 7.0 from the ancient OpenCL implementation. Few quick quesitons." to "I plan on porting this SDK to CUDA 7.0 from the ancient OpenCL implementation. Few quick questions." on Apr 11, 2015
@LouisCastricato
Author

Well, if it wasn't already done that way, I planned to change how the trees themselves are stored before being copied to the GPU; I was going to make sure the GPU version uses a linear array of nodes. It seems that a search against the leaves is quite a bad thing to let a GPU try by default (after an hour or two of research).

The approach I was considering would limit tree depth to 22 in most cases, so I need to go back to the drawing board. I was going to recursively launch the kernel every time I followed a branch. I think there is a way to alleviate this limit, but I can't test that since I recently bricked my Titan and only have a 760 to play around with; the 760 supports a nesting depth of at most 22.

I was also considering presorting the search keys, if libnabo doesn't do that already. That alone could yield a 3x performance increase on the GPU side and perhaps an equally large increase on the CPU side, so it would almost certainly improve performance even for the normal CPU path.
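
Something like the following is what I mean by a linear node layout and a copy-once upload (a sketch only; field names such as FlatNode are made up and this is not how libnabo currently stores its tree):

#include <cuda_runtime.h>
#include <vector>

// Hypothetical flat node layout: the whole tree lives in one contiguous
// array, so it can be copied to the GPU once with a single cudaMemcpy.
struct FlatNode {
    float splitValue;   // split position along splitDim (unused for leaves)
    int   splitDim;     // -1 marks a leaf
    int   rightChild;   // index of the right child; the left child is the next element
    int   bucketStart;  // leaves: index of the first point of the bucket
    int   bucketSize;   // leaves: number of points in the bucket
};

// Upload once and forget; every search kernel reuses the same device copy.
FlatNode* uploadTree(const std::vector<FlatNode>& nodes)
{
    FlatNode* dNodes = nullptr;
    cudaMalloc(&dNodes, nodes.size() * sizeof(FlatNode));
    cudaMemcpy(dNodes, nodes.data(), nodes.size() * sizeof(FlatNode),
               cudaMemcpyHostToDevice);
    return dNodes;
}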

I'm not entirely sure. I need to do more research into exactly how the library works.

Oh, just a heads up: I will primarily be testing the code I write for this on Debian. I can loosely ensure that it'll work on Windows, but no solid promises.

@LouisCastricato
Author

Feel free to ask further questions if this answer is not clear. Also, if you make a better strategy in CUDA, it would be great to back-port it to OpenCL.

I would, but I haven't done any work with OpenCL in years :( I'm very rusty, and I know I would do a far better job in CUDA. A few other questions though.

// How and where is tree splitting done? I've seen a fair share of approaches to GPGPU KD trees, and almost always the split length is given to the GPU to help speed things up a fair bit. Is it there, just under a different naming convention?
Never mind, I found it.

Also, I am going to rewrite the sort function as well. It would not perform well at all in CUDA :P (Seeing how there is no sort function haha)

@LouisCastricato
Author

Okay, I've got a really good architecture in mind (senior software architect here, I usually nerd out more about the architecture than the actual code haha). Thought I'd share it.
While constructing the KD tree:

  1. Sort the search keys (really crucial).
  2. Either dynamically shift the bucket size on a per-node basis, or limit it quite strictly, to about 10-15. The larger this bucket size, the more it will hurt the system. I have a few ideas on dynamically shifting the bucket size, most of them based on boosting. For a large number of reference points the build time would be awful, but it should be fine for < 100K reference points if done correctly.

When searching:

  3. Order the query points through some form of clustering mechanism to reduce warp divergence. Use ballot while the kernel is executing to make sure that most threads within a warp follow at least a similar path, and store similar paths in shared memory so they can be reused by other threads in the warp (see the sketch after this list).
  4. When a path proves to be too divergent, launch another kernel via dynamic parallelism and split the query cluster that the warp was executing over. This should take care of a decent bit of the overhead caused by ballot, although we should hope that no more than 2200 paths become divergent.
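
A rough CUDA sketch of the ballot/majority idea in step 3 (CUDA 7.x intrinsics, so __ballot rather than the later __ballot_sync; the kernel signature, the skippedCount bookkeeping and the FlatNode layout from the earlier sketch are all made up, and a real implementation would defer the skipped far sides to a work list or a child kernel):

// The warp votes on the branch to take and follows the majority, so its
// threads stay converged; minority threads would record the node they
// skipped for later handling (here only counted as a placeholder).
__global__ void descendWarpCoherent(const FlatNode* nodes, const float* queries,
                                    int nQueries, int dim,
                                    int* leafOfQuery, int* skippedCount)
{
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= nQueries) return;

    int node = 0;  // root of the flattened tree
    while (nodes[node].splitDim >= 0) {
        bool goRight = queries[q * dim + nodes[node].splitDim] > nodes[node].splitValue;

        unsigned mask = __ballot(goRight);               // warp-wide vote
        bool majorityRight = __popc(mask) * 2 >= warpSize;

        if (goRight != majorityRight)
            atomicAdd(skippedCount, 1);                  // placeholder bookkeeping

        node = majorityRight ? nodes[node].rightChild : node + 1;
    }
    leafOfQuery[q] = node;  // the bucket scan would happen in a later step
}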

I think (at least I hope) this would be roughly 9x faster than the CPU version. There is hopefully very little coalescing needed, since most memory accesses would go through shared memory. The memory overhead should be small, since most of this would be copy-once-and-forget.

There are a few more ways to expand this, though, mostly in how querying is done. I could implement a quick merge sort that runs over the query points before the search function executes. That would be fine for most low-dimensional cases (e.g. k < 20 or so), but beyond that it would be awful and I would highly recommend that some form of clustering algorithm is used instead. (I would advise against SVMs; I always get awful performance no matter what the implementation is.)

This really only works for breadth-first search, though. I was trying to incorporate as much caching as possible in order to reduce potential memory issues with larger trees. If the tree's depth is significant, I can add a second mode for depth-first searching.
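
As a concrete example of the query-ordering step mentioned above, here is a sketch using Thrust's sort_by_key instead of a hand-written merge sort (the key can be anything cheap and spatial, e.g. a Morton code or a coarse cluster id; the function and names are made up):

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

// Reorder query indices so that spatially close queries land in the same
// warp. "keys" can be any cheap spatial key: a Morton code, a coarse
// cluster id, or the leaf reached by a first no-backtracking descent.
// Assumes keys and queryIdx have the same length.
void orderQueries(thrust::device_vector<unsigned>& keys,
                  thrust::device_vector<int>& queryIdx)
{
    thrust::sequence(queryIdx.begin(), queryIdx.end());        // 0, 1, 2, ...
    thrust::sort_by_key(keys.begin(), keys.end(), queryIdx.begin());
    // The search kernel then processes queries in queryIdx order.
}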

@LouisCastricato
Author

I also have a few more very complex ideas; I'm not sure if they'll work, but I want to try them (e.g. GPU-based clustering and broad-phase collision checking against the KD tree: any node that doesn't collide with us is skipped. This could greatly reduce the number of steps it takes to find where we are in the KD tree, but clustering like this may take a while).

Edit: They didn't work :(

@LouisCastricato
Author

Unfortunately OpenCL does not support dynamic parallelism or ballots, although I believe Vulkan will, so this should be easily portable once Vulkan becomes available to the general public.
Either way, this is quite an easy port now that I have a mostly full grasp of what's going on and how to fix it. This is extremely basic ML, so I think I'm going to just write some utility functions rather than using a CUDA library to do it for me. Clustering shouldn't be that hard; I'll probably start off with a simple bounding-sphere approach.

Also:
const T diff = q[cd] - p[cd];
// TODO: use approximate dist
// FIXME: bug in this early out
//if (diff * diff > heapHeadValue(heap))
// continue;

What was the bug?
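
For context, the commented-out test is the standard far-side pruning rule: skip the far child of a split only when the squared distance from the query to the splitting plane already exceeds the current worst candidate distance (the head of the max-heap). A generic, self-contained sketch of that rule (1-NN, host code, made-up node layout; not libnabo's actual implementation):

#include <array>
#include <vector>

// Made-up flat node layout for the sketch; the left child is the next element.
struct Node {
    float splitValue;
    int   splitDim;     // -1 marks a leaf
    int   rightChild;
    int   point;        // leaves: index of the single stored point
};

void nnSearch(const std::vector<Node>& nodes,
              const std::vector<std::array<float, 3>>& pts,
              const std::array<float, 3>& q,
              int nodeIdx, int& best, float& bestDist2)
{
    const Node& n = nodes[nodeIdx];
    if (n.splitDim < 0) {                                  // leaf: test its point
        float d2 = 0.f;
        for (int d = 0; d < 3; ++d) {
            const float t = q[d] - pts[n.point][d];
            d2 += t * t;
        }
        if (d2 < bestDist2) { bestDist2 = d2; best = n.point; }
        return;
    }
    const float diff = q[n.splitDim] - n.splitValue;       // signed distance to the split plane
    const int nearIdx = (diff <= 0.f) ? nodeIdx + 1 : n.rightChild;
    const int farIdx  = (diff <= 0.f) ? n.rightChild : nodeIdx + 1;

    nnSearch(nodes, pts, q, nearIdx, best, bestDist2);     // always search the near side
    if (diff * diff <= bestDist2)                          // the early out: visit the far
        nnSearch(nodes, pts, q, farIdx, best, bestDist2);  // side only when it can still help
}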

And:
I implemented a draft: LouisCastricato@6b0168c

It works (excluding the fact that it crashes on 3/4 of the runs, and most branches don't function haha).

Edit:
https://github.com/MatrixCompSci/libnabo/blob/7f27d7220d3f900cfa44502fb9de0d6e7ea2fa13/nabo/cuda/kd_node.cu Cleaned it up a ton and optimized. Next step is dynamic parallelism :D

@stephanemagnenat
Collaborator

What was the bug?

I do not remember; it seems right when reading it now, but it is four years old.

@LouisCastricato
Author

So I have some news. In most cases it is significantly slower than the CPU, since I can only do so much to prevent divergence, but in some cases it fully matches. It has not yet exceeded CPU performance.

It also doesn't help that the CPU I'm running this on is one of the fastest single-threaded CPUs on the market, and the GPU is quite awful haha. I need to finish hooking it into libnabo, but I'll push after that.

@stephanemagnenat
Collaborator

One strategy that would certainly work well on a GPU is to do multiple searches in parallel without backtracking; that is the strategy used by FLANN. The bad thing is that it leads to non-deterministic and unbounded error, while the epsilon-based approximation allows a controlled and bounded error. But backtracking is really bad on a GPU. I imagine that people have been researching other strategies. I am currently not working on that question any more, but it seems that a Google Scholar search for "kd-tree GPU" yields interesting results.
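
A minimal CUDA sketch of that no-backtracking strategy, reusing the hypothetical FlatNode layout sketched earlier (each thread descends straight to one leaf and scans only that bucket, which keeps warps coherent but leaves the error unbounded; this is an illustration, not FLANN's actual code):

// Approximate 1-NN without backtracking: every thread follows the "near"
// side of each split down to a leaf and brute-forces that leaf's bucket.
// Assumes points were reordered so each leaf's bucket is contiguous.
__global__ void approxNN(const FlatNode* nodes, const float* points,
                         const float* queries, int nQueries, int dim,
                         int* result)
{
    int q = blockIdx.x * blockDim.x + threadIdx.x;
    if (q >= nQueries) return;

    int node = 0;                                  // descend, never backtrack
    while (nodes[node].splitDim >= 0) {
        float v = queries[q * dim + nodes[node].splitDim];
        node = (v <= nodes[node].splitValue) ? node + 1 : nodes[node].rightChild;
    }

    int   best = -1;
    float bestDist2 = 3.402823466e38f;             // FLT_MAX
    for (int i = 0; i < nodes[node].bucketSize; ++i) {
        int p = nodes[node].bucketStart + i;
        float d2 = 0.f;
        for (int d = 0; d < dim; ++d) {
            float t = queries[q * dim + d] - points[p * dim + d];
            d2 += t * t;
        }
        if (d2 < bestDist2) { bestDist2 = d2; best = p; }
    }
    result[q] = best;                              // index of the best candidate found
}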

@LouisCastricato
Author

I tried countless approaches, and nothing seemed to give any real benefit over the CPU implementation :/

It may be possible, but it would require more research than I am willing to dedicate right now. I'll push some sample CUDA code to the main repo if you want, so people can tinker around with it.

@simonlynen
Collaborator

@MatrixCompSci thanks for the status update, and for spending time on this. It would be great if you could push your state and a description of your experiences as a pull request so we can archive it.

Thanks again for spending time on this

@YoshuaNava

@LouisCastricato Hope it is OK to re-surface this issue. Your experiments were very cool and had strong potential to improve the performance of the library (which is already quite good).

I have a few questions. Did you try or consider:

  1. Running a linear search on all the points vs. spatial partitioning through the Kd-tree?
  2. Building the Kd-tree on the CPU and transferring it to the GPU?
  3. Individually timing the different K-NN steps?
  4. Any performance analysis to understand whether you are memory- or compute-bound at the moment?

Thank you very much in advance.

Best,
Yoshua Nava

@LouisCastricato
Author

LouisCastricato commented Jul 21, 2020

Funnily enough, I'm actually working on an LSH (locality-sensitive hashing) variant of this with some folks at ETH Zurich (fitting, since libnabo is also from ETH; I'm collaborating with Bastian Rieck) and a security company.

So for kNN, the method I created here is awful. The vote-ballot architecture does not scale well as you increase k, since the cache coherency penalty is massive.

A linear search through all the points is awful. The key to preserving cache and SIMD coherency is skipping branches; that's why you can use LSH. Let's say that your data is N-dimensional. Provably, you can construct an n << N dimensional coordinate space using LSH such that you can determine the first n < m < N branches that your data falls under.

The dataset I am working with is in the range of 5PB-10PB (can't give specifics, sorry). If we were to construct a tree such that the GPU kernel thread blocks diverged early on, we would be heavily straining the bandwidth of the card as well as the network infrastructure, and losing all hope of SIMD (given an unbalanced tree). That said, assuming we can predict the first m branches, we can greatly reduce the strain on the GPU.

This is the main issue I found when trying to port libnabo to CUDA: the strain was massive due to early divergence. As such, the best way to do this would be to cluster on hashes and then execute kNN.

So, that answers (2): don't use a kd-tree, or at least do not use kd-trees exclusively; you won't get anywhere. When trying to make this parallel, your number one priority is figuring out how to order the queries optimally.

(3) The later steps are by far the slowest, since that's when all the threads have diverged. Increasing performance here requires decreasing divergence early on.

(4) The bound is roughly 50% memory bandwidth and 50% SIMD vs. MIMD. Bastian and the fellow at the security company are discussing more abstract methods for predicting early branches, which I don't know if I am allowed to discuss.

Personally, I am about to start a CS + Math PhD (wow, have things changed since I posted this), so I might eventually come back to something like this, though I doubt it would be libnabo specifically.

Edit: Clustering is faster because it removes the memory bandwidth penalty for the first m steps, and it diminishes the cache coherency penalty after that.
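
For anyone curious, the simplest form of "cluster on hashes" is random-hyperplane LSH: project each point onto a few random directions and keep only the sign bits, so nearby points tend to share a code prefix that can be used to group queries before the tree search. A generic host-side sketch (purely illustrative; this is not the method being developed with the collaborators mentioned above):

#include <cstdint>
#include <random>
#include <vector>

// Draw n random hyperplanes in N dimensions (Gaussian entries); n <= 32 here
// so the code fits in a 32-bit integer.
std::vector<std::vector<float>> makeHyperplanes(int n, int N, unsigned seed = 42)
{
    std::mt19937 gen(seed);
    std::normal_distribution<float> gauss(0.f, 1.f);
    std::vector<std::vector<float>> h(n, std::vector<float>(N));
    for (auto& row : h)
        for (auto& v : row)
            v = gauss(gen);
    return h;
}

// n-bit code: one sign bit per hyperplane. Points that are close in the
// original N-dimensional space are likely to agree on most bits, so sorting
// or bucketing queries by this code groups them cheaply.
std::uint32_t lshCode(const std::vector<float>& x,
                      const std::vector<std::vector<float>>& hyperplanes)
{
    std::uint32_t code = 0;
    for (std::size_t b = 0; b < hyperplanes.size(); ++b) {
        float dot = 0.f;
        for (std::size_t d = 0; d < x.size(); ++d)
            dot += x[d] * hyperplanes[b][d];
        if (dot > 0.f)
            code |= (1u << b);
    }
    return code;
}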

@YoshuaNava

YoshuaNava commented Jul 22, 2020

@LouisCastricato Thank you for your thorough and prompt reply.

I'm writing a few answers/questions below your comments:

So for kNN, the method I created here is awful. The vote-ballot architecture does not scale well as you increase k, since the cache coherency penalty is massive.

  1. Are you trying to perform KNN on a test case similar to the one for which libnabo was designed, i.e. 3D point clouds where you want to find one match per point and the dimension of the points is at most 3? In the original paper that introduced libnabo there is a really good comparison with FLANN, ANN and an octree-based strategy, as well as an explanation of many design aspects: https://hal.archives-ouvertes.fr/hal-01143458/document

  2. Do you aim to run K-NN completely on the GPU? Do you think that's the best strategy?

A linear search through all the points is awful. The key to preserving cache and SIMD coherency is skipping branches; that's why you can use LSH. Let's say that your data is N-dimensional. Provably, you can construct an n << N dimensional coordinate space using LSH such that you can determine the first n < m < N branches that your data falls under.
The dataset I am working with is in the range of 5PB-10PB (can't give specifics, sorry). If we were to construct a tree such that the GPU kernel thread blocks diverged early on, we would be heavily straining the bandwidth of the card as well as the network infrastructure, and losing all hope of SIMD (given an unbalanced tree). That said, assuming we can predict the first m branches, we can greatly reduce the strain on the GPU.

  1. So you are saying that the algorithm is bound by the front-end (branch prediction misses). Does this apply to both CPU and GPU?

This is the main issue I found when trying to port libnabo to CUDA: the strain was massive due to early divergence. As such, the best way to do this would be to cluster on hashes and then execute kNN.

  1. How do you define divergence in this context? Is it an excess of branching in the branch-and-bound process?

Clustering is faster because it removes the memory bandwidth penalty for the first m steps, and it diminishes the cache coherency penalty after that.

  1. You mean point clustering, as an alternative method to partition space, instead of the Kd-tree?

Sorry if some of the questions are too basic. I'm not a K-NN expert, so I'd like to make sure that I understand what you mean in full.

On a separate note, I've run libnabo on 3D point clouds through Intel VTune (hotspots, uarch, mem-access) and Advisor. Something quite interesting is that the footprint of this library is super small, as mentioned in the paper; it is really well designed. But it seems to have some back-end bounds (execution units parallelized across multiple cores through OpenMP taking longer than expected), which become much more influential if the data is not aligned properly (25x more). This is only an initial exploration; I'll resume it once I also gain deeper knowledge at the algorithmic level, since otherwise any low-level optimization might be low-gain.
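
As a small, generic illustration of what "aligned properly" can mean here (not libnabo's API; Eigen's own containers achieve a similar effect through their aligned allocator): allocate the point buffer on a 64-byte boundary so SIMD loads and cache lines are not split.

#include <cstddef>
#include <cstdlib>

// Allocate a dim x count float buffer on a 64-byte boundary (C++17
// std::aligned_alloc requires the size to be a multiple of the alignment).
float* allocAlignedPoints(std::size_t dim, std::size_t count)
{
    std::size_t bytes = dim * count * sizeof(float);
    bytes = (bytes + 63) / 64 * 64;
    return static_cast<float*>(std::aligned_alloc(64, bytes));
}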

Best,
Yoshua Nava

@YoshuaNava

Kind ping @LouisCastricato :)
