KMeans|| on GPU #642

Open
mdymczyk opened this issue Jul 12, 2018 · 0 comments · May be fixed by #650

Currently the KMeans|| initialization algorithm runs on the CPU (https://github.com/h2oai/h2o4gpu/blob/master/src/gpu/kmeans/kmeans_h2o4gpu.cu#L367), which is a major bottleneck when the data is large (for example, the Homesite Kaggle dataset).

It would be beneficial to write a GPU version of it and use it when the data is large enough.
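
For reference, here is a minimal CPU sketch of the oversampling and weighting steps of k-means|| as commonly described, so it is easier to see what a GPU version would need to cover. This is not the h2o4gpu code: the function names, the default `oversampling = 2 * k`, and the fixed `rounds = 5` are assumptions made for illustration, and the final step (reclustering the weighted candidates down to `k` centers, e.g. with k-means++) is left out.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def min_sq_dists(points, centers):
    """For every point, the squared distance to its nearest center."""
    return [min(sq_dist(p, c) for c in centers) for p in points]


def kmeans_parallel_candidates(points, k, oversampling=None, rounds=5, seed=0):
    """Oversampling phase of k-means|| (hypothetical helper, CPU only).

    Returns (candidates, weights). The candidates usually outnumber k and
    still have to be reduced to k centers by a weighted reclustering step.
    """
    rng = random.Random(seed)
    # Assumption: oversampling factor defaults to 2 * k.
    factor = 2 * k if oversampling is None else oversampling

    # Start from one point chosen uniformly at random.
    candidates = [rng.choice(points)]

    for _ in range(rounds):
        d2 = min_sq_dists(points, candidates)
        cost = sum(d2)
        if cost == 0.0:
            break  # every point already coincides with a candidate
        # Each point is sampled independently, with probability
        # proportional to its squared distance to the nearest candidate.
        for p, d in zip(points, d2):
            if rng.random() < min(1.0, factor * d / cost):
                candidates.append(p)

    # Weight each candidate by the number of points closest to it.
    weights = [0] * len(candidates)
    for p in points:
        nearest = min(range(len(candidates)),
                      key=lambda i: sq_dist(p, candidates[i]))
        weights[nearest] += 1
    return candidates, weights
```

The nearest-candidate distance computation in `min_sq_dists` dominates the cost in this sketch, so it is the natural part to move to the GPU first.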

Things to take into account:

  1. Test whether it is still worth running on the CPU in certain cases, or whether we should always run it on the GPU when the rest of the algorithm also runs on the GPU.
  2. If possible, pass the data to the GPU only once and use it for both kmeans|| and the rest of the algorithm, so we don't move the data repeatedly.
  3. kmeans|| can be very memory hungry, especially when the number of clusters is large, because it calculates distances for p * k clusters, where k is the number of clusters specified by the user and p is a probability-based variable larger than 1. This might be one reason to keep the calculations on the CPU in some cases.
  4. Run benchmarks when done.
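
To make point 3 concrete, a rough back-of-the-envelope estimate is sketched below. It assumes the worst case where a full rows × (p * k) distance matrix is materialized in 4-byte floats; an implementation that only keeps a running minimum per row would need far less. The function names and the example numbers are hypothetical and are not taken from h2o4gpu.

```python
import math


def distance_matrix_bytes(n_rows, k, p, bytes_per_value=4):
    """Bytes needed to hold distances from every row to p * k candidates."""
    n_candidates = math.ceil(p * k)
    return n_rows * n_candidates * bytes_per_value


def rows_per_batch(k, p, budget_bytes, bytes_per_value=4):
    """How many rows fit in a memory budget if distances are computed in batches.

    A result of 0 means even a single row of distances exceeds the budget.
    """
    row_bytes = math.ceil(p * k) * bytes_per_value
    return budget_bytes // row_bytes


# Illustrative numbers only: 1M rows, k = 1000, p = 2.
print(distance_matrix_bytes(1_000_000, 1000, 2.0))  # 8000000000 (about 8 GB)
print(rows_per_batch(1000, 2.0, 1_000_000_000))     # 125000 rows per 1 GB
```

A check like this could feed the CPU-versus-GPU decision in point 1, or be used to pick a batch size so the data stays on the GPU as suggested in point 2.
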
@trivialfis linked a pull request on Jul 20, 2018 that will close this issue